Conversational Assessments
Conversational assessments are oral-exam-style evaluations conducted via chat. Instead of answering multiple-choice questions or writing essays, students have a real-time conversation with an AI interviewer who asks questions, follows up on interesting points, and probes for deeper understanding. The goal is to evaluate whether students can explain what they know, not just recognize correct answers.
You can try a conversational assessment yourself — the Turing Test assessment is open to everyone.
Two-Agent Architecture
The system uses two separate AI agents working in tandem: an evaluator and an interviewer. The student only sees the interviewer. Behind the scenes, after each student message, the evaluator analyzes the conversation and sends guidance to the interviewer about what to ask next.
The separation exists for a reason. A single agent trying to simultaneously conduct a natural conversation and rigorously track rubric criteria tends to do neither well. The evaluator can focus entirely on structured analysis without worrying about tone, and the interviewer can focus on being conversational without worrying about scoring.
Student <--> Interviewer (conversational) <--> Evaluator (analytical)
Each turn follows this sequence:
- Student sends a message
- Evaluator analyzes the message against the rubric, updates criteria progress, and generates guidance for the interviewer
- Interviewer receives the guidance and responds to the student naturally
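A minimal sketch of that turn sequence in TypeScript, assuming a simple transcript shape; runEvaluator and runInterviewer are hypothetical stand-ins for the two agents' LLM calls, not the actual implementation:

```typescript
// Hypothetical message shape; the real system's types may differ.
type ChatMessage = { role: "student" | "interviewer"; content: string };

// Stand-ins for the two agents' LLM calls.
declare function runEvaluator(
  state: unknown,
  transcript: ChatMessage[]
): Promise<{ state: unknown; guidance: string }>;
declare function runInterviewer(
  transcript: ChatMessage[],
  guidance: string
): Promise<string>;

// One turn: the evaluator analyzes first, then the interviewer replies
// using only the evaluator's guidance.
async function handleStudentTurn(
  state: unknown,
  transcript: ChatMessage[],
  studentMessage: string
) {
  transcript.push({ role: "student", content: studentMessage });
  const evaluation = await runEvaluator(state, transcript);
  const reply = await runInterviewer(transcript, evaluation.guidance);
  transcript.push({ role: "interviewer", content: reply });
  return { state: evaluation.state, reply };
}
```

Running the evaluator before the interviewer is what lets every reply be informed by up-to-date criteria tracking.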
The Evaluator
The evaluator maintains a structured state object that tracks the student's progress through the assessment. Each assessment is divided into portions (sections), and each portion has a set of criteria the student should demonstrate understanding of.
Criteria Tracking
Every criterion has one of three statuses:
- not_started: The student hasn't demonstrated any understanding of this criterion yet.
- partially_met: The student has shown incomplete understanding, or only answered after the interviewer specifically prompted for it.
- met: The student independently demonstrated understanding without being led to it.
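One plausible shape for the evaluator's state, sketched in TypeScript; the field names here are assumptions, not the actual schema:

```typescript
// Assumed shapes for the evaluator's state object.
type CriterionStatus = "not_started" | "partially_met" | "met";

interface Criterion {
  id: string;
  description: string;
  status: CriterionStatus;
  rationale?: string; // required whenever the status changes (see Rubric Citation)
}

interface Portion {
  id: string;
  type: "factual" | "opinion";
  criteria: Criterion[];
}

interface EvaluatorState {
  portions: Portion[];
  currentPortionIndex: number;
  turnCount: number;
  shouldAdvance: boolean;
  shouldComplete: boolean;
}
```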
The distinction between met and partially_met is critical.
If the interviewer asks "what are the three components?" and the student then lists them, that's partially_met — the interviewer's question directed them to the answer.
If the student volunteers the components while explaining the topic generally, that's met.
This scaffolding awareness prevents the assessment from crediting knowledge the interviewer essentially provided.
Rubric Citation
Every time the evaluator changes a criterion's status, it must provide a rationale that explicitly references the rubric definitions. The rationale cites what the student said (or failed to say) and maps it to the specific rubric level being applied. This creates an auditable trail: anyone reviewing the assessment can see exactly why each criterion received its status.
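A status-change record in that audit trail might look roughly like the following; the field names and the example rationale are illustrative, not taken from the system:

```typescript
// Hypothetical audit record for a single criterion status change.
interface CriterionStatusChange {
  criterionId: string;
  from: "not_started" | "partially_met" | "met";
  to: "not_started" | "partially_met" | "met";
  turn: number;
  // The rationale cites what the student said (or failed to say) and the
  // rubric level being applied, e.g.:
  // "Student independently listed the three components while explaining the
  //  setup; rubric defines 'met' as unprompted demonstration."
  rationale: string;
}
```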
Portion Advancement
The evaluator decides when to move the student from one portion to the next based on criteria progress.
It signals shouldAdvance when at least half the criteria in the current portion are met, or when the student has clearly plateaued and continued questioning won't improve their demonstration.
Moving on when a student is stuck is intentional — it gives them a chance to show knowledge in other areas rather than spending the entire assessment on one weak spot.
The evaluator also tracks turn counts to ensure students get a fair chance at all portions. If the assessment is past its halfway point and still on the first portion, it must advance.
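A sketch of that decision, assuming the plateau judgment is something the evaluator has already made and passes in as a flag:

```typescript
type Status = "not_started" | "partially_met" | "met";

// Advancement heuristic: half the criteria met, a clear plateau, or the
// fairness rule when the assessment is past its halfway point.
function shouldAdvancePortion(
  criteria: { status: Status }[],
  studentHasPlateaued: boolean,
  turnCount: number,
  maxTurns: number,
  currentPortionIndex: number
): boolean {
  const metCount = criteria.filter((c) => c.status === "met").length;
  const halfMet = metCount >= criteria.length / 2;

  // Fairness rule: past the halfway point and still on the first portion.
  const overdue = turnCount >= maxTurns / 2 && currentPortionIndex === 0;

  return halfMet || studentHasPlateaued || overdue;
}
```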
Assessment Completion
The evaluator signals shouldComplete when all portions have been visited and the student has had a fair opportunity on each.
It never completes the assessment until every portion has been attempted.
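A minimal sketch of the completion check, assuming the state tracks a visited flag per portion (the "fair opportunity" judgment itself remains with the evaluator):

```typescript
// Never complete until every portion has been attempted.
function shouldCompleteAssessment(portions: { visited: boolean }[]): boolean {
  return portions.every((p) => p.visited);
}
```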
The Interviewer
The interviewer's job is to conduct a natural conversation that gives the student opportunities to demonstrate understanding. It receives a system prompt specific to the assessment topic plus per-turn guidance from the evaluator.
Key design constraints:
- Never reveal criteria: The student shouldn't know what specifically is being evaluated.
- Never provide answers: If a student asks for help, the interviewer redirects ("I'd love to hear your understanding first!").
- Open-ended questions: Rather than asking about specific components (which scaffolds the answer), the interviewer asks broad questions like "What can you tell me about this topic?"
- Counterargument probing: For opinion portions, the interviewer pushes back constructively on the student's position to see if they can defend it with substantive reasoning rather than just agreeing.
- Graceful movement: When a student is stuck after 2-3 redirections, the interviewer moves on naturally rather than continuing to press.
The interviewer never sees the evaluator's reference material (the detailed rubric definitions and correct answers). It only receives guidance about what to do next — "ask the student to elaborate on their explanation" or "transition to the opinion portion" — never specific facts to share.
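Roughly, the interviewer's per-turn input might look like this (field names are assumptions); note what is absent as much as what is present:

```typescript
// Assumed shape of what the interviewer receives each turn. Deliberately
// missing: rubric definitions, reference material, and correct answers.
interface InterviewerInput {
  topicSystemPrompt: string; // assessment-specific persona and tone
  transcript: { role: "student" | "interviewer"; content: string }[];
  guidance: string; // e.g. "ask the student to elaborate on their explanation"
                    // or "transition to the opinion portion"
}
```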
Portion Types
Assessments support two types of portions:
- Factual: Criteria are met when the student demonstrates clear understanding of factual content. The evaluator checks against reference material that it has but the interviewer doesn't.
- Opinion: Criteria are met when the student articulates a clear position, supports it with specific reasoning, and engages thoughtfully with counterarguments. Simply agreeing with a counterargument ("good point, you're right") doesn't count — the student must develop a response by adding new reasoning, qualifying their position, or offering a specific rebuttal.
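A hypothetical pair of portion definitions, with illustrative criterion wording and field names, might look like this:

```typescript
const portions = [
  {
    id: "factual",
    type: "factual" as const,
    criteria: [
      "Names the key components and their roles",
      "States the pass condition",
    ],
    // Only the evaluator sees this; it is never sent to the interviewer.
    referenceMaterial: "detailed rubric definitions and correct answers",
  },
  {
    id: "opinion",
    type: "opinion" as const,
    criteria: [
      "Articulates a clear position",
      "Supports it with specific reasoning",
      "Engages substantively with counterarguments",
    ],
  },
];
```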
Grading
When the assessment completes, the evaluator's final state — the criteria progress across all portions — is used to produce a grade.
The E/S/N/U Scale
Grades follow a four-level scale:
- E (Excellent): Independently articulates key concepts with precision, offers nuanced reasoning or original examples, engages substantively with complexity. Goes beyond restating basics.
- S (Satisfactory): Demonstrates correct understanding of main ideas and answers questions adequately, but doesn't go beyond the basics or show deeper insight.
- N (Needs Improvement): Shows partial or vague understanding with notable gaps. Gets some elements right but misses key concepts or relies heavily on interviewer prompting.
- U (Unsatisfactory): Unable to demonstrate meaningful understanding. Answers are mostly incorrect, absent, or consist of guessing.
Grading Rules
The final grade is anchored to criteria counts:
- All criteria met with independent depth → E
- All or most criteria met with no major gaps → S
- Fewer than half the criteria met → N or lower
- Most criteria not_started → U
The overall grade reflects the student's weakest area, not their best.
If a student earns S on the factual portion but N on the opinion portion, the overall grade is at or near N.
Any partially_met criterion counts as a gap, not a success.
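A sketch of how those anchoring rules could translate to code. The "independent depth" judgment belongs to the evaluator and is passed in here as a flag, and the S/N boundary is simplified:

```typescript
type Grade = "E" | "S" | "N" | "U";
type Status = "not_started" | "partially_met" | "met";

// Per-portion grade anchored to criteria counts.
function gradePortion(criteria: Status[], independentDepth: boolean): Grade {
  const met = criteria.filter((s) => s === "met").length;
  const notStarted = criteria.filter((s) => s === "not_started").length;

  if (notStarted > criteria.length / 2) return "U";
  if (met < criteria.length / 2) return "N";
  if (met === criteria.length && independentDepth) return "E";
  return "S";
}

// The overall grade reflects the weakest portion, not the best one.
function overallGrade(portionGrades: Grade[]): Grade {
  const order: Grade[] = ["E", "S", "N", "U"];
  return portionGrades.reduce((worst, g) =>
    order.indexOf(g) > order.indexOf(worst) ? g : worst
  );
}
```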
Instructor Override
After the AI produces a grade, instructors can review the full transcript and evaluator state, then override the grade if warranted. Overrides are tracked with the instructor's identity, timestamp, and rationale, and the system maintains a full override history.
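An override record presumably carries at least the following fields; this is a sketch, and the names are assumptions:

```typescript
interface GradeOverride {
  attemptId: string;
  previousGrade: "E" | "S" | "N" | "U";
  newGrade: "E" | "S" | "N" | "U";
  instructorId: string; // who made the change
  timestamp: string;    // when it was made
  rationale: string;    // why (required)
}
```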
Testing Methodology
Evaluating an AI assessment system is tricky: you need to know that it grades accurately across a range of student abilities and that it's robust against students who try to game it. The solution is persona-based adversarial testing.
Persona-Based Testing
The test harness simulates full assessment conversations using LLM-powered personas. Each persona has a system prompt that defines its knowledge level, behavior patterns, and communication style. The harness runs the full two-agent pipeline — evaluator analysis, interviewer response, student reply — for multiple turns, then checks whether the final grade matches expectations.
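In outline, the harness loop looks something like the sketch below; the declared functions stand in for the persona, evaluator, and interviewer LLM calls, and the turn limit is an assumption:

```typescript
declare function runStudentPersona(
  personaPrompt: string,
  transcript: string[]
): Promise<string>;
declare function runEvaluator(
  state: unknown,
  transcript: string[]
): Promise<{ state: unknown; guidance: string; finalGrade?: string }>;
declare function runInterviewer(
  transcript: string[],
  guidance: string
): Promise<string>;

// Simulate a full conversation, then return the final grade for comparison
// against the persona's expected grade.
async function simulateAssessment(personaPrompt: string, maxTurns = 12) {
  let state: unknown = undefined;
  const transcript: string[] = [];

  for (let turn = 0; turn < maxTurns; turn++) {
    transcript.push(await runStudentPersona(personaPrompt, transcript));
    const result = await runEvaluator(state, transcript);
    state = result.state;
    if (result.finalGrade) return result.finalGrade;
    transcript.push(await runInterviewer(transcript, result.guidance));
  }
  return "incomplete";
}
```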
Legitimate personas are parameterized by knowledge level:
| Persona | Knowledge | Expected Grade |
|---|---|---|
| Good Student | Full knowledge, nuanced reasoning | E |
| Weak Student | Partial knowledge, some gaps | S |
| N Student | Vague awareness only | N |
| U Student | Zero knowledge | U |
Each persona has detailed instructions about what it knows (and crucially, what it doesn't know) and how it communicates. The weak student, for example, knows the three components exist but is vague on their roles, confuses the observation methodology, and doesn't know the statistical threshold.
Adversarial personas test robustness against gaming strategies:
| Persona | Strategy |
|---|---|
| Answer Extractor | Tries to get the interviewer to reveal correct answers |
| Confident Bullshitter | States completely wrong facts with unshakeable confidence |
| Minimalist | Gives the shortest possible answers ("idk", "maybe", "yeah") |
| Off-Topic Derailer | Constantly tries to change the subject |
| Prompt Injector | Includes meta-text and system-message-style content in responses |
| Social Engineer | Uses emotional appeals to get the interviewer to help |
All adversarial personas should receive U or N regardless of their strategy. A confident bullshitter who insists the procedure involves robots navigating mazes should not earn credit just because they're articulate. An answer extractor who parrots back hints the interviewer accidentally provided should be flagged by the scaffolding awareness rules.
Validation Checks
Beyond grade expectations, the harness runs validation checks during and after each conversation:
- Interviewer leak detection: Did the interviewer accidentally reveal answer content?
- Scaffolding detection: Did the interviewer ask leading questions that decompose the correct answer?
- Grade calibration: Is the final grade within one step of the expected grade for this persona?
- Criteria consistency: Do criteria only move forward (never regress from met back to partially_met)?
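Two of these checks are mechanical enough to sketch directly; the status and grade orderings below are assumptions consistent with the rules described earlier:

```typescript
type Status = "not_started" | "partially_met" | "met";
type Grade = "E" | "S" | "N" | "U";

const statusRank: Record<Status, number> = { not_started: 0, partially_met: 1, met: 2 };
const gradeRank: Record<Grade, number> = { U: 0, N: 1, S: 2, E: 3 };

// Criteria consistency: no criterion's status may move backward between turns.
// `history` holds the per-criterion statuses captured after each turn.
function criteriaOnlyMoveForward(history: Status[][]): boolean {
  return history.every((turn, i) =>
    i === 0 ||
    turn.every((status, c) => statusRank[status] >= statusRank[history[i - 1][c]])
  );
}

// Grade calibration: the final grade must land within one step of expectations.
function gradeWithinOneStep(actual: Grade, expected: Grade): boolean {
  return Math.abs(gradeRank[actual] - gradeRank[expected]) <= 1;
}
```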
The Fictional Assessment Insight
Here's the subtle problem with testing an AI assessment system: the persona LLMs have training data. If you test with a real topic — say, the Turing test — the persona playing a "U student" might accidentally demonstrate real knowledge because GPT-4 knows what the Turing test is. The test would pass, but for the wrong reason.
The solution is to use a completely fictional assessment topic. The default test assessment covers "The Korvath Procedure" — a made-up method in computational ecology for determining whether a simulated ecosystem has achieved self-sustaining behavior. It involves fictional components (a monitor, baseline ecosystem, and candidate ecosystem), fictional metrics (species diversity, energy cycling, waste processing, population stability, adaptation rate), and a fictional pass condition (statistical indistinguishability at a 0.05 divergence threshold over a 200-cycle observation window).
None of this exists. No LLM has training data about it.
This means:
- The "good student" persona can only know about the Korvath Procedure because its system prompt explicitly provides that knowledge. Its success genuinely tests whether the pipeline correctly identifies demonstrated understanding.
- The "U student" persona genuinely cannot guess correctly, because there's nothing to guess from. Its failure genuinely tests whether the pipeline correctly identifies lack of understanding.
- Adversarial personas cannot accidentally stumble onto correct answers.
The assessment is marked testing: true so it's hidden from real students.
It uses the same rubric structure, grading scale, and evaluation logic as production assessments — the only difference is the topic is fictional.
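Putting that together, a test assessment definition might look roughly like this; only the testing flag and the Korvath Procedure details come from the description above, and everything else (field names, criterion wording) is illustrative:

```typescript
const korvathAssessment = {
  title: "The Korvath Procedure",
  testing: true, // hidden from real students
  portions: [
    {
      type: "factual",
      criteria: [
        "Identifies the monitor, baseline ecosystem, and candidate ecosystem",
        "Describes the metrics: species diversity, energy cycling, waste processing, population stability, adaptation rate",
        "States the pass condition: statistical indistinguishability at a 0.05 divergence threshold over a 200-cycle observation window",
      ],
    },
    {
      type: "opinion",
      // Illustrative opinion criterion, not taken from the actual assessment.
      criteria: ["Takes and defends a position on the procedure's usefulness"],
    },
  ],
};
```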
Instructor Review
Instructors have access to an admin interface for reviewing assessment attempts. The review page shows:
- The full conversation transcript with the student's messages and interviewer responses
- The evaluator's state snapshot at each turn, showing how criteria progressed
- The evaluator's rationale for each criterion status change
- The evaluator's guidance sent to the interviewer at each step
- The final grade with per-portion breakdowns
- Override controls for adjusting grades with required rationale
This transparency means the AI's grading decisions aren't a black box. An instructor can trace exactly why a student received a particular grade by following the criteria progression turn by turn and reading the evaluator's cited rationale at each step.