What agentic assessments reveal about learning

March 13

AI now completes most traditional assessments to a standard that is, at minimum, above average. If you visit LinkedIn frequently, you’ve likely heard about Einstein and the agentic browsers that can log into your faculty’s LMS, automate the “busywork” for students, review your lecture notes, and solve your assignments.

Our responses have failed. AI detectors never worked, and “OpenAI's Operator and Deep Research models can generate, edit, and iteratively refine a document over time, mimicking human drafting process”, which ultimately makes revision histories indistinguishable from those of human writers.

The deeper issue isn't cheating though. It's that we've lost our primary mechanism for knowing whether learning happened.

A promising solution that has gained popularity is a two-lane approach to assessment: Lane 1 for assessments that verify human capability through controlled, face-to-face interactions, and Lane 2 for assessments that embrace AI as a learning tool.

Oral assessments as the integrity layer

In Lane 1, oral assessments serve as the most defensible verification mechanism, and for good reason. The evidence base is growing. Angela Sun and Helen McGuire ran Interactive Oral Assessments (IOAs) for 400+ students in STEM at the University of Sydney and found that IOA performance closely tracks secured final-exam results, supporting the validity of the format at scale.

But oral assessments are limited to a single interaction model, a model that requires practice, preparation, and significant upskilling of both students and educators. 

In a master’s entrepreneurship course at Hong Kong Baptist University, we ran an experiment.

Students learning customer research methods were given a detailed case study: a university student named Sam who struggled with a chaotic morning routine that made him chronically late. The students' task: understand Sam’s problem, identify the root cause, develop a business-driven solution, and validate problem-solution fit by interviewing Sam.

Sam was an AI agent.

The agent operated on a structured playbook designed to mirror how real customer conversations unfold. It responded truthfully to discovery questions and encouraged students to dig into root causes, but it also had two deliberate pressure mechanisms built in.

The first was assumption negation. When students stated a confident assumption about Sam and his routine, Sam would occasionally correct the record and steer the conversation in a slightly different direction. This forced students to separate what they assumed the customer wanted from what the customer actually wanted: the foundational discipline of customer research.

The second was a solution stress test. When students pitched their idea, Sam would oppose parts of it, citing budget constraints or throwing other curveballs (e.g. lack of time). These constraints weren't arbitrary. They separated viable business ideas from classroom exercises and, more importantly, required instant human judgement.
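To make the mechanism concrete, here is a minimal sketch of how a playbook like Sam's could be encoded for an LLM-backed agent. It is illustrative only, not our actual implementation: the class, field names, and the specific facts are assumptions, and the generated system prompt would be passed to whichever model powers the agent.

```python
# Hypothetical sketch of an agent playbook: a persona, ground-truth facts the agent
# answers discovery questions from, and the two pressure rules described above.
from dataclasses import dataclass, field


@dataclass
class AgentPlaybook:
    persona: str
    ground_truth: dict[str, str]              # facts used to answer discovery questions truthfully
    pressure_rules: list[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        facts = "\n".join(f"- {k}: {v}" for k, v in self.ground_truth.items())
        rules = "\n".join(f"- {r}" for r in self.pressure_rules)
        return (
            f"You are role-playing {self.persona}. Stay in character at all times.\n"
            f"Answer discovery questions truthfully, using only these facts:\n{facts}\n"
            f"Apply these pressure rules throughout the conversation:\n{rules}"
        )


sam = AgentPlaybook(
    persona="Sam, a university student whose chaotic morning routine makes him chronically late",
    ground_truth={
        "budget": "a limited student budget",
        "time": "very little spare time in the morning",
        "apps": "does not want another app constantly reminding him what to do",
    },
    pressure_rules=[
        # Pressure mechanism 1: assumption negation
        "If the student states a confident assumption that conflicts with the facts, "
        "correct it and steer the conversation slightly away from what they expected.",
        # Pressure mechanism 2: solution stress test
        "When the student pitches a solution, oppose at least part of it, "
        "citing budget or time constraints.",
    ],
)

print(sam.to_system_prompt())  # this text would become the system message for the agent
```

The point of writing it down this way is that the two pressure mechanisms are explicit, calibrated rules rather than behaviour we hope emerges on its own.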

What agentic assessments reveal

The transcripts were revealing, and what we learned mapped directly onto the playbook's pressure points.

When Sam negated an assumption, strong students didn't panic. They treated it as new information and revised their mental model of the customer on the spot. When Sam rejected a solution on budget grounds, they didn't just lower the price. They asked what Sam would pay for, explored alternative delivery models, and sometimes restructured their entire value proposition.

They demonstrated what we'd call agency (autonomous decision-making within the conversation) and entrepreneurial sense.

Weaker students struggled with or ignored the resistance. They missed chances to probe deeper and instead moved on more quickly. When an assumption was negated, they wouldn’t adjust their understanding but would continue with the next question on their prepared script.

The gap between students who had genuinely internalized customer research methodology and those operating at surface level was immediately visible — not through a written deliverable that could be polished by AI, but through the live, unscripted texture of how they handled adversity.

The integrity argument for oral assessments is strong, but reducing agentic assessments to an anti-cheating measure misses the larger point. A stronger framing is that they counteract a subtler risk: using AI purely for its generative capabilities can make students feel competent without building actual competence. Agentic assessments can invert this dynamic. They expose students to the productive discomfort of real-time thinking.

And there's a pedagogical dividend too, which our post-interview survey data shows.

We surveyed 19 of the participating students after the exercise. All of them said they'd want to try AI simulations again. On a scale of 1–10, students rated the experience an average of 7.6 for improving their confidence in conducting real-world customer interviews.

But the qualitative responses told the richer story.

Multiple students fundamentally changed their product idea because of the interview. One group pivoted from a notification app to an autonomous assistant after Sam pushed back that he didn’t "want another app constantly reminding me what to do." Another group, aiming to help Sam make faster clothing decisions in the morning, redesigned their high-tech adaptive clothing into a simple double-layer concept after hitting Sam's budget filter.

Students also found the format itself valuable. Several said the AI interview made them less nervous than role-playing with classmates and valued its consistency: "Talking with the AI felt more realistic than role-playing with classmates. The answers came quickly and stayed in character as Sam. It felt like talking to a real user instead of acting. With classmates, we sometimes laugh or go off topic. With AI, the focus stayed on the problem."

The persistent objection to oral assessment is scale. Sun and McGuire demonstrated that IOAs are feasible with the right infrastructure, but that infrastructure is not a given, and the format still requires significant educator time per student.

Agentic assessments can change the calculus: AI agents can engage multiple students simultaneously across any scenario (e.g. interviews, negotiations, Socratic rebuttals, debates), delivering consistent, unbiased pressure with no fatigue effects or interpersonal dynamics skewing the difficulty.
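As a rough illustration of why the calculus changes, the sketch below runs several interview sessions concurrently against one shared playbook prompt. Everything here is hypothetical: the names are invented and the stubbed agent_reply function stands in for whichever model call a real deployment would make.

```python
# Hypothetical sketch: one calibrated playbook, many simultaneous student sessions.
import asyncio

PLAYBOOK_PROMPT = "You are role-playing Sam..."  # e.g. the output of the playbook sketch above


async def agent_reply(system_prompt: str, transcript: list[dict]) -> str:
    # Stand-in for a real model call; it is I/O-bound, so sessions overlap freely.
    await asyncio.sleep(0.1)
    return "in-character reply from Sam"


async def run_session(student_id: str, questions: list[str]) -> list[dict]:
    transcript: list[dict] = []
    for q in questions:
        transcript.append({"student": q})
        transcript.append({"agent": await agent_reply(PLAYBOOK_PROMPT, transcript)})
    return transcript


async def main() -> None:
    cohort = {
        "student_01": ["What does your morning usually look like?"],
        "student_02": ["Would you pay for an app that plans your morning?"],
    }
    transcripts = await asyncio.gather(
        *(run_session(sid, qs) for sid, qs in cohort.items())
    )
    print(f"{len(transcripts)} sessions completed")  # transcripts become the evidence educators review


asyncio.run(main())
```

Every student meets the same persona under the same rules, which is where the consistency and the absence of fatigue effects come from.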

The educator's role shifts from conducting every assessment to designing the scenario, calibrating the agent playbook, and evaluating evidence of learning — a more efficient use of expertise.

We’re still early. The evidence base for agentic assessment is small but promising, and the directional signal is clear. The sector is moving toward oral assessment as a critical integrity layer. Research supports it; policy frameworks endorse it. The remaining barrier is operational.

Agentic assessments are one answer. Not the only answer, but certainly a compelling one because they don't just solve an integrity problem. They create a better learning experience.

They force students into the kind of real-time, adaptive thinking that no written submission can capture and that AI cannot assist with. The type of thinking that most closely mimics the world we're supposed to prepare our students for.

 

–––

 

References

 

https://futurism.com/artificial-intelligence/ai-agent-canvas-homework

https://needednowlt.substack.com/p/ten-persistent-academic-integrity

https://needednowlt.substack.com/p/why-banning-ai-in-assessment-is-all

https://educational-innovation.sydney.edu.au/teaching@sydney/what-we-learned-from-400-interactive-oral-assessments-in-stem/

https://needednowlt.substack.com/p/what-to-do-about-artificial-intelligence