Simulating and Evaluating Agentic Systems
Most teams building agentic systems know they need some way to test them. An agent interprets ambiguous input, picks actions in a loop, maintains state across many steps, and has to land in the right place at the end. The conversational case is the most common, but the same problem shape shows up wherever an agent operates over time, whether by talking, driving a UI, or controlling a physical surface. You cannot eyeball any of it. So people reach for what they already know.
The most common starting point is the golden dataset: fixed prompts, fixed expected outputs, sometimes a few scripted turns. This is useful for regression on known patterns, but conversations branch. An agent might ask for two missing values one at a time or both at once. A customer might push back on a denial or accept it immediately. The moment the agent legitimately diverges from the script, the test breaks, not because anything went wrong, but because the fixture was too rigid. You end up either over-constraining the agent or writing baroque expected-output logic that becomes its own source of bugs.
Once the golden dataset starts creaking, teams usually bolt on LLM-as-judge. Hand the transcript to a strong model, ask it to rate quality on some rubric. Work like MT-Bench and Chatbot Arena established that LLM judges agree with human preferences at a reasonable rate, which is the case for adopting them. The same research documents the catch: position bias, verbosity bias, self-enhancement bias, low intra-rater reliability, and sensitivity to how you word the scoring prompt. A single judge pass is noisy. Treat it as ground truth in CI and you are going to have a bad time.
Another common move is asserting against what the agent says. Did it mention the refund? Did it apologize? This feels intuitive but is dangerous. An agent can claim it checked a policy without querying the policy tool. It can say it processed a return without invoking the return action. Natural language is not the system of record. Assertions that only inspect prose test the agent's ability to narrate, not its ability to act.
None of these are wrong in isolation. Golden datasets catch regressions. LLM judges surface quality issues. Output assertions verify communication. The problem is treating any one of them as the whole solution.
What sim/eval is for
The useful question is what sim/eval tests that other methods cannot.
Decision branching. Ambiguity handling. Tool use in context. Recovery from misunderstanding. End-to-end task resolution under realistic variation. That is the part of an agentic system that only shows up when actions, tools, policies, and state collide in sequence across many steps. Unit tests cannot reach it. Contract tests cannot reach it. Integration tests on individual components cannot reach it.
Which also means sim/eval is one layer in a stack, not a replacement for the rest. Unit tests still own business logic and policy invariants. Contract tests still own tool schemas and API assumptions. Integration tests still own state movement across components. Human evaluation and production telemetry still own UX quality, rare edge cases, and distribution shift. Sim/eval owns the behavior that only emerges when everything runs end-to-end.
A mental model
Testing an agentic system has three steps:
Start with data. A scenario encodes one case the system should handle and what good behavior looks like in that case. The fixtures it references give the agent a consistent backend to query when the run starts. Choosing the scenarios is the first piece of work: coverage of the policy, weighting by traffic, including adversarial variation.
Run the simulation. Simulation produces a record of one episode by running the agent end-to-end on the scenario in a controlled environment. The run captures everything that happened: state changes, tool calls, messages, what the surface displayed.
Evaluate the record. Evaluation grades the record against the scenario's expectations. Some grading is deterministic; some needs models in the loop, calibrated to mean something.
The three stages can fail independently, which is why keeping them separated matters. A biased scenario set produces unreliable success rates. A buggy simulator produces incoherent runs that look like agent failures but aren't. An uncalibrated judge produces scores no one can interpret. Each problem has its own fix, but only if you can tell which stage produced the failure.
The boundary between simulation and evaluation needs particular care because it is the easiest one to blur. Simulation answers: given the trajectory so far, what does the simulator produce next? Evaluation answers: given the resulting episode, did the system meet expectations? Tangle these and you cannot tell whether a failure came from the agent, the simulator, or the judge. The conventional fix is to log the run as a structured artifact and grade the artifact, not grade inline.
Data
Testing well starts with two questions, in order. First, what situations will the system face: what users ask, what inputs the agent encounters, what edge cases exist, what failure modes the policy has to handle. Second, what good behavior looks like under those situations. A scenario encodes both: the world the agent is operating in, and how it is expected to behave there. Here is one for a customer-ops agent:
{
"scenario_id": "return_nonreturnable_earbuds",
"starting_prompt": "I want to return something I bought.",
"conversation_plan": {
"goal": "Return the wireless earbuds from order ORD-10027",
"identify_order": {
"if_asked_for_order_id": "Give ORD-10027",
"if_agent_offers_lookup": "Provide alex@example.com",
"if_shown_recent_orders": "Choose the earbuds order"
},
"target_item": "Wireless earbuds",
"return_reason": "The right earbud did not work on arrival",
"reaction_to_policy_denial": "Object once, then accept escalation if offered"
},
"user_persona": "brief, mildly frustrated, cooperative when asked clear questions",
"context": {
"policy": {
"non_returnable_categories": ["personal_audio", "perishables", "final_sale"]
}
},
"expectations": {
"allowed_terminal_states": ["return_denied_policy", "escalated_to_human"],
"forbidden_terminal_states": ["return_created", "refund_issued"]
}
}
The JSON shows both halves. The conversation_plan, user_persona, and context describe the world the agent operates in: who the user is, what they want, what the underlying policy says about it. The expectations describe how the system should act in that world: what terminal states are acceptable, what tools must or must not be invoked.
The scenario also implies state that lives outside the file. When the conversation plan says "Give ORD-10027" or "Provide alex@example.com," the agent's tool calls have to find consistent records: an order ORD-10027 in the mocked order service, a customer alex@example.com in the customer database, a return policy with the categories the scenario references. Keeping the scenario and the seeded state in sync is part of the data work. Small data sometimes goes inline in the scenario itself; larger or shared data goes in separate fixture files keyed by scenario ID and loaded into the mocks before the run starts.
The harder question is which conditions to test against. The first cut walks the policy: for every terminal state the system can legitimately reach, every refusal it should produce, every escalation it should trigger, write a scenario whose conditions should drive the system there. The second cut weights by production traffic. If 40% of your volume is order-status inquiries, your scenario set should not be 90% return requests. The point is to build confidence that the system handles what it will actually face, not to stress-test cases that are fun to write. A test population that does not look like production gives you a number with no calibration to anything you care about.
Adversarial variation is a separate axis of conditions worth covering. Users who change their mind, provide contradictory information, or try to social-engineer policy exceptions. The simulator persona is the lever. The same underlying policy under a cooperative user and under an adversarial one produces very different conversations, and you want to know the agent handles both. Adversarial scenarios interrogate the edges, particularly the points where policy enforcement matters most.
Scenarios and their fixtures are versioned test infrastructure. When a policy changes, scenarios and the data they reference change. When a new tool ships, scenarios that exercise it get written and the records it queries get seeded. When production telemetry surfaces a failure mode that current scenarios miss, a scenario is added. This is not glamorous, but it is the difference between a testing layer that keeps up with the system and one that quietly becomes fiction.
Simulation
Simulation is the part that produces the records evaluation will grade. Given a scenario, you run the agent end-to-end and let the episode run to a meaningful end. The agent itself is the easy part: you run the real system, exposed through its production interface. What requires actual design work is who is on the other side, what the environment looks like, and how the run terminates.
The user side has three real options. A human can drive the conversation, which gives the highest fidelity but is slow, expensive, and not how you build a regression suite. You can hard-code the user turns, which is deterministic and reproducible but fragile: the moment the agent legitimately diverges from the script, the test breaks for the wrong reason. Or you can use a synthetic user, typically an LLM driven by the scenario's persona and conversation plan, which generates the next turn dynamically in response to what the agent actually does. The conversation plan is the key piece. It is not a script of exact turns; it is a goal plus branching instructions keyed to what the agent might do (if asked for an order ID, give it; if shown recent orders, choose the right one; if denied, object once and accept escalation). The synthetic user reads those instructions and produces a coherent next turn given the trajectory so far. This is what makes simulation different from a golden dataset. It tolerates branching without requiring you to enumerate every branch in advance, which is the main thing that fails about scripted approaches. Google's ADK user simulation is a useful reference for the pattern.
In practice a real suite uses all three. Hard-coded turns work for narrow regression cases where you want determinism over generality, like making sure a known bug stays fixed. Synthetic users carry the bulk of the suite. Humans validate the synthetic user itself, on a sample of runs, periodically, and manually test the system.
The environment is the second choice. Mock or fake downstream services as you would in any integration test, especially anything that costs money, takes time, or has side effects you cannot reverse. The agent's reasoning, tool selection, and behavior stay real. What you replace is the expensive or side-effecting outside world.
Termination is the third. A run ends in one of three ways. The agent can signal completion: a resolution marker, a terminal event, an empty next turn, whatever the application uses to say it is done. The user can signal completion: a synthetic user that decides its goal has been met or that there is nothing further to ask, evaluated by the same model driving its turns. Or the run can hit a budget: max turns, max steps, max wall time, max tokens. The first two stop at the right places. The third is a backstop, useful when something has gone wrong but worth watching: if scenarios routinely terminate by budget rather than by either party signaling done, the budget is masking bugs that should be diagnosed and fixed.
Evaluation
Each run produces an episode: a collection of records of what happened. The exact set varies by system, but a typical run leaves you with some version of the following.
The world. The state in the systems of record after the episode ends. Was a return created? Was a refund issued? With what amount, what reason code, what target? This is what an auditor would see if they pulled the database tomorrow.
The path. The sequence of tool calls the agent made, with their arguments and responses. Which tools were invoked, in what order, with what inputs, and what came back from each.
The transcript. What the agent said to the user and what the user said back. This is what the customer actually experienced.
The surface the agent drives, when there is one. When the agent does anything beyond making tool calls, whether rendering a UI, sending control commands to a vehicle, generating audio for a voice line, or actuating a robot, what that surface actually did is its own record. Capture it through whatever channel the surface affords: screenshots and DOM snapshots for a UI, telemetry traces for a vehicle or robot, audio recordings for a voice agent. The path can confirm a refund command was issued; it cannot confirm the amount rendered above the fold. The path can confirm a steering command was sent; it cannot confirm the car turned. Tool fired is not the same as effected.
Real systems have additional records beyond these: application-specific event logs, distributed traces, error streams, internal reasoning when the model exposes it. The list above is a rough catalog of the kinds of objects most runs produce, not an exhaustive one. A single run produces multiple distinct streams, and the disagreements between them are diagnostic. An agent that issues the correct refund but tells the customer it issued the wrong amount is broken even though the world is fine. An agent that explains a policy correctly but never queried the policy tool is broken even though the customer heard the right answer. A complete grading uses all the records you have.
The records suggest a natural taxonomy of assertions. Four classes cover most of what comes up.
Outcome assertions ask whether the world changed correctly. Did the right state appear at the end? Were the wrong states avoided? This is the class the business actually cares about, and the cleanest to write: terminal state is in {return_denied_policy, escalated_to_human}, and is not in {return_created, refund_issued}.
Procedure assertions ask whether the path was legal. Did the agent check eligibility before issuing the refund? Look up the policy before quoting it? Avoid forbidden tools? Procedure matters because some outcomes are correct by accident and some are correct for reasons that will not generalize. A scenario where the agent skips the policy check but happens to land on the right answer is not the test passing. It is the test getting lucky.
Consistency assertions ask whether the records agree with each other. Did the agent issue the correct refund of $42 and tell the customer the correct amount, or some other number? Did the explanation it gave cite the same reason the policy tool returned, or a fictional one? Did the agent claim to have done something the path shows it did not do? This is the hallucination class. Path-only assertions miss it because the path looks fine; transcript-only assertions miss it because the transcript looks coherent. The error only shows up when you compare the two. And in production, the customer acts on what the agent said even when what the agent did was correct. If the two diverge, the customer is wrong about reality, and that becomes a support ticket the second they check their bank statement.
Surface assertions ask whether the agent's downstream surface did the right thing. For a UI: is the refund amount visible without scrolling? Is the cancellation policy shown above the Confirm button? Does the error message meet contrast requirements? For a vehicle: did the trajectory match the commanded path within tolerance? Did the deceleration profile stay below the comfort threshold? For a voice agent: was the audio intelligible? Was the speech rate appropriate? Where a human reads the surface and acts on it (the agent-assist operator on a UI, the safety driver in a vehicle), a broken surface means a broken workflow regardless of what the events say.
The taxonomy says what to check. Writing the checks is the next problem, and here the four classes start to look different from each other. Outcome and procedure assertions grade structured records (terminal state, tool calls), and the checks are usually deterministic and free to write: terminal state is in this set, tool X was called with these arguments, tool Y was not called. Consistency and surface assertions grade unstructured records (transcripts, screenshots, audio, telemetry) and need a model in the loop, as do genuinely subjective dimensions like tone, even though those do not fit cleanly into the four-class taxonomy. The model-in-the-loop part is where most evaluators go wrong.
When a model is in the loop, three distinct things matter for whether it produces signal or noise.
The first is question design. A model checking a binary fact against structured ground truth ("does this transcript contradict this fact from the path?", "is this amount visible in this screenshot?") has a much lower error rate than a model rating a whole episode 1-to-5 for quality. The narrow check has a low noise floor; the broad rating has a high one. Where you can phrase a check as extraction or fact comparison, do; reserve open-ended judgment for cases where there is no extractable fact. Even there the question can usually be sharpened: not "rate the empathy" but "did the agent acknowledge the customer's frustration before responding?"
The second is aggregation. A single judge call is noisy, and the noise comes from several sources: model sampling, position effects in the prompt, intra-rater inconsistency. Running the same prompt multiple times and aggregating (majority vote for binary checks, mean for continuous ones) reduces all of them. Score distributions also shift if you change the prompt or the model, which is a separate source of variance and a reason to control prompt changes deliberately.
The third is calibration. A confidence of 0.85 from a judge tells you about the model's output, not about the world, until you have measured how often 0.85 corresponds to a human labeler's "yes." Without that mapping, judge scores are not measurements; they are activations. Calibration against a small labeled set is what turns activations into measurements. It needs to be redone when you change the prompt or the model, and it needs to be cheap enough to run regularly. Without it, you cannot say what your numbers mean, which means you cannot say whether the system has gotten better or worse.
Question design, aggregation, and calibration handle the noise the judges contribute. The system being graded is itself stochastic, and that is a separate problem. Agentic systems produce different trajectories on different runs from the same starting state; sampling, retries, and tool nondeterminism all play a role. A single passing run is not evidence of reliable behavior, and a single failing run is not evidence of broken behavior. Run each scenario k times and report something like pass^k, the probability that all k runs pass. Report cost metrics (turns or steps, tokens, latency, tool calls) as distributions across those runs rather than as point estimates. A system that resolves cleanly 95% of the time at five steps is qualitatively different from one that resolves cleanly 95% of the time at fifteen.
One specific technique earns mention here: pairwise tests, which work both as assertions in their own right and as a way to anchor judges. Some properties of an episode are easier to express as orderings than as scores: a refund resolved in two turns should not score worse than one that takes six when both reach the same outcome; a transcript that cites the customer's actual order ID should not score worse than one that invents an ID; a denial that quotes the policy verbatim should not score worse than one that paraphrases it loosely. You define the pairs that matter for your domain, you check that the judge consistently ranks them the way it should, and you fold a few of them back into the judge prompt as in-context anchors. The judge's task becomes "rank like these examples" rather than "pick a number from a rubric," which holds up better across runs and across model versions. This is the directional-test pattern from CheckList (Ribeiro et al., 2020), applied to agentic eval.
Putting the four assertion classes, calibrated judges, and multiple trials together looks something like this:
# One trial; in practice this runs k times for pass^k and cost distributions
episode = run_scenario(app, scenario)
# Outcome: state in the world
assert episode.terminal_state in {
"return_denied_policy",
"escalated_to_human",
}
# Procedure: path through the tools
assert called(episode, "lookup_customer_orders")
assert called(episode, "get_return_policy")
assert not called(episode, "create_return")
assert not called(episode, "issue_refund")
# Consistency: transcript agrees with the path
assert_transcript_consistent_with_trace(episode)
# Surface: what the agent's downstream channel actually did
# (UI screenshot here; vehicle telemetry, audio capture, etc. in non-UI cases)
assert_surface(episode.screenshot("after_denial"), [
"The denial reason is visible without scrolling.",
"An escalation option is shown.",
])
The first two assertions are deterministic. The third and fourth are model-graded against scoped questions and calibrated thresholds. Together, run k times, they grade the episode on every channel that carries information about whether the system did the right thing.
Tools across the stack: understudy is a Python framework for the simulation and assertion layer; mimiq does end-to-end agentic simulation in JavaScript with Playwright and Cypress integration; layoutlens handles surface assertions for UIs.