Simulating and Evaluating Agentic Systems
Most teams building agentic systems know they need some way to test multi-turn conversations. The agent interprets ambiguous input, picks tools, maintains state, and has to land in the right place at the end. You can't eyeball that. So people reach for what they already know.
The most common starting point is the golden dataset: fixed prompts, fixed expected outputs, sometimes a few scripted turns. This is useful for regression testing on known patterns, but conversations branch. An agent might ask for two missing values one at a time or both at once. A customer might push back on a denial or accept it immediately. The moment the agent legitimately diverges from the script, the test breaks, not because anything went wrong, but because the fixture was too rigid. You end up either over-constraining the agent or writing baroque expected-output logic that is itself a source of bugs.
Benchmark suites like AppWorld solve a different problem. They measure general capability across a distribution of tasks, which is great for model selection. But they test the engine, not the car. They don't exercise your tools, your policies, your edge cases.
Once the golden dataset starts creaking, teams usually bolt on LLM-as-judge. Hand the transcript to a strong model, ask it to rate quality on some rubric. The reason people adopted this is that work like MT-Bench and Chatbot Arena showed LLM judges can agree with human preferences at a reasonable rate. But the same research documents real problems: position bias, verbosity bias, self-enhancement bias, low intra-rater reliability, and sensitivity to how you word the scoring prompt. A single judge pass is noisy. Treat it as ground truth in CI and you are going to have a bad time.
Another common move is asserting against what the agent says. Did it mention the refund? Did it apologize? This feels intuitive but is dangerous for agentic systems. An agent can claim it checked a policy without querying the policy tool. It can say it processed a return without invoking the return action. Natural language is not the system of record. Assertions that only inspect prose test the agent's ability to narrate, not its ability to act.
None of these are wrong in isolation. Golden datasets catch regressions. LLM judges surface quality issues. Output assertions verify communication. The problem is treating any one of them as the whole solution.
Where Sim/Eval Fits in a Testing Stack
The useful question is: what does multi-turn simulation and evaluation test that other methods cannot?
Conversational branching. Ambiguity handling. Tool use in context. Recovery from misunderstanding. End-to-end task resolution under realistic variation. That is the part of an agentic system that only shows up when language, tools, policies, and state collide in sequence across turns. Unit tests can't reach it. Contract tests can't reach it. Integration tests on individual components can't reach it.
Which also means sim/eval is one layer in a stack, not a replacement for the rest. Unit tests still own business logic and policy invariants. Contract tests still own tool schemas and API assumptions. Integration tests still own state movement across components. Human evaluation and production telemetry still own UX quality, rare edge cases, and distribution shift. Sim/eval owns the conversational flow that only emerges when everything is running together.
How It Works
The application runs. The agent under test is up, exposed through an API, processing real tool calls against real or realistically mocked backends. You are testing the system, not a transcript.
We simulate the user side. The agent is real; the user is synthetic. A simulator takes a scenario, a persona, and the conversation history, and generates the next user turn dynamically. It responds to what the agent actually does, which is what makes this different from a golden dataset. Google's ADK user simulation is a useful reference for this pattern.
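As a minimal sketch of the simulator side: build a prompt from the scenario, persona, and history, and hand it to a `complete(prompt) -> str` callable that stands in for whatever LLM client you use. All names here are illustrative, not ADK's API.

```python
def build_simulator_prompt(scenario, history):
    """Assemble the prompt that asks an LLM to produce the next user turn."""
    lines = [
        "You are simulating a customer talking to a support agent.",
        f"Persona: {scenario['user_persona']}",
        f"Goal: {scenario['conversation_plan']['goal']}",
        "Branching instructions; follow whichever matches the agent's last message:",
    ]
    for value in scenario["conversation_plan"].values():
        if isinstance(value, dict):  # a branching sub-plan keyed by agent behavior
            for condition, action in value.items():
                lines.append(f"- {condition}: {action}")
    lines.append("Conversation so far:")
    lines.extend(f"{role}: {text}" for role, text in history)
    lines.append("Reply with the user's next message only.")
    return "\n".join(lines)

def next_user_turn(scenario, history, complete):
    """`complete` is any prompt -> completion callable (your LLM client)."""
    return complete(build_simulator_prompt(scenario, history))
```

The point of the shape: the simulator sees the full history every turn, so it can react to whatever the agent actually did rather than replay a script.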
Scenarios are conversation fixtures. A scenario file encodes the conversational setup and the business world: user intent, persona, policy state, and expected outcomes. Here is what one looks like for a customer-ops agent:
{
  "scenario_id": "return_nonreturnable_earbuds",
  "starting_prompt": "I want to return something I bought.",
  "conversation_plan": {
    "goal": "Return the wireless earbuds from order ORD-10027",
    "identify_order": {
      "if_asked_for_order_id": "Give ORD-10027",
      "if_agent_offers_lookup": "Provide alex@example.com",
      "if_shown_recent_orders": "Choose the earbuds order"
    },
    "target_item": "Wireless earbuds",
    "return_reason": "The right earbud did not work on arrival",
    "reaction_to_policy_denial": "Object once, then accept escalation if offered"
  },
  "user_persona": "brief, mildly frustrated, cooperative when asked clear questions",
  "context": {
    "policy": {
      "non_returnable_categories": ["personal_audio", "perishables", "final_sale"]
    }
  },
  "expectations": {
    "allowed_terminal_states": ["return_denied_policy", "escalated_to_human"],
    "forbidden_terminal_states": ["return_created", "refund_issued"]
  }
}
Notice what the conversation_plan does. It doesn't script exact turns. It gives the simulator a goal and branching instructions keyed to what the agent might do. The simulator can handle multiple valid agent behaviors without the fixture breaking.
Mock downstream services as needed. If the agent calls an order-lookup API or a refund service, mock or fake those just as you would in any integration test. The agent's logic, tool selection, and conversational behavior are real. What you replace is the expensive or side-effecting outside world.
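As a sketch, the fakes can be plain in-memory objects seeded from the scenario's context block. Class and method names below are hypothetical, not a real framework:

```python
class FakeOrderService:
    """In-memory stand-in for the order-lookup backend, seeded per scenario."""
    def __init__(self, orders):
        self.orders = orders

    def lookup_by_email(self, email):
        return [o for o in self.orders if o["email"] == email]

class FakePolicyService:
    """Serves the policy state defined in the scenario's context block."""
    def __init__(self, policy):
        self.policy = policy

    def is_returnable(self, category):
        return category not in self.policy["non_returnable_categories"]

class FakeReturnService:
    """Records side effects instead of performing them, so tests can assert on them."""
    def __init__(self):
        self.created = []

    def create_return(self, order_id, item):
        self.created.append((order_id, item))
        return {"status": "created"}
```

Because the fakes record rather than perform side effects, a forbidden action like `create_return` leaves evidence in the trace instead of a refund in production.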
Log the execution trace. The trace is the source of truth: tool calls, arguments, retrieved records, state transitions, and the terminal resolution. This is what you assert against.
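One way to structure that record, as a sketch (field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class ToolCall:
    name: str      # tool identifier, e.g. "get_return_policy"
    args: dict     # arguments the agent passed
    result: Any    # what the tool returned

@dataclass
class Trace:
    """Execution trace: the record the evaluator asserts against."""
    tool_calls: list = field(default_factory=list)
    turns: list = field(default_factory=list)        # (role, text) pairs
    terminal_state: Optional[str] = None             # set when the run resolves

    def record(self, name, args, result):
        self.tool_calls.append(ToolCall(name, args, result))
```

Whatever the exact shape, the key property is that it captures actions and state, not just prose, so assertions can target what the agent did.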
Conversations run to a meaningful end. The stop condition is a resolution: success, refusal, escalation, handoff, timeout, policy denial. Not an arbitrary turn count. The question is whether the conversation converged to the right outcome, not how it sounded at turn three.
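A sketch of the driver loop under that stop condition, with a turn cap only as a guardrail against runaway conversations (`agent_turn` and `user_turn` are stand-ins for your agent API and the simulator):

```python
TERMINAL_STATES = {
    "resolved", "return_denied_policy", "escalated_to_human",
    "handed_off", "timed_out",
}

def run_to_resolution(agent_turn, user_turn, max_turns=30):
    """Alternate simulator and agent until the agent reports a terminal state.

    `agent_turn(message)` returns (reply_text, state); state is None while
    the conversation is unresolved. The cap is a safety net, not the goal.
    """
    turns = []
    message = user_turn(turns)            # opening user message
    for _ in range(max_turns):
        reply, state = agent_turn(message)
        turns += [("user", message), ("agent", reply)]
        if state in TERMINAL_STATES:
            return turns, state
        message = user_turn(turns)        # simulator reacts to the new history
    return turns, "timed_out"
```

Note the asymmetry: only the agent can end the run with a resolution; the harness can only time it out.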
Keep simulation and evaluation separate. Simulation answers: given the goal, persona, and history, what would the user do next? Evaluation answers: given the resulting trace, did the system meet expectations? If you tangle these, you cannot tell whether a failure came from the agent, the simulator, or the judge.
Assert against the trace, not the prose. For the earbuds scenario above:
trace = run_scenario(app, scenario)

assert called(trace, "lookup_customer_orders")
assert called(trace, "get_return_policy")
assert not called(trace, "create_return")
assert not called(trace, "issue_refund")
assert trace.terminal_state in {
    "return_denied_policy",
    "escalated_to_human",
}
The scenario says what the world looks like. The assertions say what should and should not happen in that world. That is concrete and auditable in a way that "did the agent say something reasonable" never will be.
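A helper like `called` is a few lines over the trace. A sketch, assuming tool calls are recorded as objects with `name` and `args` attributes (the `called_with` variant is an illustrative extension, not from the snippet above):

```python
def tool_names(trace):
    """Names of all tools invoked, in order."""
    return [call.name for call in trace.tool_calls]

def called(trace, tool_name):
    """True if the agent invoked the tool at least once during the run."""
    return tool_name in tool_names(trace)

def called_with(trace, tool_name, **expected):
    """True if some call to the tool included all the expected arguments."""
    return any(
        call.name == tool_name
        and all(call.args.get(k) == v for k, v in expected.items())
        for call in trace.tool_calls
    )
```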
Practical Concerns
LLM-as-judge has a role, but a bounded one. It is good for dimensions that resist deterministic checks: tone, clarity, empathy, whether the explanation made sense. Treat it as a noisy signal. Sample the judge multiple times, use majority vote, and know that score distributions shift if you change the prompt. In CI, you are adding variance that trace assertions do not have. Set thresholds accordingly.
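A sketch of the sampling pattern (the `judge` callable is a stand-in for your judge prompt, returning a pass/fail verdict for a transcript):

```python
from collections import Counter

def judge_verdict(judge, transcript, samples=5):
    """Sample a noisy LLM judge several times and take the majority verdict.

    Returns the verdict plus the agreement rate, so a 3-2 split can be
    flagged for human review instead of silently gating CI.
    """
    votes = Counter(judge(transcript) for _ in range(samples))
    verdict, count = votes.most_common(1)[0]
    return verdict, count / samples
```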
Scenario coverage should reflect real traffic. If 40% of your volume is order-status inquiries, your scenario set should not be 90% return requests. The point is to build confidence that the system handles what it will actually face, not just the cases that are fun to write.
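One way to keep the set honest is to draw scenario batches with intents weighted by observed traffic share. A sketch, with illustrative names:

```python
import random

def sample_scenarios(scenarios_by_intent, traffic_share, n, seed=0):
    """Draw n scenarios, weighting intents by their share of real traffic."""
    rng = random.Random(seed)  # seeded so the batch is reproducible in CI
    intents = list(traffic_share)
    picks = rng.choices(intents, weights=[traffic_share[i] for i in intents], k=n)
    return [rng.choice(scenarios_by_intent[intent]) for intent in picks]
```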
Include adversarial variation. Customers who change their mind, provide contradictory information, or try to social-engineer policy exceptions. The simulator persona is the lever. A cooperative customer and an adversarial customer exercising the same policy produce very different conversations, and you want to know the agent handles both.
Treat scenario files as test infrastructure. Version-controlled, reviewed, maintained. When a policy changes, scenarios change. When a new tool ships, scenarios that exercise it get written. This is not glamorous, but it is the difference between a testing layer that keeps up with the system and one that quietly becomes fiction.