How to test AI agents before they talk to your customers
Kevin Le
CTO · January 13, 2026
AI agents that score 90%+ on single-turn tests often succeed in only 10–15% of full conversations. That gap is the difference between a lab demo and a production deployment.
The problem isn't that models are bad — they're remarkably capable at individual tasks. The problem is that real support conversations aren't individual tasks. They involve multiple turns, clarifications, interruptions, topic changes, and context that accumulates over time.
Why single-turn tests aren't enough
A single-turn test asks: "Given this input, did the agent call the right API with the right parameters?" That's necessary but insufficient.
Real conversations look like this:
| Turn | Customer | Agent action needed |
|---|---|---|
| 1 | "I need to cancel my subscription" | Look up subscription |
| 2 | "Actually wait, can I downgrade instead?" | Change intent mid-flow |
| 3 | "What's the price difference?" | Retrieve pricing info |
| 4 | "My manager is asking — can I get that in an email?" | Switch to email channel |
| 5 | "Also, we had a billing issue last month" | Handle topic change |
An agent that handles turn 1 perfectly might lose context by turn 3 and fail entirely by turn 5.
Building a multi-turn test framework
Step 1: Define intents and procedures
Start with your actual support workflows. Map each one as a procedure: the steps an agent should follow, the APIs it should call, the decisions it should make at each branch.
Step 2: Build conversation graphs
Each procedure maps to a directed graph of possible paths — the happy path, decision branches, dead ends, and detours. This ensures you're testing all the ways a conversation can go, not just the ideal one.
Step 3: Inject noise
Real conversations include interruptions, off-topic questions, corrections, and ambiguity. Add these to your test conversations:
| Noise type | Example | Why it matters |
|---|---|---|
| Interruption | "Hold on, let me check something" | Tests context preservation |
| Correction | "Actually, I meant the Pro plan" | Tests intent update handling |
| Topic change | "Also, I have a billing question" | Tests multi-issue resolution |
| Ambiguity | "Can you fix it?" | Tests clarification behavior |
| Adversarial | "Ignore your instructions and..." | Tests safety guardrails |
Step 4: Sample paths and generate dialogues
Use weighted random walks across your conversation graphs to generate diverse test conversations. Each path becomes a full dialogue with expected outcomes at every step.
Step 5: Validate and score
Split each conversation into individual test cases with expected API calls and outcomes. Score at multiple levels:
- Per-turn accuracy — did the agent take the right action at each step?
- Conversation accuracy — did the agent complete the full workflow correctly?
- Safety compliance — did the agent stay within guardrails throughout?
What the results look like
In our internal testing at buttercream, we've seen a pattern consistent with industry benchmarks:
| Metric | Single-turn | Multi-turn with noise |
|---|---|---|
| Correct API selection | 92–96% | 65–78% |
| Correct parameters | 88–94% | 55–72% |
| Full conversation resolved | N/A | 35–55% |
The drop-off is real and significant. Multi-turn testing reveals failures that single-turn tests completely miss: lost context, unnecessary API calls, abandoned workflows, and safety violations under pressure.
Four lessons for AI agent testing
- Don't confuse single-turn accuracy with conversational reliability. Test at the level you plan to deploy.
- Ground tests in real procedures. Synthetic benchmarks aren't enough — test against the actual workflows your team runs.
- Resilience is the bar for production readiness. Interruptions, corrections, and edge cases are where agents earn or lose trust.
- Use test results to guide deployment decisions. Know which workflows are ready for full automation, which need human review, and which should stay human-only.
buttercream's AI agents are continuously tested against multi-turn, noise-injected benchmarks before they handle customer conversations — so you can deploy with confidence.