May 6, 2026
Hearing Through the Noise: Evaluating Voice Agents in the Real World
Business Context
Voice agents have moved well past simple IVR replacement. Today they take drive-through orders, schedule medical appointments, dispatch service calls, and handle hundreds of other tasks that used to require a human on the other end of the line. The promise is enormous, and so is the gap between how these agents perform in a quiet recording booth and how they perform when a real person actually picks up the phone.
This case study draws on patterns Okareo has seen across multiple voice AI engagements. While the specific customers, industries, and workflows differ, the underlying problem is strikingly consistent: production voice agents need to work in environments that are loud, crowded, and full of overlapping voices, and most "clean room" evaluation approaches simply do not address this reality. The teams we work with have come to the same conclusion. A successful evaluation pipeline has to reflect the world callers actually live in, not the one engineers tested in.
Challenges
Across these engagements, agents were performing well on curated test sets and internal demos, but production telemetry told a different story. Drop-offs, reorders, mis-bookings, and frustrated escalations clustered around a predictable set of conditions, and almost none of those conditions were represented in the teams' existing test harnesses.
Specifically, the teams needed to evaluate agents under circumstances such as:
Persistent background noise: engine sounds, kitchen clatter, HVAC, crowd chatter, music, traffic, and wind.
Cross-talk and secondary speakers: a passenger relaying a child's order across the car, a spouse calling out symptoms from another room, a coworker chiming in on a service call.
Coaching and nudging from the background: perhaps the most subtle failure mode. A caller booking a medical appointment is reminded mid-sentence to mention a second symptom. A driver placing an order is corrected on the size of a drink. The primary speaker's intent shifts in real time, often without a clean handoff. One way to encode such a condition as a test scenario is sketched after this list.
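To make this concrete, here is a minimal sketch of how a multi-speaker condition like the drive-through correction might be encoded as a test scenario. The class names and fields are illustrative assumptions made for this write-up, not Okareo's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical scenario encoding for illustration only; the names and
# fields are assumptions, not Okareo's actual schema.

@dataclass
class Speaker:
    role: str                       # "primary", "coach", or "bystander"
    utterances: list[str] = field(default_factory=list)

@dataclass
class VoiceScenario:
    task: str                       # the business task under test
    ambient: str                    # background soundscape label
    snr_db: float                   # noise level relative to the primary speech
    speakers: list[Speaker] = field(default_factory=list)

# A drive-through order where a passenger corrects the drink size mid-turn.
drive_through = VoiceScenario(
    task="place_order",
    ambient="car_interior_engine",
    snr_db=8.0,
    speakers=[
        Speaker(role="primary", utterances=["Can I get a large iced coffee?"]),
        Speaker(role="coach", utterances=["No, make it a medium!"]),
    ],
)
```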
These conditions surfaced compounding failures across the entire voice stack:
Voice Activity Detection (VAD): triggering on background speech, cutting off the primary caller, or failing to detect end-of-turn.
Speaker identification and diarization: confusing the primary caller with the background speaker, leading the agent to act on instructions never intended for it.
Transcription accuracy: degrading sharply once the signal-to-noise ratio (SNR) dropped below clean-room thresholds, with proper nouns, numbers, and modifiers, the parts that actually matter to the workflow, failing first; a sketch of controlled SNR degradation follows this list.
Workflow correctness: breaking down when corrections from a background speaker arrived after the agent had already committed to a tool call or state transition.
Latency: compounding all of the above. Every additional pass of clarification widened the window in which background interference could derail the call.
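The SNR threshold behavior is easy to reproduce offline. The function below is a minimal sketch in plain NumPy, generic audio math rather than any vendor's tooling, that mixes a noise bed into clean speech at a chosen SNR so the same utterance can be replayed under progressively worse conditions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise bed into clean speech at the target signal-to-noise ratio."""
    # Tile or trim the noise to cover the full length of the speech signal.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Stand-ins for real audio: a tone for speech, white noise for the background.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))
noise = rng.normal(size=8000)

# Replay the same utterance at progressively worse SNRs to find the point
# where names and numbers start dropping out of the transcript.
for snr in (20.0, 10.0, 5.0, 0.0):
    degraded = mix_at_snr(speech, noise, snr)
```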
These teams needed a way to evaluate all of these failure modes together, against scenarios that mirrored what was actually happening in production, and to do so continuously as models, prompts, and orchestration logic changed.
Okareo's Solution
Each team integrated Okareo into their voice agent development and release process to build an evaluation pipeline that treated noise, cross-talk, and background coaching as first-class inputs rather than edge cases.
Working with Okareo, these teams were able to:
Simulate realistic acoustic and conversational conditions: defining scenarios that paired business tasks (place an order, book an appointment, reschedule a service) with the messy environmental and social conditions in which those tasks actually occur.
Generate synthetic multi-speaker scenarios: including primary callers, background speakers offering corrections or prompts, and ambient soundscapes, to expand coverage well beyond what manual recording could produce.
Evaluate the full voice stack end-to-end: measuring VAD behavior, speaker attribution, transcription fidelity, workflow correctness, and latency as a single, integrated signal rather than as disconnected component metrics.
Run evaluations continuously in CI: ensuring that every change to the model, prompt, tool definitions, or orchestration logic was measured against the same demanding scenario library before reaching production, as sketched below.
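In practice the CI hookup can be a small gate script that scores the scenario library against the candidate build and fails the pipeline on regression. The sketch below is hypothetical: the metric names and thresholds are placeholders, and in CI the results would come from the evaluation run rather than being hard-coded.

```python
import sys

# Hypothetical CI gate; metric names and thresholds are placeholders.
THRESHOLDS = {
    "task_completion_rate": 0.90,           # floor
    "speaker_attribution_accuracy": 0.95,   # floor
    "p95_turn_latency_s": 1.5,              # ceiling
}

def gate(results: dict) -> int:
    """Return a nonzero exit code if any scenario regresses past a threshold."""
    failures = []
    for scenario, m in results.items():
        if m["task_completion_rate"] < THRESHOLDS["task_completion_rate"]:
            failures.append(f"{scenario}: completion {m['task_completion_rate']:.2f}")
        if m["speaker_attribution_accuracy"] < THRESHOLDS["speaker_attribution_accuracy"]:
            failures.append(f"{scenario}: attribution {m['speaker_attribution_accuracy']:.2f}")
        if m["p95_turn_latency_s"] > THRESHOLDS["p95_turn_latency_s"]:
            failures.append(f"{scenario}: p95 latency {m['p95_turn_latency_s']:.2f}s")
    for line in failures:
        print("REGRESSION:", line)
    return 1 if failures else 0

if __name__ == "__main__":
    # In CI these numbers would come from running the scenario library
    # against the candidate build; hard-coded here for illustration.
    results = {
        "drive_through_crosstalk": {
            "task_completion_rate": 0.88,
            "speaker_attribution_accuracy": 0.97,
            "p95_turn_latency_s": 1.2,
        },
    }
    sys.exit(gate(results))
```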
Key Areas Where Okareo Helped
Scenario Coverage That Reflects Reality: Okareo's simulation and synthetic data capabilities allowed teams to systematically construct scenarios combining task type, acoustic environment, number of speakers, and the role of each speaker (primary, coach, bystander). Coverage that previously required weeks of manual recording could be expanded in hours.
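The combinatorics are what make manual recording impractical. A minimal sketch of the expansion, with illustrative labels rather than a prescribed taxonomy:

```python
from itertools import product

# Illustrative dimension labels, not a prescribed taxonomy.
tasks = ["place_order", "book_appointment", "reschedule_service"]
environments = ["quiet", "kitchen_clatter", "road_noise", "crowd_chatter"]
speaker_setups = ["solo", "with_coach", "with_bystander"]

# Crossing even three small dimensions yields broad coverage immediately.
scenarios = [
    {"task": t, "environment": e, "speakers": s}
    for t, e, s in product(tasks, environments, speaker_setups)
]
print(len(scenarios))  # 36 combinations from three small dimensions
```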
Diagnosing Failures Across the Stack: Because Okareo evaluates the agent's behavior holistically, teams could trace a failed appointment booking back to its root cause rather than guessing from logs, whether that cause was a VAD cutoff, a misattributed utterance, a transcription error on a medication name, or a workflow that committed too early.
Handling Background Coaching as a First-Class Case: Teams built scenarios in which a background speaker actively shapes the conversation by prompting the primary caller, contradicting them, or interjecting late corrections. Okareo's evaluation framework made it possible to define what "correct" agent behavior looks like in these cases and to measure progress against it.
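One way to pin down what "correct" means is an explicit assertion over the agent's final committed state. The check below is a hypothetical sketch: it assumes the harness exposes the final order as a dict, and that in this particular scenario the passenger's late correction is legitimate and should be honored.

```python
def check_late_correction(final_order: dict, correction: dict) -> bool:
    """Pass only if the agent's committed order reflects the late correction."""
    return all(final_order.get(k) == v for k, v in correction.items())

# The passenger said "make it a medium" after the caller asked for a large;
# the committed order should reflect the correction, not the original request.
assert check_late_correction(
    final_order={"item": "iced_coffee", "size": "medium"},
    correction={"size": "medium"},
)
```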
Latency-Aware Evaluation: Latency was treated as a quality dimension, not an afterthought. Teams could see how added clarification turns or fallback paths affected both task success and the time-to-completion that callers actually experience.
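One simple way to fold latency into quality scoring is to blend task success with time-to-completion, as in the hypothetical sketch below; the target duration and penalty weight are illustrative choices, not recommended constants.

```python
def call_score(task_succeeded: bool, completion_s: float,
               target_s: float = 60.0, latency_weight: float = 0.3) -> float:
    """Blend task success with time-to-completion; weights are illustrative."""
    if not task_succeeded:
        return 0.0
    # Linearly penalize time beyond the target, capped at the full weight.
    overrun = min(1.0, max(0.0, (completion_s - target_s) / target_s))
    return 1.0 - latency_weight * overrun

print(call_score(True, 45.0))    # 1.0: fast success, no penalty
print(call_score(True, 120.0))   # 0.7: success, but full latency penalty
```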
Production Guardrails and Continuous Monitoring: Once agents were live, Okareo's monitoring surfaced new failure patterns from real traffic, which fed back into the scenario library and closed the loop between production reality and pre-release evaluation.
Results
Substantially Improved Robustness in Noisy Environments: Task completion rates in high-noise and multi-speaker conditions improved meaningfully, narrowing the gap between clean-room performance and real-world performance.
Faster Diagnosis, Faster Iteration: Failures that previously required hours of log spelunking could be attributed to a specific layer of the stack, allowing teams to ship targeted fixes instead of broad rewrites.
Confidence to Expand Into Harder Use Cases: With scenario libraries that reflected real-world conditions and evaluation pipelines that ran on every change, teams were able to take on use cases that would have been considered too risky to deploy without this level of evaluation rigor, including sensitive workflows like healthcare scheduling.
A Reusable Evaluation Asset: The scenario library itself became a durable asset, growing with every new failure mode discovered in production and protecting against regressions across model and platform changes.
Conclusion
Voice agents do not fail in the lab. They fail in cars and kitchens, waiting rooms and warehouses, wherever the people who actually use them happen to be. Evaluating them against the conditions of those environments, including the social dynamics of who is speaking and why, is what separates a demo-quality agent from a production-quality one.
By using Okareo to build evaluation scenarios that take noise, cross-talk, and background coaching seriously, voice AI teams have been able to ship agents that hold up where it counts. The result is not just a better-performing agent, but a development process that treats the messiness of real human conversation as something to engineer for, rather than around.
