
ServiceNow AI Launches EVA Voice Agent Benchmark

ServiceNow AI released EVA, an open-source benchmark for evaluating voice agents on both task accuracy and spoken interaction quality.

ServiceNow AI has launched EVA, a new benchmark for conversational voice agents that scores both task success and spoken interaction quality in the same end-to-end evaluation. The release includes the EVA framework, dataset, and leaderboard, with 50 airline scenarios and results across 20 voice systems. If you build production voice agents, this matters because EVA evaluates the full call flow over live audio, not a text-only proxy.

Benchmark design

EVA targets a real gap in agent evaluation. Existing benchmarks often isolate speech recognition, language quality, or single-turn dialog performance. EVA measures a complete multi-turn voice workflow, where the agent has to hear the user, reason over policy and constraints, call tools, speak the result back, and finish the task correctly.

The framework uses a five-part bot-to-bot audio setup: User Simulator, Voice Agent, Tool Executor, Validators, and Metrics Suite. The agent under test can be a cascade stack (STT to LLM to TTS) or an audio-native system, such as S2S or S2T followed by TTS.
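To make the cascade variant concrete, here is a minimal sketch of a single agent turn wired as STT to LLM to TTS. The class and stub components are hypothetical illustrations, not EVA's actual API.

```python
# Minimal sketch of one cascade voice-agent turn: STT -> LLM -> TTS.
# All names here are illustrative, not part of the EVA framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadeAgent:
    stt: Callable[[bytes], str]  # audio bytes -> transcript text
    llm: Callable[[str], str]    # transcript -> reply text (may call tools)
    tts: Callable[[str], bytes]  # reply text -> audio bytes

    def take_turn(self, user_audio: bytes) -> bytes:
        transcript = self.stt(user_audio)
        reply = self.llm(transcript)
        return self.tts(reply)

# Stub components standing in for real STT/LLM/TTS services.
agent = CascadeAgent(
    stt=lambda audio: audio.decode(),
    llm=lambda text: f"Echo: {text}",
    tts=lambda text: text.encode(),
)
print(agent.take_turn(b"rebook my flight"))  # b'Echo: rebook my flight'
```

An audio-native S2S system would collapse the three stages into one model, but the turn boundary that EVA evaluates over live audio stays the same.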

This is an important distinction for teams working on AI agents vs chatbots. EVA is not evaluating a scripted IVR or a text agent wrapped in audio. It is measuring whether a voice agent can complete a transactional workflow under telephony-style conditions.

Dataset and scenario scope

The first release is narrow by design. EVA ships with 50 English-language airline support scenarios covering flight rebooking, cancellations, same-day standby, and compensation vouchers.

Those scenarios are built to stress the failure modes that usually matter in production: temporal reasoning, policy-following, constraint satisfaction, and named-entity handling. In voice systems, that last category is often where expensive errors happen. Confirmation codes, flight numbers, dates, seat assignments, and dollar amounts are easy to mangle somewhere between ASR, reasoning, and TTS.

ServiceNow’s current experimental setup uses an ElevenLabs Agent as the simulator, GPT-4.1 as the simulator LLM, μ-law telephony audio at 8000 Hz, a 600-second maximum conversation duration, and interruptions disabled. Those details matter because they shape what the benchmark actually measures. EVA is end-to-end, but it is also tied to a specific voice simulation stack.
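As a reference, the launch setup described above can be captured in a plain config structure. The key names below are illustrative, not EVA's actual configuration schema; only the values come from the published setup.

```python
# Hypothetical config mirroring the launch parameters described above.
# Key names are illustrative, not EVA's real schema.
eva_launch_config = {
    "user_simulator": "ElevenLabs Agent",
    "simulator_llm": "gpt-4.1",
    "audio": {"codec": "mu-law", "sample_rate_hz": 8000},  # telephony-style audio
    "max_conversation_seconds": 600,
    "interruptions_enabled": False,
}
```

Pinning these values is what makes runs comparable; swapping the simulator or the audio codec changes what the benchmark is measuring.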

Scoring model

EVA reports two top-level scores: EVA-A for accuracy and EVA-X for experience.

EVA-A includes Task Completion, Faithfulness, and Speech Fidelity. Task Completion is deterministic, based on the expected versus actual database end state. Faithfulness uses a judge to check grounding, hallucinations, policy violations, and misrepresentation. Speech Fidelity checks whether the spoken output preserved critical entities correctly.
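The deterministic part of EVA-A is easy to picture: a scenario passes only if the database ends up in the expected state. Here is a minimal sketch of that comparison; the record fields are hypothetical, not EVA's schema.

```python
# Sketch of deterministic Task Completion scoring: compare the expected
# database end state against the state the agent actually produced.
# Field names are hypothetical, not EVA's schema.

def task_completed(expected_state: dict, actual_state: dict) -> bool:
    # Pass only if every expected field matches exactly.
    return all(actual_state.get(k) == v for k, v in expected_state.items())

expected = {"booking_status": "rebooked", "flight": "AC123", "seat": "14C"}
actual = {"booking_status": "rebooked", "flight": "AC123", "seat": "12A"}
print(task_completed(expected, actual))  # False: the seat was not preserved
```

The judge-based Faithfulness and Speech Fidelity checks layer on top of this: a task can end in the right state while the agent still misreads a code aloud or misstates a policy.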

EVA-X measures Conciseness, Conversation Progression, and Turn-Taking. Across the full framework, the released metric system contains 15 metrics spanning validation, accuracy, experience, and diagnostics. If you already use LLM-as-judge evaluation, EVA extends that pattern into live spoken interaction, where transcript quality alone is not enough.

Reliability is part of the benchmark

One of EVA’s better design choices is that it measures consistency explicitly. Results are reported with pass@k, the chance that at least one of k runs succeeds, and pass^k, the chance that all k runs succeed. The launch uses three trials per scenario, with k = 3.

For developers, this is the operationally useful number. A voice agent that succeeds once in three attempts can look capable in a demo and still fail your support queue. EVA’s published results show a substantial gap between pass@3 and pass^3 across tested systems, which means inconsistency remains a core problem.

| Metric | Meaning | Why it matters |
| --- | --- | --- |
| pass@3 | At least one of three runs succeeds | Captures best-case capability |
| pass^3 | All three runs succeed | Captures production reliability |

If you are already working on evaluating agents, EVA offers a cleaner way to separate occasional competence from repeatable performance.
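Computing both numbers from your own trial logs is straightforward. The sketch below uses hypothetical per-scenario outcomes with three trials each, matching the k = 3 used at launch.

```python
# pass@k: at least one of k runs succeeds; pass^k: all k runs succeed.
# Computed empirically over per-scenario trial outcomes (hypothetical data).

def pass_at_k(trials: list[bool]) -> bool:
    return any(trials)

def pass_all_k(trials: list[bool]) -> bool:
    return all(trials)

scenarios = [
    [True, True, True],     # consistent success
    [True, False, False],   # succeeds only sometimes
    [False, False, False],  # consistent failure
]
rate_at_3 = sum(pass_at_k(t) for t in scenarios) / len(scenarios)
rate_all_3 = sum(pass_all_k(t) for t in scenarios) / len(scenarios)
print(round(rate_at_3, 2), round(rate_all_3, 2))  # 0.67 0.33
```

The gap between the two rates is the inconsistency signal: the middle scenario counts toward pass@3 but not pass^3, which is exactly the demo-versus-production failure mode described above.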

Main findings from the launch

The benchmark highlights three patterns.

First, there is a clear accuracy-experience tradeoff. Systems that complete tasks more reliably often produce a weaker spoken interaction, while smoother conversational systems can lose precision on the underlying task.

Second, named-entity transcription and reproduction is a dominant failure mode. EVA calls out confirmation codes, flight numbers, and monetary amounts specifically. This aligns with how voice agents fail in practice, not at the abstract reasoning layer alone, but at the boundary between speech and structured state. If your workflows depend on tool use, this connects directly to the mechanics of function calling, because an entity error upstream often becomes a wrong tool argument downstream.
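One practical mitigation is to validate ASR-derived entities before they reach a tool call, and re-confirm anything that fails. The patterns below are illustrative assumptions (for example, that confirmation codes are six alphanumeric characters), not formats defined by EVA.

```python
# Sketch: validate ASR-derived entities before they become tool arguments.
# Patterns are illustrative assumptions; adjust to your booking system.
import re

ENTITY_PATTERNS = {
    "confirmation_code": re.compile(r"^[A-Z0-9]{6}$"),
    "flight_number": re.compile(r"^[A-Z]{2}\d{1,4}$"),
}

def validate_tool_args(args: dict) -> list[str]:
    """Return the fields that fail validation, for spoken re-confirmation."""
    return [
        field
        for field, pattern in ENTITY_PATTERNS.items()
        if field in args and not pattern.match(args[field])
    ]

# "AC1 23" is a plausible ASR mangle of "AC123": catch it before the tool call.
print(validate_tool_args({"confirmation_code": "QX7P2M", "flight_number": "AC1 23"}))
# ['flight_number']
```

Catching the malformed entity at this boundary turns a silent wrong-argument tool call into a cheap clarification turn.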

Third, multi-step workflows remain the complexity breaker. Rebooking while preserving ancillary services such as seats and baggage is one of the hard cases in the initial dataset. The issue is not just reasoning depth. It is whether the agent can maintain state across turns, follow policy, and produce correct spoken confirmations at each step.

Reproducibility and deployment implications

EVA is open source, but the released setup is not self-contained. The default configuration uses multiple commercial providers for simulation and judge metrics, including OpenAI, Gemini through Vertex AI, and Claude through Bedrock. Full reproduction requires access to those APIs.

The implementation also exposes the operational shape of a modern voice benchmark. Python 3.11 is required. The CLI entry point is eva. Output artifacts include assistant audio, user audio, mixed-call audio, transcripts, audit logs, and metrics files. This is useful if you care about LLM observability, because debugging voice agents requires more than text traces.

If you build voice agents, use EVA to test consistency under real audio conditions, then inspect failures at the entity boundary first. The fastest quality gains are likely to come from tightening ASR, tool argument validation, and spoken confirmation logic before tuning broader conversational style.
