EVA-Bench 2.0 Pits 12 Voice Models Against 213 Tasks

On June 4, ServiceNow AI released EVA-Bench Data 2.0, a large-scale open-source framework for benchmarking conversational voice agents. The release shifts the focus from simple text-based interactions toward complex, multi-turn enterprise workflows that rely on speech-to-speech (S2S) models and audio-native large language models.

Part of ServiceNow’s broader NOWAI-Bench initiative showcased at Knowledge 2026, the framework uses a bot-to-bot architecture to automate the evaluation of spoken dialogue. It is distributed via Hugging Face and GitHub under an open-source license.

Enterprise Workflows and Tool Use

The dataset is built around the specific failure modes of voice-first enterprise applications. It categorizes testing data into three core domains: HR, IT Support, and Customer Service.

Across these domains, the benchmark tests agents on 213 distinct, multi-turn dialogue scenarios. The scenarios draw on 121 simulated enterprise tools, including ticket creation systems, hardware troubleshooting interfaces, and benefits lookup databases, which the agent must select and call correctly. For developers who need to evaluate and test AI agents, this provides a standardized sandbox for measuring autonomous tool execution in voice-driven environments.
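To make the idea of a simulated tool concrete, here is a minimal sketch in Python. It is not EVA-Bench's actual harness: the `SimulatedToolbox` class, the `create_ticket` tool, and the `called_expected_tools` helper are all hypothetical names, assumed purely for illustration. The sketch shows one plausible way a sandbox could log tool calls and check that an agent invoked the expected tools in order.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedToolbox:
    """Hypothetical stand-in for a set of simulated enterprise tools."""
    tickets: dict = field(default_factory=dict)
    call_log: list = field(default_factory=list)

    def create_ticket(self, summary: str) -> str:
        # Store the ticket in memory and record the call for later scoring.
        ticket_id = f"TCK-{len(self.tickets) + 1:04d}"
        self.tickets[ticket_id] = summary
        self.call_log.append(("create_ticket", {"summary": summary}))
        return ticket_id


def called_expected_tools(call_log, expected_names):
    """True if the expected tool names appear in the log in order.

    Extra calls between the expected ones are allowed.
    """
    names = iter(name for name, _ in call_log)
    # `expected in names` advances the iterator, so order is enforced.
    return all(expected in names for expected in expected_names)


toolbox = SimulatedToolbox()
ticket_id = toolbox.create_ticket("Laptop will not boot")
```

Logging calls rather than inspecting the transcript keeps the check independent of how the agent phrases its spoken reply.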

Accuracy and Experience Metrics

Voice agents fail differently than text-based chatbots, requiring different telemetry. The research team, led by Hari Subramani, Issam Laradji, and Anitha Raghavan, split the framework into two fundamental measurement dimensions:

EVA-A (Accuracy): Measures whether the agent successfully completed the requested task and remained faithful to the provided enterprise data during the conversation.
EVA-X (Experience): Measures the naturalness, conciseness, and appropriateness of the spoken dialogue. This surfaces voice-specific issues such as robotic phrasing or interrupting the user at the wrong moment.
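The two dimensions can be pictured as separate aggregates over per-scenario results. The sketch below is an assumption-heavy illustration, not the benchmark's scoring code: the `ScenarioResult` fields, the 0-to-1 scale, and the unweighted averaging are all hypothetical choices, since the article does not describe how EVA-A and EVA-X are actually computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScenarioResult:
    # Hypothetical record; field names are illustrative, not EVA-Bench's schema.
    scenario_id: str
    task_completed: bool     # accuracy side: did the agent finish the task?
    faithful: bool           # accuracy side: did it stay true to the provided data?
    naturalness: float       # experience side, each assumed to lie in [0, 1]
    conciseness: float
    appropriateness: float


def accuracy_score(results):
    """Fraction of scenarios that were both completed and faithful."""
    return mean(1.0 if (r.task_completed and r.faithful) else 0.0 for r in results)


def experience_score(results):
    """Unweighted mean of the three experience sub-scores across scenarios."""
    return mean(
        mean((r.naturalness, r.conciseness, r.appropriateness)) for r in results
    )


results = [
    ScenarioResult("hr-001", True, True, 0.9, 0.8, 1.0),
    ScenarioResult("it-014", True, False, 0.6, 0.6, 0.6),
]
```

Keeping the two scores separate matters because an agent can complete every task while still sounding unnatural, or sound pleasant while getting the task wrong.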

To ensure models are tested against real-world conditions, EVA-Bench 2.0 includes a comprehensive perturbation suite. This applies various audio stressors to the inputs, including diverse user accents, environmental background noises, and simulated network connection degradation.
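As one example of an audio stressor, background noise is commonly mixed into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows that general technique in plain Python. It is not taken from EVA-Bench, whose perturbation code is not described here, and the function names and the synthetic sine-wave "speech" are assumptions for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mix has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise track: nothing to add
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone for speech, seeded Gaussian noise.
sample_rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` downward gives a simple way to see at which noise level an agent's task completion starts to degrade.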

Architectural Comparisons

The initial dataset includes benchmark results for 12 leading systems. This testing highlights the performance differences between traditional cascade architectures (which chain separate speech-to-text, text-based LLM, and text-to-speech models) and newer audio-native frontier models like GPT-4o and Gemini 1.5 Pro.
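The structural difference between the two designs can be sketched in a few lines of Python. The stage functions below are toy stubs, assumed only for illustration, and do not correspond to any real speech or LLM API. A cascade agent composes three stages, while an audio-native agent would be a single function with the same audio-in, audio-out signature.

```python
from typing import Callable

# Both architectures expose the same interface: audio bytes in, audio bytes out.
AudioAgent = Callable[[bytes], bytes]


def make_cascade_agent(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> AudioAgent:
    """Chain speech-to-text, a text LLM, and text-to-speech into one agent."""
    def respond(audio_in: bytes) -> bytes:
        text_in = stt(audio_in)    # stage 1: transcribe
        text_out = llm(text_in)    # stage 2: reason over text only
        return tts(text_out)       # stage 3: synthesize the reply
    return respond


# Toy stubs standing in for real models (hypothetical, for illustration).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade_agent = make_cascade_agent(fake_stt, fake_llm, fake_tts)
reply = cascade_agent(b"reset my password")
```

Because the text LLM in a cascade only sees the transcript, anything lost in transcription, such as tone or overlapping speech, never reaches it. That is one reason the two designs can behave differently under audio stress.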

If you are deciding between GPT, Claude, and Gemini for a real-time voice integration, the EVA-Bench results show how the tested models handle complex tool calling under varying levels of audio stress.

If you build voice-first enterprise applications, EVA-Bench 2.0 provides a reproducible way to measure regression during model updates. You can run the testing suite locally using the project’s specialized Python orchestration engine (supporting Python 3.11–3.13) or explore the scenario data directly through the Hugging Face dataset viewer before building your own evaluations.
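A regression check between two benchmark runs can be as simple as comparing per-scenario scores. The following sketch assumes each run has already been reduced to a dictionary mapping a scenario ID to a score between 0 and 1. The scenario IDs, the `find_regressions` helper, and the 0.02 tolerance are hypothetical and not part of the EVA-Bench tooling.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return scenarios whose score dropped by more than `tolerance`.

    Both arguments map scenario id -> score. Scenarios missing from the
    candidate run are skipped rather than treated as failures.
    """
    regressions = {}
    for scenario_id, old_score in baseline.items():
        new_score = candidate.get(scenario_id)
        if new_score is None:
            continue
        if old_score - new_score > tolerance:
            regressions[scenario_id] = (old_score, new_score)
    return regressions


# Illustrative scores for the current model and an updated one.
baseline_run = {"hr-001": 0.90, "it-014": 0.80, "cs-007": 0.70}
candidate_run = {"hr-001": 0.91, "it-014": 0.60, "cs-007": 0.69}
regressions = find_regressions(baseline_run, candidate_run)
```

A small tolerance avoids flagging run-to-run noise, while a per-scenario view shows which workflows broke rather than only that an average moved.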
