EVA-Bench 2.0 Pits 12 Voice Models Against 213 Tasks
ServiceNow AI has released EVA-Bench Data 2.0, an open-source evaluation framework that tests conversational voice agents across 121 enterprise tools.
On June 4, ServiceNow AI released EVA-Bench Data 2.0, a large-scale open-source framework for benchmarking conversational voice agents. The release shifts the focus from simple text-based interactions toward complex, multi-turn enterprise workflows that rely on speech-to-speech (S2S) models and audio-native large language models.
Part of ServiceNow’s broader NOWAI-Bench initiative showcased at Knowledge 2026, the framework uses a bot-to-bot architecture to automate the evaluation of spoken dialogue. It is distributed via Hugging Face and GitHub under an open-source license.
Enterprise Workflows and Tool Use
The dataset is built around the specific failure modes of voice-first enterprise applications. It categorizes testing data into three core domains: HR, IT Support, and Customer Service.
Across these domains, the benchmark tests agents on 213 distinct, multi-turn dialogue scenarios. Completing them requires the agent to call 121 simulated enterprise tools, including ticket creation systems, hardware troubleshooting interfaces, and benefits lookup databases. For developers trying to evaluate and test AI agents, this provides a standardized sandbox for measuring autonomous tool execution in voice-driven environments.
Accuracy and Experience Metrics
Voice agents fail differently than text-based chatbots, requiring different telemetry. The research team, led by Hari Subramani, Issam Laradji, and Anitha Raghavan, split the framework into two fundamental measurement dimensions:
- EVA-A (Accuracy): Measures whether the agent successfully completed the requested task and remained faithful to the provided enterprise data during the conversation.
- EVA-X (Experience): Measures the naturalness, conciseness, and appropriateness of the spoken dialogue. This surfaces voice-specific problems such as robotic phrasing or interrupting the user at inappropriate moments.
To test models against real-world conditions, EVA-Bench 2.0 includes a perturbation suite. It applies audio stressors to the inputs, including diverse user accents, background noise, and simulated network degradation.
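As a rough illustration of what a background-noise perturbation involves, the sketch below mixes noise into a clean signal at a chosen signal-to-noise ratio. It is not taken from the EVA-Bench code: the function names `rms` and `mix_at_snr`, the synthetic tone, and the 10 dB setting are all assumptions made for this example.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to hit snr_db.

    Hypothetical helper, not part of EVA-Bench.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise level
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone as "speech", Gaussian noise as "background"
rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Sweeping `snr_db` downward is one simple way to see where an agent's transcription or tool-calling starts to break. Accent and network perturbations need different techniques and are not covered here.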
Architectural Comparisons
The initial dataset includes benchmark results for 12 leading systems. This testing highlights the performance differences between traditional cascade architectures (which chain separate speech-to-text, text-based LLM, and text-to-speech models) and newer audio-native frontier models like GPT-4o and Gemini 1.5 Pro.
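The structural difference between the two approaches can be sketched in a few lines. This is a minimal illustration only: the class names and the stub functions standing in for real speech and language models are hypothetical, not part of EVA-Bench or any vendor API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeAgent:
    """Three separate stages: audio -> text -> text -> audio."""
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def respond(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)       # transcription errors enter here
        reply_text = self.llm(transcript)  # the LLM only ever sees text
        return self.tts(reply_text)


@dataclass
class AudioNativeAgent:
    """One model consumes and produces audio directly."""
    model: Callable[[bytes], bytes]

    def respond(self, audio: bytes) -> bytes:
        return self.model(audio)


# Stub stages so the sketch runs; real systems would call actual models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return "ack: " + text


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = CascadeAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
native = AudioNativeAgent(model=lambda audio: b"ack: " + audio)
cascade_reply = cascade.respond(b"reset my password")
native_reply = native.respond(b"reset my password")
```

Because both classes expose the same `respond` method, a harness can drive either architecture with identical audio and compare the results.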
If you are deciding between GPT vs Claude vs Gemini for a real-time voice integration, the EVA-Bench results show how the tested systems handle complex tool-calling under varying levels of audio stress.
If you build voice-first enterprise applications, EVA-Bench 2.0 provides a reproducible way to catch regressions across model updates. You can run the testing suite locally using the project’s Python orchestration engine (supporting Python 3.11–3.13) or explore the scenario data directly through the Hugging Face dataset viewer before building your own evaluations.
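A regression check of this kind could look like the sketch below. It assumes each run produces a per-scenario score between 0 and 1 for each dimension. The scenario IDs, metric keys, score values, and the `find_regressions` helper are placeholders invented for this example and do not reflect EVA-Bench's actual output format or any published results.

```python
from typing import Dict, List, Tuple

# scenario id -> metric name -> score (assumed to lie in [0, 1])
Scores = Dict[str, Dict[str, float]]


def find_regressions(
    baseline: Scores, candidate: Scores, tolerance: float = 0.02
) -> List[Tuple[str, str, float]]:
    """List (scenario, metric, drop) where the candidate falls more than
    `tolerance` below the baseline, largest drop first."""
    regressions = []
    for scenario_id, base_metrics in baseline.items():
        cand_metrics = candidate.get(scenario_id, {})
        for metric, base_value in base_metrics.items():
            if metric not in cand_metrics:
                continue  # metric missing from the candidate run; skip it
            drop = base_value - cand_metrics[metric]
            if drop > tolerance:
                regressions.append((scenario_id, metric, round(drop, 4)))
    return sorted(regressions, key=lambda r: r[2], reverse=True)


# Placeholder scores for two made-up scenarios, before and after an update
baseline = {
    "hr_scenario_1": {"eva_a": 0.90, "eva_x": 0.80},
    "it_scenario_1": {"eva_a": 0.75, "eva_x": 0.85},
}
candidate = {
    "hr_scenario_1": {"eva_a": 0.70, "eva_x": 0.81},
    "it_scenario_1": {"eva_a": 0.76, "eva_x": 0.80},
}
regressions = find_regressions(baseline, candidate)
```

Tracking accuracy and experience separately matters here: a model update can improve task completion while making the conversation feel worse, or the reverse.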