ServiceNow Introduces SWER to Benchmark ASR Code-Switching
ServiceNow AI released a Hugging Face dataset evaluating frontier speech models on bilingual code-switching, introducing the Switch Word Error Rate metric.
On June 9, 2026, ServiceNow AI published the ServiceNow Code-Switching Benchmark on Hugging Face. This new dataset and evaluation framework tests how frontier Automatic Speech Recognition (ASR) models process conversations that switch languages mid-sentence. The research targets a persistent gap in enterprise voice applications, where bilingual customer populations frequently blend languages like Spanish and English or Hindi and English.
Voice AI evaluation traditionally relies on Word Error Rate to measure transcription accuracy. While frontier models handle distinct monolingual streams well, ServiceNow found that switching languages within a single phonetic sequence breaks current tokenization and acoustic mapping strategies.
New Metrics for Bilingual Speech
To measure this specific failure mode, the research introduces two granular metrics alongside standard WER (Word Error Rate):
- SWER (Switch Word Error Rate): Measures transcription errors at the boundary where the speaker switches languages.
- AER (Accent Error Rate): Isolates performance drops caused by regional phonetic variations in code-switched contexts.
| Metric | Target Measurement | Primary Enterprise Use Case |
|---|---|---|
| WER | Overall transcription accuracy | Baseline performance tracking |
| SWER | Accuracy at the language transition point | Bilingual contact center evaluation |
| AER | Robustness to regional phonetic shifts | Global IT helpdesk deployment |
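To make the difference between WER and SWER concrete, here is a minimal sketch in Python. The article does not give the benchmark's exact SWER formula, so this is an assumption: the sketch treats the reference words on either side of each language change as "switch words" and reports the share of them that were substituted or deleted. The per-word language tags, the function names, and the sample sentence are all hypothetical.

```python
from typing import List


def _edit_table(ref: List[str], hyp: List[str]) -> List[List[int]]:
    """Word-level Levenshtein table: t[i][j] = distance(ref[:i], hyp[:j])."""
    n, m = len(ref), len(hyp)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        t[i][0] = i
    for j in range(m + 1):
        t[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            t[i][j] = min(t[i - 1][j - 1] + cost,  # match or substitution
                          t[i - 1][j] + 1,         # deletion
                          t[i][j - 1] + 1)         # insertion
    return t


def wer(ref: List[str], hyp: List[str]) -> float:
    """Standard WER: edit distance divided by reference length."""
    return _edit_table(ref, hyp)[len(ref)][len(hyp)] / len(ref)


def ref_word_errors(ref: List[str], hyp: List[str]) -> List[bool]:
    """Flag each reference word that is substituted or deleted in one
    minimum-cost alignment (found by backtracing the edit table)."""
    t = _edit_table(ref, hyp)
    errors = [False] * len(ref)
    i, j = len(ref), len(hyp)
    while i > 0:
        mismatch = j > 0 and ref[i - 1] != hyp[j - 1]
        if j > 0 and t[i][j] == t[i - 1][j - 1] + int(mismatch):
            errors[i - 1] = mismatch      # substitution if the words differ
            i, j = i - 1, j - 1
        elif t[i][j] == t[i - 1][j] + 1:
            errors[i - 1] = True          # deletion
            i -= 1
        else:
            j -= 1                        # insertion: no reference word consumed
    return errors


def swer(ref: List[str], langs: List[str], hyp: List[str]) -> float:
    """ASSUMED definition: error rate over reference words adjacent to a
    language change. `langs` holds one language tag per reference word."""
    switch = set()
    for k in range(1, len(langs)):
        if langs[k] != langs[k - 1]:
            switch.update((k - 1, k))
    if not switch:
        return 0.0
    errors = ref_word_errors(ref, hyp)
    return sum(errors[k] for k in switch) / len(switch)


# Hypothetical example: one Spanish word inside an English sentence,
# transcribed as a similar-sounding English word.
ref = "i need to reset my contraseña before the meeting".split()
langs = ["en", "en", "en", "en", "en", "es", "en", "en", "en"]
hyp = "i need to reset my contrasting before the meeting".split()

print(f"WER={wer(ref, hyp):.3f} SWER={swer(ref, langs, hyp):.3f}")
# prints: WER=0.111 SWER=0.333
```

Under this assumed definition, a single error at the switch point barely moves WER (1 of 9 words) but accounts for a third of the switch-adjacent words, which is the kind of gap a boundary-focused metric is meant to expose.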
The Switching Penalty and Hallucinations
ServiceNow tested several frontier ASR models, including versions of OpenAI’s Whisper and proprietary enterprise systems. The benchmark revealed a severe switching penalty across all tested architectures. Even top-tier models experience a sharp spike in SWER at the moment of language transition.
Instead of recognizing a foreign word, models frequently hallucinate or force-map the audio into the primary language’s phonemes. For example, a Spanish word inserted into an English sentence is often transcribed as a phonetically similar but misspelled English word. This processing error adds latency and forces downstream natural language understanding systems to parse garbled text. If you build real-time voice agents, this hallucination penalty directly impacts your latency budget and intent routing accuracy.
Hinglish vs Spanglish Performance
The benchmark focused heavily on Spanish-English (Spanglish) and Hindi-English (Hinglish) datasets. Performance on Hinglish trailed significantly behind Spanglish. The research attributes this gap to the diverse phonetic structures and scripts involved in Hindi-English transitions. While models are increasingly multilingual, they lack the inter-lingual training required to parse distinct phonetic systems colliding in the same audio frame. Developers looking to run multilingual edge ASR will need to account for this structural performance gap in their deployment regions.
Integration With Now Assist
This ASR research follows the May 2026 launch of the ServiceNow Action Fabric, which opened the company’s platform to agentic workflows. By benchmarking bilingual transcription, ServiceNow is defining the requirements for enterprise-grade speech inputs to its Now Assist generative AI platform. Accurate parsing of code-switched speech is a prerequisite for routing autonomous workforce operations in global call centers. Before an agent can execute a workflow, it must accurately transcribe the intent of a caller who naturally switches between languages.
For teams deploying voice infrastructure and evaluating AI agents, standard WER is no longer a sufficient evaluation metric for global user bases. You should incorporate SWER tracking into your pipeline to identify where language transitions degrade intent recognition. Building custom acoustic models or fine-tuning existing systems on code-switched datasets will be necessary to prevent phonetic hallucinations from breaking downstream agentic workflows.
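One way to add SWER tracking to an evaluation pipeline is to aggregate error counts per language pair and flag pairs where SWER exceeds WER by more than a chosen margin. The sketch below assumes per-utterance counts have already been computed elsewhere. The `UtteranceScore` record, the `max_penalty` threshold, the pair labels, and the sample numbers are all hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class UtteranceScore:
    """Hypothetical per-utterance record produced by an upstream scorer."""
    pair: str           # language-pair label, e.g. "pair_a"
    word_errors: int    # word-level edit errors in the utterance
    word_count: int     # reference words in the utterance
    switch_errors: int  # errors on words adjacent to a language switch
    switch_count: int   # reference words adjacent to a language switch


def summarize(scores: Iterable[UtteranceScore]) -> Dict[str, Dict[str, float]]:
    """Pool counts per language pair, then compute WER, SWER, and their gap."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for s in scores:
        t = totals[s.pair]
        t[0] += s.word_errors
        t[1] += s.word_count
        t[2] += s.switch_errors
        t[3] += s.switch_count
    report = {}
    for pair, (we, wc, se, sc) in totals.items():
        pair_wer = we / wc if wc else 0.0
        pair_swer = se / sc if sc else 0.0
        report[pair] = {"wer": pair_wer, "swer": pair_swer,
                        "penalty": pair_swer - pair_wer}
    return report


def flag_pairs(report: Dict[str, Dict[str, float]],
               max_penalty: float = 0.10) -> List[str]:
    """Return pairs whose SWER exceeds WER by more than max_penalty
    (the 0.10 default is an arbitrary illustrative threshold)."""
    return sorted(p for p, r in report.items() if r["penalty"] > max_penalty)


# Made-up counts for illustration only.
scores = [
    UtteranceScore("pair_a", 3, 40, 2, 8),
    UtteranceScore("pair_a", 1, 20, 1, 4),
    UtteranceScore("pair_b", 2, 50, 1, 10),
]
report = summarize(scores)
for pair in flag_pairs(report):
    r = report[pair]
    print(f"{pair}: WER={r['wer']:.3f} SWER={r['swer']:.3f}")
```

Pooling counts before dividing weights each pair's rate by utterance length, so a few short utterances do not dominate the result. Averaging per-utterance rates would be a different design choice with different behavior.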