
ServiceNow Introduces SWER to Benchmark ASR Code-Switching

ServiceNow AI released a Hugging Face dataset evaluating frontier speech models on bilingual code-switching, introducing the Switch Word Error Rate metric.

On June 9, 2026, ServiceNow AI published the ServiceNow Code-Switching Benchmark on Hugging Face. This new dataset and evaluation framework tests how frontier Automatic Speech Recognition (ASR) models process conversations that switch languages mid-sentence. The research targets a persistent gap in enterprise voice applications, where bilingual customer populations frequently blend languages like Spanish and English or Hindi and English.

Voice AI evaluation traditionally relies on Word Error Rate to measure transcription accuracy. While frontier models handle distinct monolingual streams well, ServiceNow found that switching languages within a single phonetic sequence breaks current tokenization and acoustic mapping strategies.

New Metrics for Bilingual Speech

To measure this specific failure mode, the research introduces two granular metrics alongside the standard Word Error Rate (WER):

  • SWER (Switch Word Error Rate): Calculates transcription accuracy specifically at the exact boundary where the speaker switches languages.
  • AER (Accent Error Rate): Isolates performance drops caused by regional phonetic variations in code-switched contexts.

Metric | Target Measurement | Primary Enterprise Use Case
WER | Overall transcription accuracy | Baseline performance tracking
SWER | Accuracy at the language transition point | Bilingual contact center evaluation
AER | Robustness to regional phonetic shifts | Global IT helpdesk deployment
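
To make the difference between WER and SWER concrete, here is a minimal Python sketch. The article does not give the benchmark's exact SWER formula, so this assumes one plausible reading: a "switch word" is the first reference word after the language changes, and SWER is the share of those words that the model failed to transcribe correctly. The function names (`align_words`, `wer_and_swer`) and the example sentence are hypothetical and are not taken from the ServiceNow benchmark.

```python
from typing import List, Optional, Tuple


def align_words(ref: List[str], hyp: List[str]) -> Tuple[List[bool], int]:
    """Levenshtein-align hyp to ref at the word level.

    Returns (correct, distance): correct[i] is True when reference word i
    was matched exactly; distance is the total number of edits.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Backtrace to find which reference words were matched exactly.
    correct = [False] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            correct[i - 1] = cost == 0
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # reference word was deleted
        else:
            j -= 1  # extra word in the hypothesis
    return correct, dist[n][m]


def wer_and_swer(
    ref: List[str], ref_langs: List[str], hyp: List[str]
) -> Tuple[float, Optional[float]]:
    """Compute WER and an ASSUMED SWER (error rate on switch words only)."""
    correct, distance = align_words(ref, hyp)
    wer = distance / len(ref)
    # Assumption: a switch word is the first word after the language changes.
    switch_idx = [i for i in range(1, len(ref)) if ref_langs[i] != ref_langs[i - 1]]
    if not switch_idx:
        return wer, None  # no language switch in this utterance
    swer = sum(not correct[i] for i in switch_idx) / len(switch_idx)
    return wer, swer


# Made-up example: an English sentence with an embedded Spanish phrase.
ref = "please check mi cuenta before friday".split()
langs = ["en", "en", "es", "es", "en", "en"]
hyp = "please check me quinta before friday".split()
wer, swer = wer_and_swer(ref, langs, hyp)
print(f"WER={wer:.2f} SWER={swer:.2f}")
```

In this toy case both errors fall on the Spanish phrase, so the switch-word error rate comes out higher than the overall WER. That is the pattern a boundary-focused metric is meant to expose.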

The Switching Penalty and Hallucinations

ServiceNow tested several frontier ASR models, including versions of OpenAI’s Whisper and proprietary enterprise systems. The benchmark revealed a severe switching penalty across all tested architectures. Even top-tier models experience a sharp spike in SWER at the moment of language transition.

Instead of recognizing a foreign word, models frequently hallucinate or force-map the audio into the primary language’s phonemes. For example, a Spanish word inserted into an English sentence is often transcribed as a phonetically similar but misspelled English word. This processing error adds latency and forces downstream natural language understanding systems to parse garbled text. If you build real-time voice agents, this hallucination penalty directly impacts your latency budget and intent routing accuracy.

Hinglish vs Spanglish Performance

The benchmark focused heavily on Spanish-English (Spanglish) and Hindi-English (Hinglish) datasets. Performance on Hinglish trailed significantly behind Spanglish. The research attributes this gap to the diverse phonetic structures and scripts involved in Hindi-English transitions. While models are increasingly multilingual, they lack the inter-lingual training required to parse distinct phonetic systems colliding in the same audio frame. Developers looking to run multilingual edge ASR will need to account for this structural performance gap in their deployment regions.
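
Because Hindi-English transcripts can mix Devanagari and Latin script, script detection is one rough way to locate switch points in a reference transcript. The sketch below is an illustration under stated assumptions, not the benchmark's method: the helper names (`tag_script`, `switch_points`) are hypothetical, and the approach misses romanized Hindi, which is written in Latin script and would need a real language-identification step.

```python
from typing import List


def tag_script(word: str) -> str:
    """Tag a word by script: 'hi' for Devanagari, 'en' for ASCII Latin letters."""
    # The Unicode Devanagari block spans U+0900 to U+097F.
    if any("\u0900" <= ch <= "\u097f" for ch in word):
        return "hi"
    if any(ch.isascii() and ch.isalpha() for ch in word):
        return "en"
    return "other"  # digits, punctuation, or other scripts


def switch_points(words: List[str]) -> List[int]:
    """Return indexes of words whose script tag differs from the previous word."""
    tags = [tag_script(w) for w in words]
    return [i for i in range(1, len(tags)) if tags[i] != tags[i - 1]]


# Made-up Hinglish example: Hindi framing around an English noun phrase.
words = "मुझे password reset करना है".split()
print([tag_script(w) for w in words])  # script tag per word
print(switch_points(words))            # positions where the script changes
```

The resulting indexes can serve as the per-word language tags that a boundary-focused error metric needs, provided the reference transcripts keep each language in its native script.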

Integration With Now Assist

This ASR research follows the May 2026 launch of the ServiceNow Action Fabric, which opened the company’s platform to agentic workflows. By benchmarking bilingual transcription, ServiceNow is defining the requirements for enterprise-grade speech inputs to its Now Assist generative AI platform. Accurate parsing of code-switched speech is a prerequisite for routing autonomous workforce operations in global call centers. Before an agent can execute a workflow, it must accurately transcribe the intent of a caller who naturally switches between languages.

For teams deploying voice infrastructure and evaluating AI agents, standard WER is no longer a sufficient evaluation metric for global user bases. You should incorporate SWER tracking into your pipeline to identify where language transitions degrade intent recognition. Building custom acoustic models or fine-tuning existing systems on code-switched datasets will be necessary to prevent phonetic hallucinations from breaking downstream agentic workflows.
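
One way to track this in an evaluation pipeline is to aggregate per-utterance error counts by language pair and report the gap between switch-point error rate and overall WER. The sketch below assumes you already have those counts from your own scoring step. The `UtteranceScore` type, the `summarize` function, and the "penalty" defined as SWER minus WER are illustrative choices, and the numbers are placeholders rather than benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UtteranceScore:
    pair: str           # language pair label, e.g. "es-en"
    word_errors: int    # total edits against the reference
    ref_words: int      # reference word count
    switch_errors: int  # errors on words at a language switch
    switch_words: int   # reference words at a language switch


def summarize(scores: List[UtteranceScore]) -> Dict[str, Dict[str, float]]:
    """Pool counts per language pair, then compute WER, SWER, and their gap."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for s in scores:
        t = totals[s.pair]
        t[0] += s.word_errors
        t[1] += s.ref_words
        t[2] += s.switch_errors
        t[3] += s.switch_words
    report = {}
    for pair, (we, rw, se, sw) in totals.items():
        if rw == 0 or sw == 0:
            continue  # skip pairs with no words or no switch points
        wer, swer = we / rw, se / sw
        # "penalty" here is an assumed definition: SWER minus overall WER.
        report[pair] = {"wer": wer, "swer": swer, "penalty": swer - wer}
    return report


# Placeholder counts for illustration only.
scores = [
    UtteranceScore("es-en", word_errors=1, ref_words=10, switch_errors=1, switch_words=2),
    UtteranceScore("es-en", word_errors=1, ref_words=10, switch_errors=0, switch_words=2),
    UtteranceScore("hi-en", word_errors=2, ref_words=10, switch_errors=1, switch_words=2),
]
for pair, row in summarize(scores).items():
    print(pair, {k: round(v, 2) for k, v in row.items()})
```

Pooling counts before dividing keeps long utterances from being underweighted, and a per-pair penalty makes it easier to see which deployment regions need code-switched fine-tuning data first.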
