Outpacing Whisper: Cohere Transcribe Hits Top ASR Speed
Cohere Transcribe is a new open-weights speech-to-text model that tops the Open ASR Leaderboard while running roughly 3x faster than Whisper.
Cohere has expanded its model portfolio into audio intelligence with the release of Cohere Transcribe, a 2-billion-parameter speech-to-text model aimed at enterprise-grade, real-time transcription under an Apache 2.0 license. For developers building voice-driven applications, the release shifts the performance baseline set by older models like Whisper: the system reaches an inverse real-time factor (RTFx) of up to 524 on high-end hardware, meaning it transcribes audio up to 524 times faster than it plays, roughly three times the speed of Whisper-v3-large.
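RTFx is simply audio duration divided by processing time, so the headline number translates directly into wall-clock estimates. A quick back-of-the-envelope sketch of what 524x implies:

```python
# RTFx (inverse real-time factor) = audio duration / processing time.
# At RTFx = 524, one hour of audio takes under seven seconds to transcribe.

def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time to transcribe `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx

one_hour = 3600.0
print(round(processing_seconds(one_hour, 524.0), 1))  # -> 6.9
```

The same arithmetic puts Whisper-v3-large, at roughly a third of that throughput, around 20 seconds for the same hour of audio.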
Architecture and Training
Cohere Transcribe uses a 2B-parameter encoder-decoder transformer with cross-attention. To optimize for low-latency inference, the developers allocated over 90% of the parameters to a Fast-Conformer encoder and kept the decoder lightweight. Because autoregressive token generation in the decoder dominates transcription latency, this asymmetry speeds up output while the large encoder preserves accuracy.
Cohere trained the model from scratch with a supervised cross-entropy objective, avoiding the distillation techniques common in recent lightweight audio models. The training data focused strictly on enterprise-critical languages to maximize performance in business contexts.
Benchmark Results
The model currently ranks first for English on the Hugging Face Open ASR Leaderboard, averaging a 5.42% Word Error Rate (WER) across eight industry benchmarks. Performance on clean read speech is highly accurate, while the AMI meeting benchmark shows the expected degradation on noisy, conversational real-world audio.
| Benchmark | Word Error Rate (WER) |
|---|---|
| English Average (8 benchmarks) | 5.42% |
| LibriSpeech Clean | 1.25% |
| AMI Meeting Transcription | 8.15% |
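WER is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of the metric (leaderboard scoring additionally applies careful text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # -> 0.167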
Deployment and Integration
If you run models locally, you can deploy Cohere Transcribe directly on edge devices and laptops; it does not require a cloud API or server-grade GPUs for basic operation. For production environments, the model ships with native vLLM support for high-throughput serving, and the Hugging Face transformers library supports it natively from version 5.4.0 onward.
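For production serving, the workflow would follow vLLM's usual OpenAI-compatible pattern. The commands below are a sketch: the model id is illustrative (check the actual Hugging Face repository name), and they assume the model exposes vLLM's transcription endpoint:

```shell
# Start vLLM's OpenAI-compatible server (model id is illustrative)
vllm serve CohereLabs/cohere-transcribe

# Send an audio file to the transcription endpoint
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@meeting.wav \
  -F model=CohereLabs/cohere-transcribe
```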
The architecture handles long-form audio natively. If you build real-time voice agents, the system automatically chunks audio inputs longer than 35 seconds. It then reassembles the segments into a unified transcript without manual intervention. The model is available immediately on Hugging Face and Microsoft Azure Foundry.
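The model's internal segmentation is opaque, but the behavior resembles the familiar chunk-and-stitch pattern. A toy sketch with a stand-in transcribe function (fixed 35-second windows, no overlap handling; the real pipeline is more sophisticated):

```python
from typing import Callable, List

CHUNK_SECONDS = 35.0  # matches the article's stated chunking threshold

def transcribe_long(samples: List[float], sample_rate: int,
                    transcribe: Callable[[List[float]], str]) -> str:
    """Split audio into <=35 s windows, transcribe each, and stitch the text."""
    window = int(CHUNK_SECONDS * sample_rate)
    pieces = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        pieces.append(transcribe(chunk))
    return " ".join(p for p in pieces if p)

# Usage with a fake transcriber: 90 s of 16 kHz audio -> three chunks
fake = lambda chunk: f"[{len(chunk)} samples]"
audio = [0.0] * (90 * 16000)
print(transcribe_long(audio, 16000, fake))
# -> [560000 samples] [560000 samples] [320000 samples]
```

Cohere Transcribe performs this segmentation and reassembly internally, so callers pass long audio in one request.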
Current Limitations
The initial v03-2026 release omits several common ASR features. You must specify the input language manually, as the model lacks automatic language detection. It supports 14 languages: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean. The architecture also lacks native speaker diarization and timestamping capabilities.
If your application requires raw transcription speed and offline deployment, Cohere Transcribe offers a compelling upgrade path. The Apache 2.0 license removes commercial usage barriers. Evaluate your pipeline requirements carefully, as the lack of native diarization and timestamping will require secondary processing layers for complex meeting analysis workflows.