
Outpacing Whisper: Cohere Transcribe Hits Top ASR Speed

Experience enterprise-grade audio intelligence with Cohere Transcribe, a new open-weights model topping the ASR leaderboard with 3x faster speeds than Whisper.

Cohere has expanded its model portfolio into audio intelligence with the release of Cohere Transcribe, a 2-billion-parameter speech-to-text model targeting enterprise-grade, real-time transcription under an Apache 2.0 license. For developers building voice-driven applications, the release shifts the performance baseline set by older models like Whisper: the system achieves a Real-Time Factor (RTFx) of up to 524x on high-end hardware, roughly three times faster than Whisper-v3-large.
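RTFx is simply audio duration divided by wall-clock processing time. A minimal sketch (the helper name and the timing figures in the example are mine, not Cohere's):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-Time Factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds

# At roughly 524x, one hour of audio takes under seven seconds to transcribe.
print(rtfx(3600.0, 6.87))  # ≈ 524
```

A higher RTFx directly translates into lower serving cost per audio hour, which is why the metric features prominently on ASR leaderboards.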

Architecture and Training

Cohere Transcribe uses a 2B-parameter encoder-decoder transformer with cross-attention. To optimize for low-latency inference, the developers allocated over 90% of the parameters to a Fast Conformer encoder, leaving the decoder lightweight. This asymmetry speeds up transcription generation while maintaining high accuracy.

The company trained this model from scratch using supervised cross-entropy. They avoided distillation techniques common in recent lightweight audio models. The training data focused strictly on enterprise-critical languages to maximize performance in business contexts.

Benchmark Results

The model currently ranks first for English ASR on the Hugging Face Open ASR Leaderboard, averaging a 5.42% Word Error Rate (WER) across eight industry benchmarks. Performance on clean audio is highly accurate, while meeting audio shows the expected real-world degradation.

| Benchmark | Word Error Rate (WER) |
| --- | --- |
| English Average (8 benchmarks) | 5.42% |
| LibriSpeech Clean | 1.25% |
| AMI Meeting Transcription | 8.15% |
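WER counts word-level substitutions, insertions, and deletions, normalized by the reference length. A minimal implementation via word-level edit distance (a sketch for intuition, not Cohere's or the leaderboard's evaluation harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Note that real evaluation pipelines also normalize text (casing, punctuation, number formats) before scoring, which can shift WER noticeably.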

Deployment and Integration

If you run models locally, you can deploy Cohere Transcribe directly on edge devices and laptops; it does not require a cloud API or server-grade GPUs for basic operation. For production environments, the model ships with native vLLM support for high-throughput serving, and the Hugging Face transformers library supports it natively from version 5.4.0 onward.

The architecture handles long-form audio natively. If you build real-time voice agents, the system automatically chunks audio inputs longer than 35 seconds. It then reassembles the segments into a unified transcript without manual intervention. The model is available immediately on Hugging Face and Microsoft Azure Foundry.
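The 35-second windowing described above can be sketched as a simple boundary calculation (the function name and fixed, non-overlapping windows are my assumptions for illustration; the model's internal chunk-and-merge logic is not documented here):

```python
def chunk_bounds(total_seconds: float, window: float = 35.0):
    """Yield (start, end) offsets in seconds covering the clip in <= window-sized pieces."""
    start = 0.0
    while start < total_seconds:
        end = min(start + window, total_seconds)
        yield (start, end)
        start = end

# An 80-second clip splits into three segments, the last one shorter.
print(list(chunk_bounds(80.0)))  # [(0.0, 35.0), (35.0, 70.0), (70.0, 80.0)]
```

Because the model handles this internally, application code normally only needs such logic when pre-segmenting audio for custom batching or streaming setups.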

Current Limitations

The initial v03-2026 release omits several common ASR features. You must specify the input language manually, as the model lacks automatic language detection. It supports 14 languages: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, and Korean. The architecture also lacks native speaker diarization and timestamping capabilities.

If your application requires raw transcription speed and offline deployment, Cohere Transcribe offers a compelling upgrade path. The Apache 2.0 license removes commercial usage barriers. Evaluate your pipeline requirements carefully, as the lack of native diarization and timestamping will require secondary processing layers for complex meeting analysis workflows.
