Cohere Transcribe debuts as open-source ASR model
Cohere Transcribe launches as a 2B open-source speech-to-text model with 14-language support, self-hosting, and vLLM serving.
Cohere released Cohere Transcribe on March 26, a 2B-parameter open-source speech-to-text (ASR) model licensed under Apache 2.0, with weights at CohereLabs/cohere-transcribe-03-2026. If you need self-hosted transcription, this is the important part: the model is narrowly scoped, production-oriented, and already packaged for both local inference and OpenAI-compatible serving.
Model scope
Cohere Transcribe is a dedicated automatic speech recognition model, audio in and text out. It supports 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese, Japanese, and Korean.
The architecture is a Conformer encoder with a lightweight Transformer decoder. During preprocessing, input audio is resampled to 16 kHz when needed, stereo is averaged to mono, and the result is converted to a mel-spectrogram.
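A minimal sketch of that normalization step, assuming float PCM input. The mel-spectrogram itself is handled by the model's feature extractor, and a production pipeline would use a proper resampler (e.g. soxr or torchaudio) rather than the linear interpolation shown here:

```python
import numpy as np

TARGET_SR = 16_000  # the model expects 16 kHz input

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """Mimic the documented preprocessing: stereo -> mono, resample to 16 kHz.

    `audio` holds float samples, shape (n,) for mono or (channels, n) for
    stereo. This sketch covers only channel and sample-rate normalization.
    """
    if audio.ndim == 2:                      # stereo: average channels to mono
        audio = audio.mean(axis=0)
    if sr != TARGET_SR:                      # naive linear-interpolation resample
        n_out = int(round(len(audio) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)

# Two seconds of 44.1 kHz stereo noise become two seconds of 16 kHz mono.
stereo = np.random.randn(2, 88_200)
mono16k = preprocess(stereo, 44_100)
print(mono16k.shape)  # (32000,)
```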
This is a focused release. You do not get a general voice stack, speech synthesis, diarization, or timeline metadata. You get transcription.
Long-form transcription behavior
Cohere built the model for recordings that exceed a single short utterance. Audio longer than 35 seconds is split into overlapping chunks, transcribed, and reassembled automatically by model.transcribe(), and the published usage examples include a 55-minute earnings call.
For developers, this matters more than the model size headline. Long-form chunking is usually where ASR pipelines become operationally messy, especially once you start handling meetings, support calls, and uploaded media at scale. A model that exposes a single transcription method reduces custom orchestration, even if you still need upstream VAD and downstream post-processing.
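The release describes this chunking as internal to model.transcribe(); the 35-second window comes from the announcement, while the overlap length and the reassembly strategy are not specified, so both are assumptions here. A minimal sketch of the windowing shape:

```python
import numpy as np

SR = 16_000
CHUNK_S = 35       # audio longer than 35 s is split, per the release notes
OVERLAP_S = 5      # overlap width is an assumption for this sketch

def chunk_audio(audio: np.ndarray, sr: int = SR,
                chunk_s: int = CHUNK_S, overlap_s: int = OVERLAP_S):
    """Split long audio into overlapping windows, roughly as model.transcribe()
    does internally. Returns a list of (start_sample, window) pairs."""
    chunk, overlap = chunk_s * sr, overlap_s * sr
    if len(audio) <= chunk:
        return [(0, audio)]
    step = chunk - overlap
    starts = range(0, len(audio) - overlap, step)
    return [(s, audio[s:s + chunk]) for s in starts]

# A 90-second clip yields three overlapping windows: two full 35 s windows
# and a final 30 s tail.
audio = np.zeros(90 * SR, dtype=np.float32)
chunks = chunk_audio(audio)
print([(s // SR, len(w) // SR) for s, w in chunks])  # [(0, 35), (30, 35), (60, 30)]
```

Reassembly (deduplicating text in the overlap regions) is the harder half of the problem, which is exactly why having it inside one library call is valuable.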
Deployment paths
Cohere Transcribe supports two clear serving modes: local inference with Transformers and production serving with vLLM.
| Deployment path | Details |
|---|---|
| Transformers | Recommended for local or offline inference |
| vLLM | Recommended for production serving |
| API shape | OpenAI-compatible /v1/audio/transcriptions when served through vLLM |
The vLLM path is the practical story here. If your stack already standardizes on OpenAI-style clients, Cohere Transcribe can slot into the same interface pattern you use elsewhere for model serving. That lowers integration cost more than a bespoke audio API would.
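Assuming a vLLM server is already running for the model (the host and port below are illustrative, not from the release), the endpoint is addressed like any OpenAI-style transcription API; with the official `openai` client the call is `client.audio.transcriptions.create(model=..., file=..., language=...)`. This sketch only assembles the request pieces, without sending anything:

```python
from urllib.parse import urljoin

def transcription_request(base_url: str, model: str, language: str):
    """Build the URL and form fields for an OpenAI-compatible transcription
    call. The audio file itself is sent as a multipart form part."""
    url = urljoin(base_url, "/v1/audio/transcriptions")
    form = {"model": model, "language": language}
    return url, form

url, form = transcription_request(
    "http://localhost:8000", "CohereLabs/cohere-transcribe-03-2026", "en")
print(url)  # http://localhost:8000/v1/audio/transcriptions
```

Because the path and field names match the OpenAI audio API, the same client code can point at a hosted provider in development and at your own vLLM deployment in production.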
Cohere also lists Transcribe in Model Vault as a supported self-serve ASR model, which places it directly inside the company’s enterprise deployment catalog. If you are already thinking about isolated deployments, retention controls, and internal governance, this aligns with the same buying pattern as other enterprise-hosted models and with broader work on building enterprise AI on your own data.
Performance claims
Cohere’s headline numbers are strong enough to get attention from anyone evaluating open ASR.
| Metric | Cohere claim |
|---|---|
| Parameters | 2B |
| Average WER | 5.42% |
| Human-eval win rate | 61% on accuracy, coherence, and usability |
| Throughput | 525 minutes of audio processed per minute |
| Relative speed claim | Up to 3x faster real-time factor than dedicated ASR models in the same size range |
The named comparison set includes Zoom Scribe v1, IBM Granite 4.0 1B, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B Speech. Cohere also notes weaker results in Portuguese, German, and Spanish than its overall average ranking would suggest.
If you run multilingual transcription, do not treat the aggregate WER as the only number that matters. Per-language variance is often the difference between a viable call-center deployment and a support burden. The same evaluation discipline used for testing AI systems applies here, especially when your traffic is not English-dominant.
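To make the per-language check concrete, here is a dependency-free WER sketch; in practice a library such as jiwer, plus consistent text normalization, is the better choice, and the sample data below is invented:

```python
from collections import defaultdict
from statistics import mean

def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    prev_row = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur_row = [i]
        for j, hw in enumerate(h, 1):
            cur_row.append(min(prev_row[j] + 1,               # deletion
                               cur_row[j - 1] + 1,            # insertion
                               prev_row[j - 1] + (rw != hw))) # substitution
        prev_row = cur_row
    return prev_row[-1] / max(len(r), 1)

samples = [  # (language, reference, hypothesis) from your own eval set
    ("en", "the call starts now", "the call starts now"),
    ("de", "guten morgen zusammen", "guten morgen"),
]

per_lang = defaultdict(list)
for lang, ref, hyp in samples:
    per_lang[lang].append(wer(ref, hyp))
report = {lang: round(mean(v), 3) for lang, v in per_lang.items()}
print(report)  # {'en': 0.0, 'de': 0.333}
```

Segmenting the report by language (and by noise profile) surfaces exactly the per-language variance the aggregate number hides.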
Product constraints
Cohere is explicit about the model’s limits, and those limits define where it fits.
| Constraint | Impact |
|---|---|
| Single language per request | You must specify an ISO 639-1 language code |
| No language detection | You need routing logic upstream |
| No timestamps | Subtitle, search, and clip-indexing pipelines need extra tooling |
| No speaker diarization | Meeting transcripts need a separate diarization stage |
| Inconsistent code-switching performance | Mixed-language audio needs careful evaluation |
| Can over-transcribe non-speech | Use noise gating or VAD upstream |
This shapes the integration strategy. If you are building meeting intelligence, this model is one component, not the whole stack. If you are building batch transcription for known-language audio, it is much closer to a drop-in engine.
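Since the model does no language detection, every request needs an upstream routing step. A minimal sketch, assuming a separate language-ID stage supplies a detected code (the normalization and fallback policy here are assumptions, not part of the release):

```python
# The 14 supported languages as ISO 639-1 codes, per the model card.
SUPPORTED = {"en", "de", "fr", "it", "es", "pt", "el", "nl",
             "pl", "ar", "vi", "zh", "ja", "ko"}

def route_language(detected: str, default: str = "en") -> str:
    """Pick the language code to send with a transcription request.

    `detected` comes from an upstream language-ID stage (e.g. a LID model
    run on the first seconds of audio); the model itself does no detection.
    """
    code = detected.lower().split("-")[0]   # normalize e.g. "pt-BR" -> "pt"
    return code if code in SUPPORTED else default

print(route_language("pt-BR"))  # pt
print(route_language("sv"))     # en  (unsupported -> fallback)
```

For mixed-language traffic, the fallback branch is also where you would route to a different engine rather than silently transcribing in the wrong language.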
The missing timestamps and diarization are especially important. Without them, downstream retrieval, call analytics, and agent memory pipelines become harder to structure. If your application indexes transcripts for later retrieval, your chunking and metadata strategy starts to resemble the same tradeoffs you manage in RAG systems, only with audio-derived text as the source.
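Without timestamps, a retrieval pipeline has to carry its own positional metadata. One sketch, using word offsets as a stand-in for time; the chunk sizes and field names are illustrative:

```python
def chunk_transcript(text: str, call_id: str, language: str,
                     size: int = 80, overlap: int = 20):
    """Split a flat transcript into overlapping word windows with metadata.
    With no model timestamps, position is tracked by word offset instead."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append({
            "call_id": call_id,
            "language": language,
            "word_offset": start,          # stand-in for a missing timestamp
            "text": " ".join(words[start:start + size]),
        })
    return chunks

# A 200-word transcript becomes three overlapping 80-word chunks.
chunks = chunk_transcript("word " * 200, "call-42", "en")
print(len(chunks))  # 3
```

If you later add a timestamp-enrichment stage (e.g. forced alignment), the `word_offset` field is the natural join key back to real timings.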
Competitive position
The release stands out because it combines open weights, a compact 2B footprint, Apache 2.0 licensing, documented vLLM serving, and immediate enterprise packaging through Model Vault. Plenty of ASR models give you one or two of those properties. Fewer give you all of them together.
This is where the model is most useful: self-hosted transcription where data handling matters, latency matters, and you want a deployment path that looks like the rest of your model infrastructure instead of a separate voice-specific stack.
If you are evaluating transcription today, test Cohere Transcribe on your real audio, segmented by language and noise profile, and plan for VAD, diarization, and timestamp enrichment as separate services rather than assuming the base model will cover them.