Cohere Transcribe debuts as open-source ASR model
Cohere Transcribe launches as a 2B open-source speech-to-text model with 14-language support, self-hosting, and vLLM serving.
Cohere released Cohere Transcribe on March 26, a 2B-parameter open-source speech-to-text (ASR) model licensed under Apache 2.0, with weights at CohereLabs/cohere-transcribe-03-2026. If you need self-hosted transcription, this is the important part: the model is narrowly scoped, production-oriented, and already packaged for both local inference and OpenAI-compatible serving.
Model scope
Cohere Transcribe is a dedicated automatic speech recognition model, audio in and text out. It supports 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese, Japanese, and Korean.
The architecture is a Conformer encoder with a lightweight Transformer decoder. During preprocessing, input audio is resampled to 16 kHz when needed, stereo is averaged to mono, and the result is converted to a mel-spectrogram.
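A minimal sketch of that normalization step, assuming float PCM input. The mel-spectrogram itself is handled by the model's feature extractor, and a production pipeline would use a proper resampler (e.g. soxr or torchaudio) rather than the linear interpolation shown here:

```python
import numpy as np

TARGET_SR = 16_000  # the model expects 16 kHz input

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """Mimic the documented preprocessing: stereo -> mono, resample to 16 kHz.

    `audio` holds float samples, shape (n,) for mono or (channels, n) for
    stereo. This sketch covers only channel and sample-rate normalization.
    """
    if audio.ndim == 2:                      # stereo: average channels to mono
        audio = audio.mean(axis=0)
    if sr != TARGET_SR:                      # naive linear-interpolation resample
        n_out = int(round(len(audio) * TARGET_SR / sr))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        audio = np.interp(x_new, x_old, audio)
    return audio.astype(np.float32)

# Two seconds of 44.1 kHz stereo noise become two seconds of 16 kHz mono.
stereo = np.random.randn(2, 88_200)
mono16k = preprocess(stereo, 44_100)
print(mono16k.shape)  # (32000,)
```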
This is a focused release. You do not get a general voice stack, speech synthesis, diarization, or timeline metadata. You get transcription.
Long-form transcription behavior
Cohere built the model for recordings that exceed a single short utterance. Audio longer than 35 seconds is split into overlapping chunks, transcribed, and reassembled automatically by model.transcribe(), and the published usage examples include a 55-minute earnings call.
For developers, this matters more than the model size headline. Long-form chunking is usually where ASR pipelines become operationally messy, especially once you start handling meetings, support calls, and uploaded media at scale. A model that exposes a single transcription method reduces custom orchestration, even if you still need upstream VAD and downstream post-processing.
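The release describes this chunking as internal to model.transcribe(); the 35-second window comes from the announcement, while the overlap length and the reassembly strategy are not specified, so both are assumptions here. A minimal sketch of the windowing shape:

```python
import numpy as np

SR = 16_000
CHUNK_S = 35       # audio longer than 35 s is split, per the release notes
OVERLAP_S = 5      # overlap width is an assumption for this sketch

def chunk_audio(audio: np.ndarray, sr: int = SR,
                chunk_s: int = CHUNK_S, overlap_s: int = OVERLAP_S):
    """Split long audio into overlapping windows, roughly as model.transcribe()
    does internally. Returns a list of (start_sample, window) pairs."""
    chunk, overlap = chunk_s * sr, overlap_s * sr
    if len(audio) <= chunk:
        return [(0, audio)]
    step = chunk - overlap
    starts = range(0, len(audio) - overlap, step)
    return [(s, audio[s:s + chunk]) for s in starts]

# A 90-second clip yields three overlapping windows: two full 35 s windows
# and a final 30 s tail.
audio = np.zeros(90 * SR, dtype=np.float32)
chunks = chunk_audio(audio)
print([(s // SR, len(w) // SR) for s, w in chunks])  # [(0, 35), (30, 35), (60, 30)]
```

Reassembly (deduplicating text in the overlap regions) is the harder half of the problem, which is exactly why having it inside one library call is valuable.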
Deployment paths
Cohere Transcribe supports two clear serving modes: local inference with Transformers and production serving with vLLM.
| Deployment path | Details |
|---|---|
| Transformers | Recommended for local or offline inference |
| vLLM | Recommended for production serving |
| API shape | OpenAI-compatible /v1/audio/transcriptions when served through vLLM |
The vLLM path is the practical story here. If your stack already standardizes on OpenAI-style clients, Cohere Transcribe can slot into the same interface pattern you use elsewhere for model serving. That lowers integration cost more than a bespoke audio API would.
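Assuming a vLLM server is already running for the model (the host and port below are illustrative, not from the release), the endpoint is addressed like any OpenAI-style transcription API; with the official `openai` client the call is `client.audio.transcriptions.create(model=..., file=..., language=...)`. This sketch only assembles the request pieces, without sending anything:

```python
from urllib.parse import urljoin

def transcription_request(base_url: str, model: str, language: str):
    """Build the URL and form fields for an OpenAI-compatible transcription
    call. The audio file itself is sent as a multipart form part."""
    url = urljoin(base_url, "/v1/audio/transcriptions")
    form = {"model": model, "language": language}
    return url, form

url, form = transcription_request(
    "http://localhost:8000", "CohereLabs/cohere-transcribe-03-2026", "en")
print(url)  # http://localhost:8000/v1/audio/transcriptions
```

Because the path and field names match the OpenAI audio API, the same client code can point at a hosted provider in development and at your own vLLM deployment in production.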
Cohere also lists Transcribe in Model Vault as a supported self-serve ASR model, which places it directly inside the company’s enterprise deployment catalog. If you are already thinking about isolated deployments, retention controls, and internal governance, this aligns with the same buying pattern as other enterprise-hosted models and with broader work on building enterprise AI on your own data.
Performance claims
Cohere’s headline numbers are strong enough to get attention from anyone evaluating open ASR.
| Metric | Cohere claim |
|---|---|
| Parameters | 2B |
| Average WER | 5.42% |
| Human-eval win rate | 61% on accuracy, coherence, and usability |
| Throughput | 525 minutes of audio processed per minute |
| Relative speed claim | Up to 3x faster real-time factor than dedicated ASR models in the same size range |
The named comparison set includes Zoom Scribe v1, IBM Granite 4.0 1B, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B Speech. Cohere also notes weaker results in Portuguese, German, and Spanish than its overall average ranking would suggest.
If you run multilingual transcription, do not treat the aggregate WER as the only number that matters. Per-language variance is often the difference between a viable call-center deployment and a support burden. The same evaluation discipline used for testing AI systems applies here, especially when your traffic is not English-dominant.
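To make the per-language check concrete, here is a dependency-free WER sketch; in practice a library such as jiwer, plus consistent text normalization, is the better choice, and the sample data below is invented:

```python
from collections import defaultdict
from statistics import mean

def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    prev_row = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur_row = [i]
        for j, hw in enumerate(h, 1):
            cur_row.append(min(prev_row[j] + 1,               # deletion
                               cur_row[j - 1] + 1,            # insertion
                               prev_row[j - 1] + (rw != hw))) # substitution
        prev_row = cur_row
    return prev_row[-1] / max(len(r), 1)

samples = [  # (language, reference, hypothesis) from your own eval set
    ("en", "the call starts now", "the call starts now"),
    ("de", "guten morgen zusammen", "guten morgen"),
]

per_lang = defaultdict(list)
for lang, ref, hyp in samples:
    per_lang[lang].append(wer(ref, hyp))
report = {lang: round(mean(v), 3) for lang, v in per_lang.items()}
print(report)  # {'en': 0.0, 'de': 0.333}
```

Segmenting the report by language (and by noise profile) surfaces exactly the per-language variance the aggregate number hides.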
Product constraints
Cohere is explicit about the model’s limits, and those limits define where it fits.
| Constraint | Impact |
|---|---|
| Single language per request | You must specify an ISO 639-1 language code |
| No language detection | You need routing logic upstream |
| No timestamps | Subtitle, search, and clip-indexing pipelines need extra tooling |
| No speaker diarization | Meeting transcripts need a separate diarization stage |
| Inconsistent code-switching performance | Mixed-language audio needs careful evaluation |
| Can over-transcribe non-speech | Use noise gating or VAD upstream |
This shapes the integration strategy. If you are building meeting intelligence, this model is one component, not the whole stack. If you are building batch transcription for known-language audio, it is much closer to a drop-in engine.
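Since the model does no language detection, every request needs an upstream routing step. A minimal sketch, assuming a separate language-ID stage supplies a detected code (the normalization and fallback policy here are assumptions, not part of the release):

```python
# The 14 supported languages as ISO 639-1 codes, per the model card.
SUPPORTED = {"en", "de", "fr", "it", "es", "pt", "el", "nl",
             "pl", "ar", "vi", "zh", "ja", "ko"}

def route_language(detected: str, default: str = "en") -> str:
    """Pick the language code to send with a transcription request.

    `detected` comes from an upstream language-ID stage (e.g. a LID model
    run on the first seconds of audio); the model itself does no detection.
    """
    code = detected.lower().split("-")[0]   # normalize e.g. "pt-BR" -> "pt"
    return code if code in SUPPORTED else default

print(route_language("pt-BR"))  # pt
print(route_language("sv"))     # en  (unsupported -> fallback)
```

For mixed-language traffic, the fallback branch is also where you would route to a different engine rather than silently transcribing in the wrong language.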
The missing timestamps and diarization are especially important. Without them, downstream retrieval, call analytics, and agent memory pipelines become harder to structure. If your application indexes transcripts for later retrieval, your chunking and metadata strategy starts to resemble the same tradeoffs you manage in RAG systems, only with audio-derived text as the source.
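Without timestamps, a retrieval pipeline has to carry its own positional metadata. One sketch, using word offsets as a stand-in for time; the chunk sizes and field names are illustrative:

```python
def chunk_transcript(text: str, call_id: str, language: str,
                     size: int = 80, overlap: int = 20):
    """Split a flat transcript into overlapping word windows with metadata.
    With no model timestamps, position is tracked by word offset instead."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append({
            "call_id": call_id,
            "language": language,
            "word_offset": start,          # stand-in for a missing timestamp
            "text": " ".join(words[start:start + size]),
        })
    return chunks

# A 200-word transcript becomes three overlapping 80-word chunks.
chunks = chunk_transcript("word " * 200, "call-42", "en")
print(len(chunks))  # 3
```

If you later add a timestamp-enrichment stage (e.g. forced alignment), the `word_offset` field is the natural join key back to real timings.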
Competitive position
The release stands out because it combines open weights, a compact 2B footprint, Apache 2.0 licensing, documented vLLM serving, and immediate enterprise packaging through Model Vault. Plenty of ASR models give you one or two of those properties. Fewer give you all of them together.
This is where the model is most useful: self-hosted transcription where data handling matters, latency matters, and you want a deployment path that looks like the rest of your model infrastructure instead of a separate voice-specific stack.
If you are evaluating transcription today, test Cohere Transcribe on your real audio, segmented by language and noise profile, and plan for VAD, diarization, and timestamp enrichment as separate services rather than assuming the base model will cover them.