How to Run IBM Granite 4.0 1B Speech for Multilingual Edge ASR and Translation
Learn how to deploy IBM Granite 4.0 1B Speech for fast multilingual ASR and translation on edge devices.
IBM Granite 4.0 1B Speech gives you a compact open model for multilingual automatic speech recognition (ASR) and speech translation that is small enough for edge-oriented deployments. Released on Hugging Face in early March 2026, it adds Japanese support, keyword biasing, and native runtime support across Transformers, vLLM, and MLX. The official announcement and model card cover the full feature set. This walkthrough shows how to run it locally, transcribe audio, translate speech, and choose the right runtime for your hardware.
What Granite 4.0 1B Speech supports
Granite 4.0 1B Speech is a 1B-parameter speech-language model built for:
- ASR in English, French, German, Spanish, Portuguese, and Japanese
- Bidirectional speech translation involving English and those languages
- Intended translation support, per the model card, for English↔Italian and English↔Mandarin
- Keyword list biasing for names, acronyms, and domain terms
IBM reports these benchmark numbers on the Hugging Face Open ASR leaderboard:
| Metric | Value |
|---|---|
| Average WER | 5.52 |
| RTFx | 280.02 |
| LibriSpeech Clean WER | 1.42 |
| LibriSpeech Other WER | 2.85 |
| AMI WER | 8.44 |
| Earnings22 WER | 8.48 |
| VoxPopuli WER | 5.84 |
Underneath, the model builds on granite-4.0-1b-base with a 128k context length. If you work with long transcripts or agent pipelines that need to preserve large histories, the context tradeoffs are similar to other long-context systems discussed in Context Windows Explained: Why Your AI Forgets.
Installation and setup with Transformers
The model card specifies transformers>=4.52.1. Start with a Python environment that has PyTorch, Transformers, and audio utilities.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install "transformers>=4.52.1" torch torchaudio accelerate soundfile librosa
If you plan to run on Apple Silicon with MLX later, keep that environment separate. The MLX path uses different packages.
You also need a Hugging Face login if the runtime expects authenticated downloads:
huggingface-cli login
Basic ASR with a local audio file
For the simplest path, load the model and processor from Hugging Face, then send an audio waveform plus a text prompt that requests transcription.
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
model_id = "ibm-granite/granite-4.0-1b-speech"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto"
)
audio, sr = sf.read("sample_en.wav")
prompt = "Transcribe the audio in English."
inputs = processor(
text=prompt,
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False
)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
The pattern is straightforward: load audio, provide a task instruction, and generate text.
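One detail worth checking first: soundfile returns audio at the file's native sample rate, and speech encoders typically expect 16 kHz mono, so confirm the expected rate in the model card before feeding arbitrary files. Here is a minimal numpy-only resampling sketch (in practice you would more likely use librosa.resample or torchaudio; resample_to_16k is an illustrative helper name):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, sr: int, target_sr: int = 16_000):
    """Linear-interpolation resample of a mono waveform to target_sr.

    Good enough for quick experiments; use librosa or torchaudio for
    production-quality resampling with proper anti-aliasing.
    """
    if sr == target_sr:
        return audio, sr
    duration = len(audio) / sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio), target_sr
```

Run this between sf.read and the processor call whenever your source audio is 44.1 kHz or 48 kHz.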
If your application already has a local model execution pipeline, this fits naturally into the same deployment pattern as other local model workflows described in How to Run LLMs Locally on Your Machine.
Multilingual transcription prompts
Granite is instruction-following, so task prompts matter. Use explicit prompts for the source language and output format.
Examples:
| Use case | Prompt |
|---|---|
| English ASR | Transcribe the audio in English. |
| Japanese ASR | Transcribe the audio in Japanese. |
| French ASR, preserve punctuation | Transcribe the audio in French with punctuation. |
| Domain transcript | Transcribe the audio in German. Preserve product names and acronyms. |
Example for Japanese:
audio, sr = sf.read("sample_ja.wav")
inputs = processor(
text="Transcribe the audio in Japanese.",
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)
For production systems, keep prompts templated and versioned. Prompt drift changes transcript style, punctuation, and language behavior. The same operational discipline applies to speech models as it does to text models in Prompt Engineering Guide: How to Write Better AI Prompts.
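One way to keep prompts templated and versioned is a small registry keyed by task, language, and version, so a prompt change is an explicit, reviewable diff rather than an inline string edit. The structure below is illustrative, not part of the model's API:

```python
# Hypothetical versioned prompt registry; keys and names are illustrative.
PROMPTS = {
    ("asr", "en", "v1"): "Transcribe the audio in English.",
    ("asr", "ja", "v1"): "Transcribe the audio in Japanese.",
    ("ast", "es-en", "v1"): "Translate the Spanish audio to English.",
}

def get_prompt(task: str, lang: str, version: str = "v1") -> str:
    """Look up a pinned prompt; fail loudly rather than drifting silently."""
    try:
        return PROMPTS[(task, lang, version)]
    except KeyError:
        raise ValueError(f"No prompt registered for {task}/{lang}/{version}")
```

Bumping "v1" to "v2" then lets you A/B transcripts across prompt versions instead of guessing which wording produced which output style.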
Speech translation with Granite 4.0 1B Speech
The model supports automatic speech translation (AST), including bidirectional translation involving English. You can request translation directly in the prompt.
audio, sr = sf.read("sample_es.wav")
inputs = processor(
text="Translate the Spanish audio to English.",
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False
)
translation = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(translation)
You can invert the direction as well:
- Translate the English audio to German.
- Translate the Japanese audio to English.
For downstream applications, it often helps to separate transcription and translation into two evaluation steps. That gives you cleaner error analysis and makes it easier to decide whether a retrieval step or a post-processing model should be added later. If you build that kind of pipeline, the same orchestration choices show up in AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.
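A lightweight way to keep those two evaluation steps separate is a per-clip record that scores ASR and AST outputs independently, so a bad translation can be attributed to a bad transcript or to the translation step itself. ClipEval is a hypothetical helper, not part of any Granite tooling:

```python
from dataclasses import dataclass

@dataclass
class ClipEval:
    """Per-clip record keeping ASR and AST outputs and references separate."""
    audio_path: str
    asr_output: str = ""
    ast_output: str = ""
    asr_ref: str = ""
    ast_ref: str = ""

    def stage_errors(self) -> dict:
        # Exact-match is a crude stand-in; swap in WER/BLEU per stage later.
        return {
            "asr_mismatch": self.asr_output.strip() != self.asr_ref.strip(),
            "ast_mismatch": self.ast_output.strip() != self.ast_ref.strip(),
        }
```

Aggregating stage_errors over an evaluation set quickly shows whether translation failures cluster on clips where the transcript was already wrong.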
Keyword biasing for names and acronyms
One of the most useful additions in this release is keyword list biasing. IBM calls out better handling for names and acronyms, which matters for enterprise audio, support calls, internal meetings, and medical or technical vocabulary.
The exact prompt format can evolve with runtime support, but the practical pattern is to inject a controlled list into the instruction:
keywords = ["GetAIBook", "MCP", "Kubernetes", "PostgreSQL", "Granite", "ETL"]
prompt = (
"Transcribe the audio in English. "
"Prefer these keywords when relevant: "
+ ", ".join(keywords)
+ ". Preserve capitalization for acronyms."
)
audio, sr = sf.read("meeting.wav")
inputs = processor(
text=prompt,
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcript = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcript)
This is especially useful when the cost of a wrong entity is higher than the cost of a minor punctuation mistake. If your application later maps transcripts into structured records, pair transcription with explicit output validation, similar to the approach in Structured Output from LLMs: JSON Mode Explained.
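A simple post-check pairs well with biasing: verify that expected entities actually appear in the transcript with correct casing before the text enters downstream systems. check_entities is an illustrative helper, not part of any Granite API:

```python
import re

def check_entities(transcript: str, keywords: list) -> dict:
    """Case-sensitive presence check for each expected keyword.

    Missing or wrongly-cased entities are flagged (False) so they can be
    routed to review or correction instead of silently passing through.
    """
    return {kw: re.search(re.escape(kw), transcript) is not None for kw in keywords}
```

Clips where a biased keyword was expected but absent are good candidates for your error-analysis set.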
Running Granite Speech with vLLM
IBM also lists vLLM support, which is the better fit when you need a server endpoint, request batching, or a shared inference service.
Install vLLM in a dedicated environment:
pip install vllm
Then start a server for the model:
python -m vllm.entrypoints.openai.api_server \
--model ibm-granite/granite-4.0-1b-speech \
--trust-remote-code
A client request typically includes both text and audio content. The exact multimodal message schema can vary by vLLM version, so check the model card examples first. A representative request shape looks like this:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
model="ibm-granite/granite-4.0-1b-speech",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the audio in English."},
{"type": "input_audio", "input_audio": {"data": "...base64 audio...", "format": "wav"}}
]
}
],
temperature=0
)
print(response.choices[0].message.content)
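The request above leaves the audio payload as a placeholder; a small stdlib helper can produce that base64 string from a WAV file on disk (encode_wav is an illustrative name):

```python
import base64

def encode_wav(path: str) -> str:
    """Base64-encode an audio file for an OpenAI-style multimodal request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Pass the returned string as the "data" field alongside "format": "wav".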
Use vLLM when you need concurrency and a service boundary. Use Transformers when you need direct local control and easier debugging.
Running on Apple Silicon with MLX
The model card notes support for mlx-audio>=0.4.1 and points to quantized MLX community variants. This is the most practical route for Mac-based edge deployments.
pip install "mlx-audio>=0.4.1"
Then use the MLX-compatible Granite Speech variant from the model card references. Quantized MLX builds reduce memory pressure and improve local usability on MacBook-class hardware.
This runtime choice matters more than many teams expect. A smaller model with a runtime optimized for your target device often beats a larger model that only fits awkwardly. The same principle shows up across local AI engineering work, not just speech.
Configuration choices that affect quality and latency
Granite 4.0 1B Speech is designed around efficient inference. The model card exposes several useful implementation details:
| Setting or detail | Value |
|---|---|
| Speech encoder | 16 Conformer blocks |
| Input dimension | 160 |
| Hidden dimension | 1024 |
| Attention heads | 8 |
| Conv kernel size | 15 |
| Audio block attention | 4-second blocks |
| Acoustic embedding rate | 10 Hz |
| Base text model context | 128k |
There is also a decoding optimization behind the speed claims. IBM’s follow-up paper on self-speculative decoding describes a flow where the CTC encoder drafts a transcript, the model accepts low-entropy frames directly, and the LLM verifies the hypothesis in one forward pass before falling back to autoregressive decoding when needed. The paper reports a 4.4x inverse real-time factor improvement with only a 12 percent relative WER increase over AR search on their evaluation setup.
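At a very high level, the accept step can be sketched as: keep the prefix of the CTC draft whose frames fall below an entropy threshold, and hand everything after it back to the LLM for verification. This is a deliberate simplification of the paper's method, with hypothetical names and an arbitrary threshold:

```python
def accept_draft(draft_tokens: list, entropies: list, threshold: float = 0.5) -> list:
    """Accept the longest low-entropy prefix of a CTC draft transcript.

    Simplified illustration of the self-speculative idea: confident
    (low-entropy) frames are accepted directly; the remainder falls back
    to LLM verification or autoregressive decoding.
    """
    accepted = []
    for tok, entropy in zip(draft_tokens, entropies):
        if entropy > threshold:
            break
        accepted.append(tok)
    return accepted
```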
For deployment planning, the practical takeaway is simple:
- Use deterministic decoding first (temperature=0 or do_sample=False)
- Benchmark with your real audio, especially meetings and phone-quality speech
- Test keyword biasing on entity-heavy audio
- Compare transcription and direct translation separately
- Measure end-to-end latency, not just model generation time
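To measure end-to-end latency rather than just generation time, wrap each pipeline stage (load, preprocess, generate, decode) in a timer and compare the totals. timed is a generic stdlib helper, not Granite-specific:

```python
import time

def timed(fn, *args, **kwargs):
    """Run any pipeline stage and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Timing stages separately often reveals that audio loading and resampling, not model.generate, dominate latency for short clips.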
Limitations and tradeoffs
This model is compact, but it is still a multimodal speech-language model. You should plan around a few constraints.
| Constraint | Practical impact |
|---|---|
| Language support is focused | Best fit is the six listed ASR languages |
| Translation coverage is narrower than global speech APIs | Validate each language pair before production |
| Long or noisy meetings still need chunking and evaluation | Build segmentation into your pipeline |
| Prompt wording affects output style | Keep prompts stable and tested |
| Runtime behavior differs across Transformers, vLLM, and MLX | Benchmark on the exact stack you will ship |
It is also worth noting that leaderboard WER does not equal your business accuracy. Product names, accents, call-center noise, and overlapping speech can dominate error rates. Evaluate transcripts the same way you evaluate any AI output, with task-specific metrics and samples from production traffic.
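Computing WER on your own production samples is the cheapest way to ground-truth the leaderboard numbers. Production evaluations usually use a library such as jiwer, but a minimal stdlib implementation is enough for spot checks:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # rolling DP row of edit distances
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Normalize casing and punctuation consistently before scoring, or the metric will punish style differences instead of recognition errors.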
When Granite 4.0 1B Speech is the right choice
Use this model when you need:
- A compact open model for local or edge speech inference
- Multilingual ASR in the six supported languages
- English-centered speech translation
- Better handling for names and acronyms through keyword biasing
- Flexible deployment in Transformers, vLLM, or MLX
If your main requirement is the broadest possible language coverage, you will likely need to compare it with larger speech models or hosted APIs. If your requirement is low-footprint deployment with strong English and multilingual coverage in a small package, Granite 4.0 1B Speech is a strong candidate.
Start with the Transformers path and a small evaluation set of your own audio. Once prompts and keyword lists are stable, move the same workload to vLLM for shared serving or MLX for Apple Silicon edge deployment, then track WER, latency, and entity accuracy for a week before expanding traffic.