How to Run IBM Granite 4.0 1B Speech for Multilingual Edge ASR and Translation
Learn how to deploy IBM Granite 4.0 1B Speech for fast multilingual ASR and translation on edge devices.
IBM Granite 4.0 1B Speech gives you a compact open model for multilingual automatic speech recognition (ASR) and speech translation that is small enough for edge-oriented deployments. Released on Hugging Face in early March 2026, it adds Japanese support, keyword biasing, and native runtime support across Transformers, vLLM, and MLX. The official announcement and model card cover the full feature set. This walkthrough shows how to run it locally, transcribe audio, translate speech, and choose the right runtime for your hardware.
What Granite 4.0 1B Speech supports
Granite 4.0 1B Speech is a 1B-parameter speech-language model built for:
- ASR in English, French, German, Spanish, Portuguese, and Japanese
- Bidirectional speech translation involving English and those languages
- Intended translation support, per the model card, for English↔Italian and English↔Mandarin
- Keyword list biasing for names, acronyms, and domain terms
IBM reports these benchmark numbers on the Hugging Face Open ASR leaderboard:
| Metric | Value |
|---|---|
| Average WER | 5.52 |
| RTFx | 280.02 |
| LibriSpeech Clean WER | 1.42 |
| LibriSpeech Other WER | 2.85 |
| AMI WER | 8.44 |
| Earnings22 WER | 8.48 |
| VoxPopuli WER | 5.84 |
Underneath, the model builds on granite-4.0-1b-base with a 128k context length. If you work with long transcripts or agent pipelines that need to preserve large histories, the context tradeoffs are similar to other long-context systems discussed in Context Windows Explained: Why Your AI Forgets.
Installation and setup with Transformers
The model card specifies transformers>=4.52.1. Start with a Python environment that has PyTorch, Transformers, and audio utilities.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install "transformers>=4.52.1" torch torchaudio accelerate soundfile librosa
If you plan to run on Apple Silicon with MLX later, keep that environment separate. The MLX path uses different packages.
You also need a Hugging Face login if the runtime expects authenticated downloads:
huggingface-cli login
Basic ASR with a local audio file
For the simplest path, load the model and processor from Hugging Face, then send an audio waveform plus a text prompt that requests transcription.
import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
model_id = "ibm-granite/granite-4.0-1b-speech"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto"
)
audio, sr = sf.read("sample_en.wav")
prompt = "Transcribe the audio in English."
inputs = processor(
text=prompt,
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False
)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
The pattern is straightforward: load audio, provide a task instruction, and generate text.
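One detail worth checking first: soundfile returns audio at the file's native sample rate, and speech encoders typically expect 16 kHz mono, so confirm the expected rate in the model card before feeding arbitrary files. Here is a minimal numpy-only resampling sketch (in practice you would more likely use librosa.resample or torchaudio; resample_to_16k is an illustrative helper name):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, sr: int, target_sr: int = 16_000):
    """Linear-interpolation resample of a mono waveform to target_sr.

    Good enough for quick experiments; use librosa or torchaudio for
    production-quality resampling with proper anti-aliasing.
    """
    if sr == target_sr:
        return audio, sr
    duration = len(audio) / sr
    n_target = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, audio), target_sr
```

Run this between sf.read and the processor call whenever your source audio is 44.1 kHz or 48 kHz.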
If your application already has a local model execution pipeline, this fits naturally into the same deployment pattern as other local model workflows described in How to Run LLMs Locally on Your Machine.
Multilingual transcription prompts
Granite is instruction-following, so task prompts matter. Use explicit prompts for the source language and output format.
Examples:
| Use case | Prompt |
|---|---|
| English ASR | Transcribe the audio in English. |
| Japanese ASR | Transcribe the audio in Japanese. |
| French ASR, preserve punctuation | Transcribe the audio in French with punctuation. |
| Domain transcript | Transcribe the audio in German. Preserve product names and acronyms. |
Example for Japanese:
audio, sr = sf.read("sample_ja.wav")
inputs = processor(
text="Transcribe the audio in Japanese.",
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)
For production systems, keep prompts templated and versioned. Prompt drift changes transcript style, punctuation, and language behavior. The same operational discipline applies to speech models as it does to text models in Prompt Engineering Guide: How to Write Better AI Prompts.
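One way to keep prompts templated and versioned is a small registry keyed by task, language, and version, so a prompt change is an explicit, reviewable diff rather than an inline string edit. The structure below is illustrative, not part of the model's API:

```python
# Hypothetical versioned prompt registry; keys and names are illustrative.
PROMPTS = {
    ("asr", "en", "v1"): "Transcribe the audio in English.",
    ("asr", "ja", "v1"): "Transcribe the audio in Japanese.",
    ("ast", "es-en", "v1"): "Translate the Spanish audio to English.",
}

def get_prompt(task: str, lang: str, version: str = "v1") -> str:
    """Look up a pinned prompt; fail loudly rather than drifting silently."""
    try:
        return PROMPTS[(task, lang, version)]
    except KeyError:
        raise ValueError(f"No prompt registered for {task}/{lang}/{version}")
```

Bumping "v1" to "v2" then lets you A/B transcripts across prompt versions instead of guessing which wording produced which output style.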
Speech translation with Granite 4.0 1B Speech
The model supports automatic speech translation (AST), including bidirectional translation involving English. You can request translation directly in the prompt.
audio, sr = sf.read("sample_es.wav")
inputs = processor(
text="Translate the Spanish audio to English.",
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(
**inputs,
max_new_tokens=256,
do_sample=False
)
translation = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(translation)
You can invert the direction as well:
- Translate the English audio to German.
- Translate the Japanese audio to English.
For downstream applications, it often helps to separate transcription and translation into two evaluation steps. That gives you cleaner error analysis and makes it easier to decide whether a retrieval step or a post-processing model should be added later. If you build that kind of pipeline, the same orchestration choices show up in AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.
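A lightweight way to keep those two evaluation steps separate is a per-clip record that scores ASR and AST outputs independently, so a bad translation can be attributed to a bad transcript or to the translation step itself. ClipEval is a hypothetical helper, not part of any Granite tooling:

```python
from dataclasses import dataclass

@dataclass
class ClipEval:
    """Per-clip record keeping ASR and AST outputs and references separate."""
    audio_path: str
    asr_output: str = ""
    ast_output: str = ""
    asr_ref: str = ""
    ast_ref: str = ""

    def stage_errors(self) -> dict:
        # Exact-match is a crude stand-in; swap in WER/BLEU per stage later.
        return {
            "asr_mismatch": self.asr_output.strip() != self.asr_ref.strip(),
            "ast_mismatch": self.ast_output.strip() != self.ast_ref.strip(),
        }
```

Aggregating stage_errors over an evaluation set quickly shows whether translation failures cluster on clips where the transcript was already wrong.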
Keyword biasing for names and acronyms
One of the most useful additions in this release is keyword list biasing. IBM calls out better handling for names and acronyms, which matters for enterprise audio, support calls, internal meetings, and medical or technical vocabulary.
The exact prompt format can evolve with runtime support, but the practical pattern is to inject a controlled list into the instruction:
keywords = ["GetAIBook", "MCP", "Kubernetes", "PostgreSQL", "Granite", "ETL"]
prompt = (
"Transcribe the audio in English. "
"Prefer these keywords when relevant: "
+ ", ".join(keywords)
+ ". Preserve capitalization for acronyms."
)
audio, sr = sf.read("meeting.wav")
inputs = processor(
text=prompt,
audios=audio,
sampling_rate=sr,
return_tensors="pt"
)
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
transcript = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcript)
This is especially useful when the cost of a wrong entity is higher than the cost of a minor punctuation mistake. If your application later maps transcripts into structured records, pair transcription with explicit output validation, similar to the approach in Structured Output from LLMs: JSON Mode Explained.
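A simple post-check pairs well with biasing: verify that expected entities actually appear in the transcript with correct casing before the text enters downstream systems. check_entities is an illustrative helper, not part of any Granite API:

```python
import re

def check_entities(transcript: str, keywords: list) -> dict:
    """Case-sensitive presence check for each expected keyword.

    Missing or wrongly-cased entities are flagged (False) so they can be
    routed to review or correction instead of silently passing through.
    """
    return {kw: re.search(re.escape(kw), transcript) is not None for kw in keywords}
```

Clips where a biased keyword was expected but absent are good candidates for your error-analysis set.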
Running Granite Speech with vLLM
IBM also lists vLLM support, which is the better fit when you need a server endpoint, request batching, or a shared inference service.
Install vLLM in a dedicated environment:
pip install vllm
Then start a server for the model:
python -m vllm.entrypoints.openai.api_server \
--model ibm-granite/granite-4.0-1b-speech \
--trust-remote-code
A client request typically includes both text and audio content. The exact multimodal message schema can vary by vLLM version, so check the model card examples first. A representative request shape looks like this:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
model="ibm-granite/granite-4.0-1b-speech",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Transcribe the audio in English."},
{"type": "input_audio", "input_audio": {"data": "...base64 audio...", "format": "wav"}}
]
}
],
temperature=0
)
print(response.choices[0].message.content)
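The request above leaves the audio payload as a placeholder; a small stdlib helper can produce that base64 string from a WAV file on disk (encode_wav is an illustrative name):

```python
import base64

def encode_wav(path: str) -> str:
    """Base64-encode an audio file for an OpenAI-style multimodal request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Pass the returned string as the "data" field alongside "format": "wav".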
Use vLLM when you need concurrency and a service boundary. Use Transformers when you need direct local control and easier debugging.
Running on Apple Silicon with MLX
The model card notes support for mlx-audio>=0.4.1 and points to quantized MLX community variants. This is the most practical route for Mac-based edge deployments.
pip install "mlx-audio>=0.4.1"
Then use the MLX-compatible Granite Speech variant from the model card references. Quantized MLX builds reduce memory pressure and improve local usability on MacBook-class hardware.
This runtime choice matters more than many teams expect. A smaller model with a runtime optimized for your target device often beats a larger model that only fits awkwardly. The same principle shows up across local AI engineering work, not just speech.
Configuration choices that affect quality and latency
Granite 4.0 1B Speech is designed around efficient inference. The model card exposes several useful implementation details:
| Setting or detail | Value |
|---|---|
| Speech encoder | 16 Conformer blocks |
| Input dimension | 160 |
| Hidden dimension | 1024 |
| Attention heads | 8 |
| Conv kernel size | 15 |
| Audio block attention | 4-second blocks |
| Acoustic embedding rate | 10 Hz |
| Base text model context | 128k |
There is also a decoding optimization behind the speed claims. IBM’s follow-up paper on self-speculative decoding describes a flow where the CTC encoder drafts a transcript, the model accepts low-entropy frames directly, and the LLM verifies the hypothesis in one forward pass before falling back to autoregressive decoding when needed. The paper reports a 4.4x inverse real-time factor improvement with only a 12 percent relative WER increase over AR search on their evaluation setup.
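At a very high level, the accept step can be sketched as: keep the prefix of the CTC draft whose frames fall below an entropy threshold, and hand everything after it back to the LLM for verification. This is a deliberate simplification of the paper's method, with hypothetical names and an arbitrary threshold:

```python
def accept_draft(draft_tokens: list, entropies: list, threshold: float = 0.5) -> list:
    """Accept the longest low-entropy prefix of a CTC draft transcript.

    Simplified illustration of the self-speculative idea: confident
    (low-entropy) frames are accepted directly; the remainder falls back
    to LLM verification or autoregressive decoding.
    """
    accepted = []
    for tok, entropy in zip(draft_tokens, entropies):
        if entropy > threshold:
            break
        accepted.append(tok)
    return accepted
```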
For deployment planning, the practical takeaway is simple:
- Use deterministic decoding first (temperature=0 or do_sample=False)
- Benchmark with your real audio, especially meetings and phone-quality speech
- Test keyword biasing on entity-heavy audio
- Compare transcription and direct translation separately
- Measure end-to-end latency, not just model generation time
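To measure end-to-end latency rather than just generation time, wrap each pipeline stage (load, preprocess, generate, decode) in a timer and compare the totals. timed is a generic stdlib helper, not Granite-specific:

```python
import time

def timed(fn, *args, **kwargs):
    """Run any pipeline stage and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Timing stages separately often reveals that audio loading and resampling, not model.generate, dominate latency for short clips.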
Limitations and tradeoffs
This model is compact, but it is still a multimodal speech-language model. You should plan around a few constraints.
| Constraint | Practical impact |
|---|---|
| Language support is focused | Best fit is the six listed ASR languages |
| Translation coverage is narrower than global speech APIs | Validate each language pair before production |
| Long or noisy meetings still need chunking and evaluation | Build segmentation into your pipeline |
| Prompt wording affects output style | Keep prompts stable and tested |
| Runtime behavior differs across Transformers, vLLM, and MLX | Benchmark on the exact stack you will ship |
It is also worth noting that leaderboard WER does not equal your business accuracy. Product names, accents, call-center noise, and overlapping speech can dominate error rates. Evaluate transcripts the same way you evaluate any AI output, with task-specific metrics and samples from production traffic.
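Computing WER on your own production samples is the cheapest way to ground-truth the leaderboard numbers. Production evaluations usually use a library such as jiwer, but a minimal stdlib implementation is enough for spot checks:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # rolling DP row of edit distances
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Normalize casing and punctuation consistently before scoring, or the metric will punish style differences instead of recognition errors.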
When Granite 4.0 1B Speech is the right choice
Use this model when you need:
- A compact open model for local or edge speech inference
- Multilingual ASR in the six supported languages
- English-centered speech translation
- Better handling for names and acronyms through keyword biasing
- Flexible deployment in Transformers, vLLM, or MLX
If your main requirement is the broadest possible language coverage, you will likely need to compare it with larger speech models or hosted APIs. If your requirement is low-footprint deployment with strong English and multilingual coverage in a small package, Granite 4.0 1B Speech is a strong candidate.
Start with the Transformers path and a small evaluation set of your own audio. Once prompts and keyword lists are stable, move the same workload to vLLM for shared serving or MLX for Apple Silicon edge deployment, then track WER, latency, and entity accuracy for a week before expanding traffic.