
How to Run IBM Granite 4.0 1B Speech for Multilingual Edge ASR and Translation

Learn how to deploy IBM Granite 4.0 1B Speech for fast multilingual ASR and translation on edge devices.

IBM Granite 4.0 1B Speech gives you a compact open model for multilingual automatic speech recognition (ASR) and speech translation that is small enough for edge-oriented deployments. Released on Hugging Face in early March 2026, it adds Japanese support, keyword biasing, and native runtime support across Transformers, vLLM, and MLX. The official announcement and model card cover the full feature set. This walkthrough shows how to run it locally, transcribe audio, translate speech, and choose the right runtime for your hardware.

What Granite 4.0 1B Speech supports

Granite 4.0 1B Speech is a 1B-parameter speech-language model built for:

  • ASR in English, French, German, Spanish, Portuguese, and Japanese
  • Bidirectional speech translation involving English and those languages
  • Additional translation pairs noted as intended in the model card, including English↔Italian and English↔Mandarin
  • Keyword list biasing for names, acronyms, and domain terms

IBM reports these benchmark numbers on the Hugging Face Open ASR leaderboard:

Metric                  Value
Average WER             5.52
RTFx                    280.02
LibriSpeech Clean WER   1.42
LibriSpeech Other WER   2.85
AMI WER                 8.44
Earnings22 WER          8.48
VoxPopuli WER           5.84

The model builds on granite-4.0-1b-base, which has a 128k context length. If you work with long transcripts or agent pipelines that need to preserve large histories, the context tradeoffs are similar to other long-context systems discussed in Context Windows Explained: Why Your AI Forgets.

Installation and setup with Transformers

The model card specifies transformers>=4.52.1. Start with a Python environment that has PyTorch, Transformers, and audio utilities.

python -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install "transformers>=4.52.1" torch torchaudio accelerate soundfile librosa

If you plan to run on Apple Silicon with MLX later, keep that environment separate. The MLX path uses different packages.

You may also need a Hugging Face login if the model requires authenticated downloads:

huggingface-cli login

Basic ASR with a local audio file

For the simplest path, load the model and processor from Hugging Face, then send an audio waveform plus a text prompt that requests transcription.

import torch
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-4.0-1b-speech"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Speech models typically expect 16 kHz mono input; resample if your file differs.
audio, sr = sf.read("sample_en.wav")

prompt = "Transcribe the audio in English."

inputs = processor(
    text=prompt,
    audios=audio,
    sampling_rate=sr,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

The pattern is straightforward: load audio, provide a task instruction, generate text.

If your application already has a local model execution pipeline, this fits naturally into the same deployment pattern as other local model workflows described in How to Run LLMs Locally on Your Machine.

Multilingual transcription prompts

Granite is instruction-following, so task prompts matter. Use explicit prompts for the source language and output format.

Examples:

Use case                           Prompt
English ASR                        Transcribe the audio in English.
Japanese ASR                       Transcribe the audio in Japanese.
French ASR, preserve punctuation   Transcribe the audio in French with punctuation.
Domain transcript                  Transcribe the audio in German. Preserve product names and acronyms.

Example for Japanese:

audio, sr = sf.read("sample_ja.wav")

inputs = processor(
    text="Transcribe the audio in Japanese.",
    audios=audio,
    sampling_rate=sr,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

result = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(result)

For production systems, keep prompts templated and versioned. Prompt drift changes transcript style, punctuation, and language behavior. The same operational discipline applies to speech models as it does to text models in Prompt Engineering Guide: How to Write Better AI Prompts.
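One lightweight way to keep prompts templated and versioned is a small in-code registry keyed by task, language, and version. The structure below is an illustrative sketch, not an official Granite convention:

```python
# Versioned prompt registry (illustrative structure, not part of the Granite API).
PROMPTS = {
    ("asr", "en", "v1"): "Transcribe the audio in English.",
    ("asr", "ja", "v1"): "Transcribe the audio in Japanese.",
    ("asr", "fr", "v2"): "Transcribe the audio in French with punctuation.",
    ("ast", "es-en", "v1"): "Translate the Spanish audio to English.",
}

def get_prompt(task: str, lang: str, version: str = "v1") -> str:
    """Look up a prompt; failing loudly beats silently drifting to a new wording."""
    try:
        return PROMPTS[(task, lang, version)]
    except KeyError:
        raise KeyError(f"No prompt registered for {(task, lang, version)}")
```

Logging the (task, lang, version) tuple next to every transcript makes style changes traceable when you later update a prompt.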

Speech translation with Granite 4.0 1B Speech

The model supports automatic speech translation (AST), including bidirectional translation involving English. You can request translation directly in the prompt.

audio, sr = sf.read("sample_es.wav")

inputs = processor(
    text="Translate the Spanish audio to English.",
    audios=audio,
    sampling_rate=sr,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

translation = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(translation)

You can invert the direction as well:

  • Translate the English audio to German.
  • Translate the Japanese audio to English.

For downstream applications, it often helps to separate transcription and translation into two evaluation steps. That gives you cleaner error analysis and makes it easier to decide whether a retrieval step or a post-processing model should be added later. If you build that kind of pipeline, the same orchestration choices show up in AI Agent Frameworks Compared: LangChain vs CrewAI vs LlamaIndex.
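To compare the two paths on the same clip, you can run both prompts through one generation wrapper. `generate_fn` below is a hypothetical stand-in for whichever runtime call you use (Transformers or vLLM), not part of the Granite API:

```python
def evaluate_both_paths(audio, sr, src_lang, tgt_lang, generate_fn):
    """Run direct speech translation and a plain transcription pass side by side.

    generate_fn(prompt, audio, sr) -> str is a hypothetical wrapper around
    your own inference call; swap in your Transformers or vLLM code.
    """
    direct = generate_fn(f"Translate the {src_lang} audio to {tgt_lang}.", audio, sr)
    transcript = generate_fn(f"Transcribe the audio in {src_lang}.", audio, sr)
    # Score `direct` against a translated reference and `transcript` against a
    # source-language reference to see which stage contributes the errors.
    return {"direct_translation": direct, "transcript": transcript}
```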

Keyword biasing for names and acronyms

One of the most useful additions in this release is keyword list biasing. IBM calls out better handling for names and acronyms, which matters for enterprise audio, support calls, internal meetings, and medical or technical vocabulary.

The exact prompt format can evolve with runtime support, but the practical pattern is to inject a controlled list into the instruction:

keywords = ["GetAIBook", "MCP", "Kubernetes", "PostgreSQL", "Granite", "ETL"]

prompt = (
    "Transcribe the audio in English. "
    "Prefer these keywords when relevant: "
    + ", ".join(keywords)
    + ". Preserve capitalization for acronyms."
)

audio, sr = sf.read("meeting.wav")

inputs = processor(
    text=prompt,
    audios=audio,
    sampling_rate=sr,
    return_tensors="pt"
)

inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

transcript = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcript)

This is especially useful when the cost of a wrong entity is higher than the cost of a minor punctuation mistake. If your application later maps transcripts into structured records, pair transcription with explicit output validation, similar to the approach in Structured Output from LLMs: JSON Mode Explained.

Running Granite Speech with vLLM

IBM also lists vLLM support, which is the better fit when you need a server endpoint, request batching, or a shared inference service.

Install vLLM in a dedicated environment:

pip install vllm

Then start a server for the model:

python -m vllm.entrypoints.openai.api_server \
  --model ibm-granite/granite-4.0-1b-speech \
  --trust-remote-code

A client request typically includes both text and audio content. The exact multimodal message schema can vary by vLLM version, so check the model card examples first. A representative request shape looks like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="ibm-granite/granite-4.0-1b-speech",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the audio in English."},
                {"type": "input_audio", "input_audio": {"data": "...base64 audio...", "format": "wav"}}
            ]
        }
    ],
    temperature=0
)

print(response.choices[0].message.content)
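The `...base64 audio...` placeholder stands for real base64-encoded file bytes. A small helper can produce it; the file name in the usage comment is an example:

```python
import base64

def wav_to_base64(path: str) -> str:
    """Read a WAV file and return its bytes as a base64 string for the request payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# audio_b64 = wav_to_base64("sample_en.wav")  # pass as input_audio["data"]
```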

Use vLLM when you need concurrency and a service boundary. Use Transformers when you need direct local control and easier debugging.

Running on Apple Silicon with MLX

The model card notes support for mlx-audio>=0.4.1 and points to quantized MLX community variants. This is the most practical route for Mac-based edge deployments.

pip install "mlx-audio>=0.4.1"

Then use the MLX-compatible Granite Speech variant from the model card references. Quantized MLX builds reduce memory pressure and improve local usability on MacBook-class hardware.

This runtime choice matters more than many teams expect. A smaller model with a runtime optimized for your target device often beats a larger model that only fits awkwardly. The same principle shows up across local AI engineering work, not just speech.

Configuration choices that affect quality and latency

Granite 4.0 1B Speech is designed around efficient inference. The model card exposes several useful implementation details:

Setting or detail         Value
Speech encoder            16 Conformer blocks
Input dimension           160
Hidden dimension          1024
Attention heads           8
Conv kernel size          15
Audio block attention     4-second blocks
Acoustic embedding rate   10 Hz
Base text model context   128k
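Using the table values, a quick back-of-the-envelope check shows why the 10 Hz acoustic embedding rate keeps long audio manageable. Treating each acoustic embedding as one context position is a simplification, but it gives the right order of magnitude:

```python
# Values from the model card table above.
EMBEDDING_RATE_HZ = 10      # acoustic embeddings per second
BLOCK_SECONDS = 4           # audio block attention window
CONTEXT_TOKENS = 128_000    # base text model context length

embeddings_per_block = EMBEDDING_RATE_HZ * BLOCK_SECONDS   # 40 per 4-second block
embeddings_per_hour = EMBEDDING_RATE_HZ * 3600             # 36,000 per hour of audio

print(embeddings_per_block)   # 40
print(embeddings_per_hour)    # 36000
# Roughly an hour of audio fits within the 128k context, assuming embeddings
# map one-to-one to context positions, with room left for text tokens.
```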

There is also a decoding optimization behind the speed claims. IBM’s follow-up paper on self-speculative decoding describes a flow where the CTC encoder drafts a transcript, the model accepts low-entropy frames directly, and the LLM verifies the hypothesis in one forward pass before falling back to autoregressive decoding when needed. The paper reports a 4.4x inverse real-time factor improvement with only a 12 percent relative WER increase over AR search on their evaluation setup.

For deployment planning, the practical takeaway is simple:

  • Use deterministic decoding first, temperature=0 or do_sample=False
  • Benchmark with your real audio, especially meetings and phone-quality speech
  • Test keyword biasing on entity-heavy audio
  • Compare transcription and direct translation separately
  • Measure end-to-end latency, not just model generation time
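For the last point, end-to-end latency means timing everything around generation, including audio loading and decoding. A minimal sketch with `time.perf_counter`; `run_pipeline` in the usage comment is a hypothetical stand-in for your full load-plus-generate path:

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for any callable, e.g. the full ASR pipeline."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# result, seconds = timed(run_pipeline, "meeting.wav")  # run_pipeline is yours
```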

Limitations and tradeoffs

This model is compact, but it is still a multimodal speech-language model. You should plan around a few constraints.

Constraint                                                  Practical impact
Language support is focused                                 Best fit is the six listed ASR languages
Translation coverage is narrower than global speech APIs    Validate each language pair before production
Long or noisy meetings still need chunking and evaluation   Build segmentation into your pipeline
Prompt wording affects output style                         Keep prompts stable and tested
Runtime behavior differs across Transformers, vLLM, MLX     Benchmark on the exact stack you will ship
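For the chunking constraint above, a fixed-window segmenter with overlap is enough to start. The 30-second window and 2-second overlap are illustrative defaults, not values from the model card:

```python
def chunk_audio(samples, sr, window_s=30.0, overlap_s=2.0):
    """Split a 1-D sample sequence into overlapping fixed windows for per-chunk ASR."""
    size = int(window_s * sr)
    step = size - int(overlap_s * sr)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    # Overlap gives the model context at boundaries; deduplicate overlapping
    # words when stitching transcripts back together.
    return [samples[start:start + size] for start in range(0, len(samples), step)]
```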

It is also worth noting that leaderboard WER does not equal your business accuracy. Product names, accents, call-center noise, and overlapping speech can dominate error rates. Evaluate transcripts the same way you evaluate any AI output, with task-specific metrics and samples from production traffic.
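Measuring WER on your own samples needs only a word-level edit distance. This is a standard sketch, not IBM's evaluation harness, and it omits text normalization (casing, punctuation) that you would want in practice:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (free when words match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Track WER separately per domain (meetings, phone calls, accented speech) rather than as one blended number, since the blend hides exactly the regressions you care about.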

When Granite 4.0 1B Speech is the right choice

Use this model when you need:

  • A compact open model for local or edge speech inference
  • Multilingual ASR in the six supported languages
  • English-centered speech translation
  • Better handling for names and acronyms through keyword biasing
  • Flexible deployment in Transformers, vLLM, or MLX

If your main requirement is the broadest possible language coverage, you will likely need to compare it with larger speech models or hosted APIs. If your requirement is low-footprint deployment with strong English and multilingual coverage in a small package, Granite 4.0 1B Speech is a strong candidate.

Start with the Transformers path and a small evaluation set of your own audio. Once prompts and keyword lists are stable, move the same workload to vLLM for shared serving or MLX for Apple Silicon edge deployment, then track WER, latency, and entity accuracy for a week before expanding traffic.
