
Sub-100ms Gemma 4 Voice Pipelines Hit Cerebras CS-3

Hugging Face and Cerebras have released a modular speech-to-speech pipeline that achieves sub-100-millisecond voice AI responses using the Gemma-4-31B model.

Hugging Face and Cerebras Systems have launched a new speech-to-speech architecture capable of sub-100-millisecond response times. Detailed in a technical integration announcement, the collaboration relies on running Google’s Gemma 4 model family entirely in the on-chip memory of the Cerebras CS-3 wafer-scale engine. Keeping the model on the chip avoids the delays GPU clusters incur when shuffling data between devices and off-chip memory.

The reference implementation replaces monolithic end-to-end models with a cascaded pipeline of three stages. NVIDIA Parakeet TDT (0.6B) handles low-latency speech-to-text transcription. The transcript is routed to either the dense Google DeepMind Gemma-4-31B or its 26B Mixture-of-Experts variant. Finally, Alibaba Qwen3-TTS synthesizes the spoken reply. If you are building multi-agent systems, decoupling the pipeline allows you to swap individual components without retraining the entire stack.
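The decoupling described above can be illustrated with a minimal Python sketch. It is not the reference implementation: the interface names (`SpeechToText`, `LanguageModel`, `TextToSpeech`), the `CascadedVoicePipeline` class, and the stub stages are all hypothetical stand-ins for Parakeet, Gemma, and Qwen3-TTS, chosen only to show how one stage can be replaced without touching the others.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stage interfaces; the real models expose their own APIs.
class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class CascadedVoicePipeline:
    """Chains three independent stages: audio -> text -> text -> audio."""

    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)  # speech-to-text stage
        answer = self.llm.reply(transcript)      # language model stage
        return self.tts.synthesize(answer)       # text-to-speech stage


# Stub stages so the sketch runs without any model weights or network access.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        # Pretend the audio bytes are already UTF-8 text.
        return audio.decode("utf-8")


class UppercaseLLM:
    def reply(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    def reply(self, prompt: str) -> str:
        return prompt[::-1]


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        # Pretend UTF-8 bytes are the synthesized waveform.
        return text.encode("utf-8")


pipeline = CascadedVoicePipeline(stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS())
first = pipeline.respond(b"hello")

# Swap only the language model; the STT and TTS stages are untouched.
pipeline.llm = ReverseLLM()
second = pipeline.respond(b"hello")
```

Because each stage only depends on a small interface, replacing the dense model with the Mixture-of-Experts variant (or any other component) would be a one-line change in a design like this.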

Running the dense 31B model on the Cerebras CS-3 yields inference speeds of 700 tokens per second. The wafer-scale engine provides 850,000 AI cores and 44 GB of SRAM, allowing the entire model state to reside on the chip. Developers can access this specific hardware configuration through the Hugging Face Inference Providers API using the model identifier google/gemma-4-31B-it:cerebras.
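To put the quoted throughput in perspective, the following Python sketch converts 700 tokens per second into generation time for a reply of a given length. It is back-of-the-envelope arithmetic only: the function names are hypothetical, the rate is assumed to be constant, and the result excludes transcription, synthesis, network transit, and time to first token.

```python
import math

# Throughput quoted for the dense 31B model on the CS-3 (tokens per second).
TOKENS_PER_SECOND = 700.0


def generation_time_ms(num_tokens: int,
                       tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Milliseconds needed to generate num_tokens at a constant rate."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens * 1000.0 / tokens_per_second


def tokens_within_budget(budget_ms: float,
                         tokens_per_second: float = TOKENS_PER_SECOND) -> int:
    """Largest whole number of tokens that fits in a time budget."""
    if budget_ms < 0:
        raise ValueError("budget_ms must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return math.floor(budget_ms * tokens_per_second / 1000.0)


per_token_ms = generation_time_ms(1)        # time per single token
short_reply_ms = generation_time_ms(70)     # a hypothetical 70-token reply
budget_tokens = tokens_within_budget(100)   # tokens that fit in 100 ms
```

At this rate each token takes about 1.43 ms, so a 70-token reply corresponds to roughly 100 ms of pure generation time. The reply lengths are illustrative assumptions, not figures from the announcement.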

Lower latency makes conversational interfaces more practical on physical devices such as robots. Pollen Robotics currently deploys this stack in 9,000 active Reachy Mini humanoid robots, where cascaded speech pipelines are critical for natural human interaction. While Cerebras provides the raw compute capacity for 1.5-second total loop latency, local inference architectures are beginning to narrow the gap for consumer setups.

