Cascaded Speech Pipeline Brings Reachy Mini Inference Local
Hugging Face released an offline conversational stack for the Reachy Mini robot that replaces cloud APIs with a local pipeline built on Gemma 4 and Qwen3-TTS.
On May 27, Hugging Face released a fully local conversational update for the open-source Reachy Mini desktop robot. The update replaces cloud-based dependencies with an offline speech-to-speech library, enabling developers to process audio and reasoning locally without external API keys.
Cascaded Pipeline Architecture
The local stack avoids monolithic end-to-end models in favor of a modular cascaded pipeline. This structure allows developers to swap components as new weights become available for specific processing stages.
| Component | Default Model | Purpose |
|---|---|---|
| Voice Activity (VAD) | Silero VAD | Speech boundary detection |
| Speech-to-Text (STT) | Parakeet-TDT | High-speed audio transcription |
| Language Model (LLM) | Gemma 4 (gemma-4-E4B-it-GGUF) | Dialog reasoning |
| Text-to-Speech (TTS) | Qwen3-TTS | Expressive audio generation |
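The component table above can be read as four interchangeable functions chained in order. The following is a minimal sketch of that idea, not the library's actual code: the `CascadedPipeline` class and its stub stages are hypothetical stand-ins for Silero VAD, Parakeet-TDT, Gemma 4, and Qwen3-TTS.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


# Hypothetical container: each stage is a plain callable, so any single
# stage can be replaced without touching the others.
@dataclass
class CascadedPipeline:
    vad: Callable[[bytes], bool]  # True if the chunk contains speech
    stt: Callable[[bytes], str]   # audio -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio

    def process(self, audio: bytes) -> Optional[bytes]:
        """Run one utterance through VAD -> STT -> LLM -> TTS."""
        if not self.vad(audio):
            return None  # no speech detected, skip the expensive stages
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages for illustration only; real models would go here.
pipeline = CascadedPipeline(
    vad=lambda audio: len(audio) > 0,
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)

result = pipeline.process(b"hello")
silence = pipeline.process(b"")

# Swapping only the LLM stage leaves VAD, STT, and TTS unchanged.
shouting = replace(pipeline, llm=lambda text: text.upper())
```

Because each stage only depends on the input and output types of its neighbors, replacing one model means replacing one callable.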
The core engine uses llama-server to host Gemma 4 with a 64k-token context window. The server is configured with parallel slots, so the system can handle a user interrupting during playback without losing session state.
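As a rough illustration of that server setup, the sketch below assembles a llama-server command line using the `--model`, `--ctx-size`, `--parallel`, and `--port` options from llama.cpp. The model path, slot count, and port are assumptions chosen for the example, and 64k is taken to mean 65,536 tokens; the release's actual launch configuration may differ.

```python
import shlex


def build_llama_server_cmd(
    model_path: str,
    ctx_tokens: int = 65536,   # assumption: "64k" read as 64 * 1024 tokens
    parallel_slots: int = 2,   # assumption: the source does not give a slot count
    port: int = 8080,          # assumption: arbitrary local port
) -> list[str]:
    """Return an argument list suitable for subprocess.run()."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_tokens),
        "--parallel", str(parallel_slots),
        "--port", str(port),
    ]


# Hypothetical local path to a Gemma 4 GGUF file.
cmd = build_llama_server_cmd("models/gemma-4-E4B-it.gguf")
print(shlex.join(cmd))
```

How the context budget is shared across slots depends on the llama.cpp version, so it is worth checking the server documentation before settling on a slot count.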
Hardware and Deployment
The architecture splits the compute load across two machines. The robot, powered by a Raspberry Pi 5, handles motor control for the 6 degrees of freedom in its head and captures audio through its XMOS XVF3800 4-microphone array. A separate workstation runs the LLMs locally. The two nodes communicate over FastRTC for low-latency audio streaming.
To simplify migration for developers building real-time voice agents, the local backend exposes a WebSocket connection formatted to match the OpenAI Realtime API (/v1/realtime). Users can boot the stack using Reachy Mini SDK 1.7.1. For multimodal tasks, adding the --local-vision flag activates SmolVLM2 for on-device image processing.
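To give a sense of what travels over that WebSocket, here is a small sketch that builds two client events in the style of the OpenAI Realtime API: `input_audio_buffer.append`, which carries base64-encoded audio, and `response.create`. Which event types the local backend actually accepts is an assumption here, and the helper names are hypothetical.

```python
import base64
import json


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw audio chunk as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def response_create_event() -> str:
    """Ask the server to generate a response for the buffered audio."""
    return json.dumps({"type": "response.create"})


# Eight bytes of placeholder audio data, for illustration only.
chunk = b"\x00\x01" * 4
message = audio_append_event(chunk)
decoded = json.loads(message)
```

Each returned string is one text frame that a WebSocket client would send on the `/v1/realtime` connection.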
If your robotics application currently relies on remote inference, you can point your existing client at the new local server by updating the base URL in your SDK configuration. Requests then go straight to the local workstation, decoupling the robot's reaction time from network latency.
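The base URL change can be sketched as a small helper that turns an HTTP base URL into the matching WebSocket URL for `/v1/realtime`. The hostnames and port below are placeholders, not values from the release, and `realtime_ws_url` is a hypothetical helper rather than part of the Reachy Mini SDK.

```python
from urllib.parse import urlsplit, urlunsplit

REALTIME_PATH = "/v1/realtime"


def realtime_ws_url(base_url: str) -> str:
    """Map an http(s) base URL to the ws(s) URL of the realtime endpoint."""
    parts = urlsplit(base_url)
    # http -> ws and https -> wss; other schemes pass through unchanged.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme, parts.scheme)
    return urlunsplit((scheme, parts.netloc, REALTIME_PATH, "", ""))


# Placeholder endpoints: a remote service and a workstation on localhost.
remote = realtime_ws_url("https://api.example.com")
local = realtime_ws_url("http://localhost:8000")
```

With this shape, switching from remote to local inference comes down to changing one configuration value while the rest of the client code stays the same.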