
Cascaded Speech Pipeline Brings Reachy Mini Inference Local

Hugging Face released an offline conversational stack for the Reachy Mini robot that replaces cloud APIs with a local pipeline built on Gemma 4 and Qwen3-TTS.

On May 27, Hugging Face released a fully local conversational update for the open-source Reachy Mini desktop robot. The update replaces cloud-based dependencies with an offline speech-to-speech library, enabling developers to process audio and reasoning locally without external API keys.

Cascaded Pipeline Architecture

The local stack avoids monolithic end-to-end models in favor of a modular cascaded pipeline. This structure allows developers to swap components as new weights become available for specific processing stages.

| Component | Default Model | Purpose |
|---|---|---|
| Voice Activity (VAD) | Silero VAD | Speech boundary detection |
| Speech-to-Text (STT) | Parakeet-TDT | High-speed audio transcription |
| Language Model (LLM) | Gemma 4 (gemma-4-E4B-it-GGUF) | Dialog reasoning |
| Text-to-Speech (TTS) | Qwen3-TTS | Expressive audio generation |
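
The modular design is easier to see in code. The sketch below is a minimal illustration of a cascade with swappable stages, not the speech-to-speech library's actual interface: `CascadedPipeline`, the four type aliases, and the `fake_*` stand-ins are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each stage is a plain callable, so any one of them can be swapped independently.
# These aliases are hypothetical, not the real library's interfaces.
VadFn = Callable[[bytes], bool]  # does this audio chunk contain speech?
SttFn = Callable[[bytes], str]   # audio -> transcript
LlmFn = Callable[[str], str]     # transcript -> reply text
TtsFn = Callable[[str], bytes]   # reply text -> audio


@dataclass
class CascadedPipeline:
    vad: VadFn
    stt: SttFn
    llm: LlmFn
    tts: TtsFn

    def respond(self, audio: bytes) -> Optional[bytes]:
        """Run one turn through the cascade; return None if no speech is detected."""
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Toy stand-ins so the sketch runs without any models installed.
def fake_vad(audio: bytes) -> bool:
    return any(audio)  # treat any non-zero byte as "speech"


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8", errors="ignore")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(fake_vad, fake_stt, fake_llm, fake_tts)
```

Swapping a stage then means passing a different callable for that slot, for example a new `llm` function, while the other three stay untouched.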

The core engine relies on llama-server to host Gemma 4 with a 64k context window. The server is configured with parallel slots, so the system can accept a user interruption during playback without losing session state.
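
As a rough illustration of that configuration, the helper below assembles a llama-server command line from the standard `--model`, `--ctx-size`, `--parallel`, `--host`, and `--port` flags. It is a sketch under assumptions: it takes 64k to mean 65,536 tokens, and the slot count, host, port, and model path are placeholders that the release does not specify.

```python
import shlex
from typing import List


def llama_server_command(
    model_path: str,
    ctx_size: int = 65536,     # assumption: "64k" read as 65,536 tokens
    parallel_slots: int = 2,   # assumption: the article gives no slot count
    host: str = "127.0.0.1",   # assumption
    port: int = 8080,          # assumption
) -> List[str]:
    """Build an argv list for llama-server from its standard CLI flags."""
    if ctx_size <= 0 or parallel_slots <= 0:
        raise ValueError("ctx_size and parallel_slots must be positive")
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel_slots),
        "--host", host,
        "--port", str(port),
    ]


# Hypothetical local path to the GGUF weights.
cmd = llama_server_command("models/gemma-4-E4B-it.gguf")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.Popen` once the model path points at real weights.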

Hardware and Deployment

The architecture splits the compute burden across two machines. The robot itself, powered by a Raspberry Pi 5, handles motor control for the 6 degrees of freedom in its head and captures audio through its XMOS XVF3800 4-microphone array. A separate workstation runs the language model and the rest of the inference stack. The two nodes communicate over FastRTC for low-latency audio streaming.

To simplify migration for developers building real-time voice agents, the local backend exposes a WebSocket connection formatted to match the OpenAI Realtime API (/v1/realtime). Users can boot the stack using Reachy Mini SDK 1.7.1. For multimodal tasks, adding the --local-vision flag activates SmolVLM2 for on-device image processing.
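
To show what matching the Realtime API format means in practice, the sketch below builds a few JSON client events as defined by the OpenAI Realtime API (`input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`). Whether the local backend accepts each of these events is an assumption the release does not confirm, and the host and port in `realtime_url` are placeholders.

```python
import base64
import json
from typing import List


def realtime_url(host: str = "localhost", port: int = 8000) -> str:
    """WebSocket URL for the local backend. Host and port are assumptions."""
    return f"ws://{host}:{port}/v1/realtime"


def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap raw audio bytes in an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def commit_and_respond_events() -> List[str]:
    """Events that close the audio buffer and request a model response."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]
```

Each string would be sent as one text frame over a WebSocket connection to `realtime_url()` using whichever client library the application already depends on.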

If your robotics application currently relies on remote inference, you can migrate your existing endpoints to the new local server by updating the base URL in your SDK configuration. This targets the local hardware directly, decoupling the robot’s reaction time from network latency.
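
A base URL swap of this kind can be sketched with the standard library alone. The helper below is illustrative, not part of the Reachy Mini SDK: `to_local_base_url`, the example remote endpoint, and the local host and port are all hypothetical, and it assumes the local server runs without TLS.

```python
from urllib.parse import urlsplit, urlunsplit


def to_local_base_url(
    remote_url: str,
    local_host: str = "localhost",  # assumption
    local_port: int = 8000,         # assumption
) -> str:
    """Point an existing endpoint at a local server, keeping path and query."""
    parts = urlsplit(remote_url)
    # Assumes no TLS on the local server, so wss becomes ws and https becomes http.
    scheme = {"wss": "ws", "https": "http"}.get(parts.scheme, parts.scheme)
    netloc = f"{local_host}:{local_port}"
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical remote endpoint used only for illustration.
local = to_local_base_url("wss://api.example.com/v1/realtime?model=demo")
print(local)
```

Because only the scheme and host change, the rest of the client code, including the `/v1/realtime` path, stays as it was.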
