Cascaded Speech Pipeline Brings Reachy Mini Inference Local

On May 27, Hugging Face released a fully local conversational update for the open-source Reachy Mini desktop robot. The update replaces cloud-based dependencies with an offline speech-to-speech library, so developers can run speech processing and reasoning locally without external API keys.

Cascaded Pipeline Architecture

The local stack avoids monolithic end-to-end models in favor of a modular cascaded pipeline. This structure allows developers to swap components as new weights become available for specific processing stages.

Component	Default Model	Purpose
Voice Activity (VAD)	Silero VAD	Speech boundary detection
Speech-to-Text (STT)	Parakeet-TDT	High-speed audio transcription
Language Model (LLM)	Gemma 4 (`gemma-4-E4B-it-GGUF`)	Dialog reasoning
Text-to-Speech (TTS)	Qwen3-TTS	Expressive audio generation
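
To illustrate why a cascaded design makes stages swappable, here is a minimal sketch in Python. It is not the library's actual code: the `CascadedPipeline` class and the `toy_*` functions are hypothetical stand-ins for the real models in the table, and each stage is reduced to a plain callable.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class CascadedPipeline:
    """Hypothetical four-stage pipeline: VAD -> STT -> LLM -> TTS."""
    vad: Callable[[bytes], bool]   # True when the chunk contains speech
    stt: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def respond(self, audio: bytes) -> Optional[bytes]:
        # Skip the expensive stages when no speech is detected.
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Toy stand-ins so the sketch runs without any model weights.
def toy_vad(audio: bytes) -> bool:
    return len(audio) > 0

def toy_stt(audio: bytes) -> str:
    return audio.decode("utf-8")

def toy_llm(text: str) -> str:
    return f"You said: {text}"

def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(toy_vad, toy_stt, toy_llm, toy_tts)

# Swapping one stage leaves the other three untouched.
shouting = replace(pipeline, llm=lambda text: text.upper())
```

Because every stage only depends on the input and output types of its neighbors, replacing the language model (or any other stage) is a one-field change rather than a retraining of an end-to-end model.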

The core engine relies on llama-server to host Gemma 4 with a 64k context window. The server configuration includes parallel slot support, enabling the system to handle user interruptions naturally during playback without losing session state.
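
The article does not give the exact launch command, so the following is only a sketch of how such a server invocation might be assembled. It assumes that "64k" means 65,536 tokens, that two parallel slots are enough, and that the model lives at a hypothetical local path. The `--model`, `--ctx-size`, `--parallel`, and `--port` options are standard llama-server flags; the values are illustrative.

```python
import shlex
from typing import List


def build_llama_server_cmd(
    model_path: str,
    ctx_size: int = 64 * 1024,   # assumption: "64k" read as 65,536 tokens
    parallel_slots: int = 2,     # assumption: slot count is not stated in the source
    port: int = 8080,            # assumption: llama-server's usual default port
) -> List[str]:
    """Return an argv list for llama-server; does not start the process."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel_slots),
        "--port", str(port),
    ]


# Hypothetical model path, for illustration only.
cmd = build_llama_server_cmd("models/gemma-4-E4B-it.gguf")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.Popen` once the path and slot count are adjusted to the actual setup.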

Hardware and Deployment

The architecture splits the compute load across two machines. The robot, powered by a Raspberry Pi 5, handles motor control for the 6 degrees of freedom in its head and captures audio through its XMOS XVF3800 4-microphone array. A separate workstation runs the LLMs locally. The robot and the workstation communicate over FastRTC for low-latency audio streaming.

To simplify migration for developers building real-time voice agents, the local backend exposes a WebSocket endpoint that matches the format of the OpenAI Realtime API (`/v1/realtime`). Users can boot the stack using Reachy Mini SDK 1.7.1. For multimodal tasks, adding the `--local-vision` flag activates SmolVLM2 for on-device image processing.

If your robotics application currently relies on remote inference, you can point it at the new local server by updating the base URL in your SDK configuration. Requests then go to the local server directly, which decouples the robot’s reaction time from network latency.
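
As a rough sketch of that base-URL change, the helper below turns an HTTP base URL into the matching WebSocket URL for the `/v1/realtime` path mentioned above. The function name, host, and port are hypothetical; the actual address depends on how the local backend is started.

```python
from urllib.parse import urlsplit, urlunsplit


def realtime_ws_url(base_url: str) -> str:
    """Map an http(s) base URL to a ws(s) URL ending in /v1/realtime."""
    parts = urlsplit(base_url)
    # WebSocket schemes mirror their HTTP counterparts.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme, parts.scheme)
    return urlunsplit((scheme, parts.netloc, "/v1/realtime", "", ""))


# Hypothetical local address, for illustration only.
LOCAL_BASE_URL = "http://localhost:8765"
print(realtime_ws_url(LOCAL_BASE_URL))
```

A client written against a Realtime-style WebSocket endpoint would then only need this URL swapped in, assuming the local backend accepts the same message format.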
