
Build Real-Time Voice Agents with Cloudflare Agents SDK

Learn how to integrate low-latency voice interactions into your AI agents using Cloudflare's new @cloudflare/voice package and Durable Objects.

Cloudflare’s new experimental voice pipeline for the Agents SDK allows you to build real-time, low-latency voice interactions directly into your AI agents. Released during their April 2026 Agents Week, the voice pipeline relies on the same Durable Object architecture used for text interactions. You can now process conversational audio without standing up complex WebRTC infrastructure.

Architecture and Transport

The transport mechanism bypasses Selective Forwarding Units (SFUs) entirely. Microphone audio is captured in the browser and streamed directly to the agent as 16 kHz mono PCM audio using binary WebSocket frames.

To minimize conversational lag, Cloudflare colocates the agent, Speech-to-Text (STT), and Text-to-Speech (TTS) on its global network. This eliminates network hops between different infrastructure providers. The tight integration between the transport layer and the inference hardware ensures latency remains low enough for natural verbal exchanges.

Server-Side Implementation

You integrate voice capabilities server-side using the @cloudflare/voice package. The core step is wrapping your existing agent class with the withVoice(Agent) function, which handles audio stream processing and state management. The official documentation covers the wrapper's full integration patterns and configuration parameters.
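To make the wrapper pattern concrete, here is a rough self-contained sketch of what withVoice(Agent) does conceptually. Everything below (the class shape, method names, and the placeholder transcription step) is an illustrative assumption, not the published @cloudflare/voice API:

```typescript
// Hypothetical sketch of the withVoice(Agent) mixin pattern.
// The real @cloudflare/voice interface may differ in every detail.

class Agent {
  // Existing text handler, shared by typed and spoken input.
  onMessage(text: string): string {
    return `agent reply to: ${text}`;
  }
}

// A stand-in withVoice: returns a subclass that adds an audio entry point
// feeding transcribed speech into the existing text handler.
function withVoice<T extends new (...args: any[]) => Agent>(Base: T) {
  return class extends Base {
    // Placeholder transcription; the real pipeline runs STT (Deepgram Flux)
    // on Cloudflare's network before text reaches onMessage.
    private transcribe(frame: Int16Array): string {
      return `[transcript of ${frame.length} samples]`;
    }

    onAudioFrame(frame: Int16Array): string {
      return this.onMessage(this.transcribe(frame));
    }
  };
}

const VoiceAgent = withVoice(Agent);
```

The mixin shape matters: because the wrapper subclasses your agent rather than replacing it, text handling keeps working unchanged alongside the new audio path.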

Voice and text inputs share the same persistent state. The conversation history is stored in a SQLite database running inside the agent’s Durable Object. This unified state means users can switch seamlessly between typing and speaking during a single session. This is a critical architectural advantage when you configure agent memory for long-running sessions. The agent retains full context regardless of the input modality.
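The unified-state idea can be illustrated with a small modality-tagged log like the one the agent persists to its Durable Object's SQLite store. The type and class names here are ours, purely for illustration; the SDK's actual schema is not documented in this article:

```typescript
// Illustrative only: a conversation log where voice and text turns land in
// the same history, so the agent reads one transcript regardless of modality.
type Modality = "text" | "voice";

interface Turn {
  role: "user" | "assistant";
  modality: Modality;
  content: string;
}

class ConversationHistory {
  private turns: Turn[] = [];

  add(role: Turn["role"], modality: Modality, content: string): void {
    this.turns.push({ role, modality, content });
  }

  // One unified transcript: the model never sees the modality boundary.
  transcript(): string[] {
    return this.turns.map((t) => `${t.role}: ${t.content}`);
  }
}
```

A user can type one turn and speak the next; both land in the same transcript, which is why switching modalities mid-session loses no context.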

Built-in AI Providers

The SDK includes several default Workers AI providers that run locally on Cloudflare’s network without requiring external API keys.

  • Continuous Transcription: Conversational audio runs through Deepgram Flux.
  • Dictation Input: Standalone dictation tasks use Deepgram Nova 3.
  • Voice Synthesis: Low-latency voice generation relies on Deepgram Aura.

The provider interfaces remain small and modular by design. You can swap out the default models for third-party telephony services like Twilio or specialized transport layers if your architecture requires it.
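A small interface is what makes the swap cheap. As a hedged sketch (the interface and method names below are assumptions, not the SDK's real provider contract), replacing a default STT provider with a custom one only requires satisfying the same narrow surface:

```typescript
// Sketch of the pluggable-provider idea; illustrative names only.
interface SpeechToTextProvider {
  transcribe(pcm: Int16Array): Promise<string>;
}

// Stands in for the built-in path (Deepgram Flux on Workers AI).
const builtinSTT: SpeechToTextProvider = {
  async transcribe(pcm) {
    return `builtin transcript (${pcm.length} samples)`;
  },
};

// A third-party replacement (e.g. a telephony backend) only has to
// implement the same small interface.
const customSTT: SpeechToTextProvider = {
  async transcribe(pcm) {
    return `custom transcript (${pcm.length} samples)`;
  },
};

// The agent code is written against the interface, not a vendor.
async function handleAudio(
  provider: SpeechToTextProvider,
  pcm: Int16Array
): Promise<string> {
  return provider.transcribe(pcm);
}
```

Because handleAudio depends only on the interface, swapping builtinSTT for customSTT is a one-line change at the call site.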

These voice tools operate alongside Cloudflare’s broader agent ecosystem. You can combine voice input with other recent capabilities, such as allowing your voice agent to securely query private databases via Cloudflare Mesh or directly process communications through the Cloudflare Email Service.

Client-Side Integration

Cloudflare provides specific React hooks to manage the browser-side audio capture and playback.

The useVoiceAgent hook handles full conversational flows, managing both the microphone stream and the returning audio playback. The useVoiceInput hook handles standalone speech-to-text scenarios like dictation or voice search. For non-React applications, the VoiceClient class provides a framework-agnostic alternative to manage the WebSocket connection.

The SDK includes built-in logic for conversational interruptions. If a user begins speaking while the agent is currently playing synthesized audio, the client automatically cancels the ongoing playback. This prevents audio overlap and simulates natural conversation flow without requiring you to write custom client-side buffering and interruption logic.
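The barge-in behavior amounts to a small state machine. This sketch models the logic only, so you can see what the client is doing on your behalf; the SDK implements this internally and none of these names are its API:

```typescript
// Models the interruption ("barge-in") rule: if the user starts speaking
// while synthesized audio is playing, cancel the ongoing playback.
class PlaybackController {
  private playing = false;
  cancelled = 0; // count of interrupted playbacks

  startPlayback(): void {
    this.playing = true;
  }

  finishPlayback(): void {
    this.playing = false;
  }

  // Called when voice activity is detected on the microphone stream.
  onUserSpeech(): void {
    if (this.playing) {
      this.playing = false; // drop the agent's in-flight audio
      this.cancelled += 1;
    }
  }

  isPlaying(): boolean {
    return this.playing;
  }
}
```

Speech detected while the agent is silent is a no-op, so ordinary turn-taking never triggers a spurious cancel.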

Limitations and Tradeoffs

The SDK requires explicit client-side handling to format the audio stream correctly. Audio captured through standard browser media APIs typically arrives as Float32 samples at 44.1 or 48 kHz, so it must be downsampled and converted to the strict 16 kHz mono PCM format before transmission over the WebSocket connection.
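A minimal conversion sketch follows. It uses naive decimation with no anti-aliasing filter, which is the simplest thing that produces the right format; a production pipeline should low-pass filter before downsampling:

```typescript
// Convert Float32 audio (as delivered by Web Audio, e.g. 48 kHz) into
// 16 kHz mono PCM16 suitable for a binary WebSocket frame.
function toPCM16kHz(input: Float32Array, inputRate: number): Int16Array {
  const ratio = inputRate / 16000; // e.g. 3 for 48 kHz input
  const outLen = Math.floor(input.length / ratio);
  const out = new Int16Array(outLen);
  for (let i = 0; i < outLen; i++) {
    // Naive decimation: take the nearest source sample. Real pipelines
    // should low-pass filter first to avoid aliasing artifacts.
    const s = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    // Map [-1, 1] floats onto the signed 16-bit integer range.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting Int16Array's underlying buffer can be sent directly as a binary WebSocket frame, matching the 16 kHz mono PCM transport described above.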

The current pipeline is tied directly to the Workers AI ecosystem for its low-latency benefits. If you are exploring broader multi-agent coordination patterns that span multiple cloud providers, integrating this specific WebSocket audio transport with external agent networks requires custom adapter layers.

Review the @cloudflare/voice package requirements in your existing Agents SDK project before deploying. Configure your Durable Objects to handle the additional SQLite storage overhead generated by persistent conversational audio transcripts.
