How to Use Amazon Polly's Bidirectional Streaming API
Learn how to use Amazon Polly’s new HTTP/2 bidirectional streaming to reduce latency in real-time conversational AI by streaming text and audio simultaneously.
Amazon Polly’s new Bidirectional Streaming API reduces the latency of real-time voice agents by synthesizing audio while still receiving text. The update introduces the StartSpeechSynthesisStream operation over HTTP/2. You can now stream text tokens directly from an LLM to Polly and play the resulting audio concurrently. Here is how to configure the API, manage speech timing, and navigate the current SDK limitations.
How Bidirectional Streaming Works
Traditional text-to-speech requires the complete text before synthesis can begin. This creates an input bottleneck for conversational AI. The new API eliminates this delay by using full-duplex communication over HTTP/2. You can send text word-by-word or token-by-token as the LLM generates it.
This approach pairs naturally with streaming LLM responses directly into the synthesis engine. The API processes the inbound text stream while simultaneously returning an outbound stream of audio events over the same connection. Because the two streams are processed concurrently, the integration can keep pace with human conversational speed.
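The full-duplex pattern can be sketched with two concurrent tasks: one feeds text tokens in, the other consumes audio events as they arrive. The sketch below uses a mock stream object so it runs as-is; the real AWS SDK event-stream types and the `StartSpeechSynthesisStream` call signature differ, so treat the `sendText`/`audioEvents` names as hypothetical stand-ins.

```typescript
type AudioEvent = { audioChunk: string };

// Mock duplex stream: turns each inbound text token into an "audio" event.
// In the real integration this is the bidirectional HTTP/2 connection.
function createMockSynthesisStream() {
  const queue: AudioEvent[] = [];
  let closed = false;
  let wake: (() => void) | null = null;

  return {
    sendText(token: string) {
      queue.push({ audioChunk: `<audio:${token}>` });
      wake?.(); // notify the consumer that new data is available
    },
    endInput() {
      closed = true;
      wake?.();
    },
    async *audioEvents(): AsyncGenerator<AudioEvent> {
      while (true) {
        while (queue.length > 0) yield queue.shift()!;
        if (closed) return;
        await new Promise<void>((resolve) => (wake = resolve));
      }
    },
  };
}

async function run(): Promise<string[]> {
  const stream = createMockSynthesisStream();

  // Producer: send tokens as the LLM emits them (simulated with a timer).
  const producer = (async () => {
    for (const token of ["Hello", " world", "."]) {
      stream.sendText(token);
      await new Promise((resolve) => setTimeout(resolve, 5));
    }
    stream.endInput();
  })();

  // Consumer: handle audio chunks concurrently, as they arrive.
  const received: string[] = [];
  for await (const event of stream.audioEvents()) {
    received.push(event.audioChunk); // e.g., write to an audio device
  }
  await producer;
  return received;
}
```

The key property is that the consumer loop starts before the producer finishes: audio playback is never blocked on the end of the text stream.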
This native bidirectional support also simplifies cloud architecture. Developers building AI agents previously relied on complex Lambda-based workarounds to manage and stitch together small audio chunks. The new API removes the need for this intermediate processing layer entirely.
Supported SDKs and Regions
The bidirectional API requires an HTTP/2 compatible AWS SDK. Several common environments do not currently support the bidirectional streaming operation.
| Category | Supported Options |
|---|---|
| AWS SDKs | Java 2.x, JavaScript v3, .NET v4, C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift |
| Regions | US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Asia Pacific (Singapore) |
| Not Supported | Python, AWS CLI (v1/v2), PowerShell, .NET v3 |
Applications deployed outside the four supported regions will incur cross-region latency. Routing audio streams across regions can negate the performance benefits of the bidirectional protocol.
Generative Engine Requirements
Bidirectional streaming is exclusively available for Polly’s Generative engine. You must specify this engine in your request configuration to use the StartSpeechSynthesisStream operation.
The API supports a wide range of voices and locales on this engine. This includes the 10 highly expressive generative voices recently added to the service, such as Tiffany, Brian, Aria, and Jasmine. These generative voices span eight locales: American English, British English, New Zealand English, Singapore English, French, Italian, German, and Swiss German. Review the exact parameters for calling these voices in the Amazon Polly documentation.
Managing Speech Timing with Flush Configuration
Streaming token-by-token requires precise control over when the synthesized audio actually plays. The API includes a Flush configuration to manage this pacing.
Invoking a flush command forces the API to immediately synthesize all currently buffered text. This prevents the generative engine from waiting for additional context before speaking. You can use flush triggers at natural conversational pauses, such as punctuation marks or sentence boundaries, to maintain a realistic cadence. Proper use of flush commands dictates how natural the pacing sounds to the end user.
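A flush heuristic like the one described above can be expressed as a small buffering function. The sketch below is illustrative, not part of the Polly API: `flushPoints` accumulates LLM tokens and yields a segment whenever the buffer ends at a sentence boundary, which is where you would send the buffered text followed by a flush event.

```typescript
// Matches a sentence-ending mark, optionally followed by a closing
// quote/bracket and trailing whitespace.
const SENTENCE_END = /[.!?]["')\]]?\s*$/;

function shouldFlush(buffered: string): boolean {
  return SENTENCE_END.test(buffered);
}

// Accumulate tokens; yield a segment at each natural flush point.
function* flushPoints(tokens: Iterable<string>): Generator<string> {
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    if (shouldFlush(buffer)) {
      yield buffer.trim(); // send buffered text + flush event here
      buffer = "";
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush any trailing text
}
```

For example, `flushPoints(["Hi", " there", ".", " How", " are", " you", "?"])` yields `"Hi there."` and then `"How are you?"`, so each sentence is synthesized as soon as it is complete rather than waiting for the full response.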
Tradeoffs and Limitations
The strict SDK requirements dictate backend architecture. The lack of Python support means many AI backend services cannot use the feature natively. Teams using Python for their primary LLM orchestration must route their text streams through an intermediate service built in Node.js, Go, or Rust to access the API.
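One way to bridge a Python orchestrator to the API is a thin Node.js relay: the Python side streams tokens to the relay over chunked HTTP, and the relay forwards each chunk into the Polly input stream. The sketch below stubs out the Polly call (it only counts forwarded chunks) so the relay pattern itself is runnable; the `/speak` route and response shape are assumptions, not a published interface.

```typescript
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/speak") {
    res.writeHead(404).end();
    return;
  }
  let forwarded = 0;
  req.on("data", (chunk: Buffer) => {
    // Real implementation: write `chunk` into the Polly bidirectional
    // input event stream as it arrives.
    forwarded += 1;
  });
  req.on("end", () => {
    // Real implementation: signal end-of-input, then relay Polly's
    // audio events back to the caller.
    res.writeHead(200, { "content-type": "application/json" });
    res.end(JSON.stringify({ forwarded }));
  });
});

// Port 0 picks a free port; the Python side POSTs token chunks to /speak.
server.listen(0);
```

Because the request body is consumed event-by-event rather than buffered whole, the relay adds minimal latency on top of the cross-language hop.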
Update your target AWS SDK to the latest version to access the StartSpeechSynthesisStream API. Map your LLM output tokens to the supported SDK input stream, and configure your flush triggers around standard sentence boundaries to optimize the audio cadence.