How to Use Amazon Polly's Bidirectional Streaming API
Learn how to use Amazon Polly’s new HTTP/2 bidirectional streaming to reduce latency in real-time conversational AI by streaming text and audio simultaneously.
Amazon Polly’s new Bidirectional Streaming API reduces the latency of real-time voice agents by synthesizing audio while still receiving text. The update introduces the StartSpeechSynthesisStream operation over HTTP/2. You can now stream text tokens directly from an LLM to Polly and play the resulting audio concurrently. Here is how to configure the API, manage speech timing, and navigate the current SDK limitations.
How Bidirectional Streaming Works
Traditional text-to-speech requires the complete text before synthesis can begin. This creates an input bottleneck for conversational AI. The new API eliminates this delay by using full-duplex communication over HTTP/2. You can send text word-by-word or token-by-token as the LLM generates it.
This approach pairs naturally with streaming LLM responses directly into the synthesis engine. The API processes the inbound text stream while simultaneously returning an outbound stream of audio events over the same connection. Because the two streams are processed concurrently, the integration can keep pace with human conversational speed.
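The full-duplex pattern can be sketched with two concurrent tasks: one feeds text tokens in, the other consumes audio events as they arrive. The sketch below uses a mock stream object so it runs as-is; the real AWS SDK event-stream types and the `StartSpeechSynthesisStream` call signature differ, so treat the `sendText`/`audioEvents` names as hypothetical stand-ins.

```typescript
type AudioEvent = { audioChunk: string };

// Mock duplex stream: turns each inbound text token into an "audio" event.
// In the real integration this is the bidirectional HTTP/2 connection.
function createMockSynthesisStream() {
  const queue: AudioEvent[] = [];
  let closed = false;
  let wake: (() => void) | null = null;

  return {
    sendText(token: string) {
      queue.push({ audioChunk: `<audio:${token}>` });
      wake?.(); // notify the consumer that new data is available
    },
    endInput() {
      closed = true;
      wake?.();
    },
    async *audioEvents(): AsyncGenerator<AudioEvent> {
      while (true) {
        while (queue.length > 0) yield queue.shift()!;
        if (closed) return;
        await new Promise<void>((resolve) => (wake = resolve));
      }
    },
  };
}

async function run(): Promise<string[]> {
  const stream = createMockSynthesisStream();

  // Producer: send tokens as the LLM emits them (simulated with a timer).
  const producer = (async () => {
    for (const token of ["Hello", " world", "."]) {
      stream.sendText(token);
      await new Promise((resolve) => setTimeout(resolve, 5));
    }
    stream.endInput();
  })();

  // Consumer: handle audio chunks concurrently, as they arrive.
  const received: string[] = [];
  for await (const event of stream.audioEvents()) {
    received.push(event.audioChunk); // e.g., write to an audio device
  }
  await producer;
  return received;
}
```

The key property is that the consumer loop starts before the producer finishes: audio playback is never blocked on the end of the text stream.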
This native bidirectional support also simplifies cloud architecture. Developers building AI agents previously relied on complex Lambda-based workarounds to manage and stitch together small audio chunks. The new API removes the need for this intermediate processing layer entirely.
Supported SDKs and Regions
The bidirectional API requires an HTTP/2 compatible AWS SDK. Several common environments do not currently support the bidirectional streaming operation.
| Category | Supported Options |
|---|---|
| AWS SDKs | Java 2.x, JavaScript v3, .NET v4, C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, Swift |
| Regions | US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Asia Pacific (Singapore) |
| Not Supported | Python, AWS CLI (v1/v2), PowerShell, .NET v3 |
Applications deployed outside the four supported regions will incur cross-region latency. Routing audio streams across regions can negate the performance benefits of the bidirectional protocol.
Generative Engine Requirements
Bidirectional streaming is exclusively available for Polly’s Generative engine. You must specify this engine in your request configuration to use the StartSpeechSynthesisStream operation.
The API supports a wide range of voices and locales on this engine. This includes the 10 highly expressive generative voices recently added to the service, such as Tiffany, Brian, Aria, and Jasmine. These generative voices span eight locales: American English, British English, New Zealand English, Singapore English, French, Italian, German, and Swiss German. Review the exact parameters for calling these voices in the Amazon Polly documentation.
Managing Speech Timing with Flush Configuration
Streaming token-by-token requires precise control over when the synthesized audio actually plays. The API includes a Flush configuration to manage this pacing.
Invoking a flush command forces the API to immediately synthesize all currently buffered text. This prevents the generative engine from waiting for additional context before speaking. You can use flush triggers at natural conversational pauses, such as punctuation marks or sentence boundaries, to maintain a realistic cadence. Proper use of flush commands dictates how natural the pacing sounds to the end user.
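A flush heuristic like the one described above can be expressed as a small buffering function. The sketch below is illustrative, not part of the Polly API: `flushPoints` accumulates LLM tokens and yields a segment whenever the buffer ends at a sentence boundary, which is where you would send the buffered text followed by a flush event.

```typescript
// Matches a sentence-ending mark, optionally followed by a closing
// quote/bracket and trailing whitespace.
const SENTENCE_END = /[.!?]["')\]]?\s*$/;

function shouldFlush(buffered: string): boolean {
  return SENTENCE_END.test(buffered);
}

// Accumulate tokens; yield a segment at each natural flush point.
function* flushPoints(tokens: Iterable<string>): Generator<string> {
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    if (shouldFlush(buffer)) {
      yield buffer.trim(); // send buffered text + flush event here
      buffer = "";
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush any trailing text
}
```

For example, `flushPoints(["Hi", " there", ".", " How", " are", " you", "?"])` yields `"Hi there."` and then `"How are you?"`, so each sentence is synthesized as soon as it is complete rather than waiting for the full response.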
Tradeoffs and Limitations
The strict SDK requirements dictate backend architecture. The lack of Python support means many AI backend services cannot use the feature natively. Teams using Python for their primary LLM orchestration must route their text streams through an intermediate service built in Node.js, Go, or Rust to access the API.
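One way to bridge a Python orchestrator to the API is a thin Node.js relay: the Python side streams tokens to the relay over chunked HTTP, and the relay forwards each chunk into the Polly input stream. The sketch below stubs out the Polly call (it only counts forwarded chunks) so the relay pattern itself is runnable; the `/speak` route and response shape are assumptions, not a published interface.

```typescript
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/speak") {
    res.writeHead(404).end();
    return;
  }
  let forwarded = 0;
  req.on("data", (chunk: Buffer) => {
    // Real implementation: write `chunk` into the Polly bidirectional
    // input event stream as it arrives.
    forwarded += 1;
  });
  req.on("end", () => {
    // Real implementation: signal end-of-input, then relay Polly's
    // audio events back to the caller.
    res.writeHead(200, { "content-type": "application/json" });
    res.end(JSON.stringify({ forwarded }));
  });
});

// Port 0 picks a free port; the Python side POSTs token chunks to /speak.
server.listen(0);
```

Because the request body is consumed event-by-event rather than buffered whole, the relay adds minimal latency on top of the cross-language hop.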
Update your target AWS SDK to the latest version to access the StartSpeechSynthesisStream API. Map your LLM output tokens to the supported SDK input stream, and configure your flush triggers around standard sentence boundaries to optimize the audio cadence.