
Gemini 3.1 Flash Live Launches for Real-Time Audio AI

Google launched Gemini 3.1 Flash Live, a low-latency audio-to-audio model for real-time dialogue, voice agents, and Search Live.

Google released Gemini 3.1 Flash Live on March 26 as a preview audio-to-audio model for real-time dialogue, exposed in the Gemini API as gemini-3.1-flash-live-preview. For developers building voice agents, call automation, or multimodal assistants, this launch matters because Google paired the model update with published benchmarks, production pricing, and immediate availability through the Gemini Live API.

Product scope

Gemini 3.1 Flash Live is Google’s latest low-latency voice model for live interaction. It launched in preview for developers in Google AI Studio, in Gemini Enterprise for Customer Experience, and in consumer products including Search Live and Gemini Live, with Search Live expanding to more than 200 countries and territories where AI Mode is available.

Google positions the model around four concrete improvements: lower-latency interaction, better detection of acoustic nuance, stronger reasoning in voice conversations, and better task execution under real conversational conditions. If you build systems that depend on interruption handling, live tool use, or long spoken workflows, those are the parts of the stack that moved.

Benchmark performance

Google published a stronger benchmark case than usual for a voice release, especially around function calling and multi-step execution.

| Benchmark | Gemini 3.1 Flash Live | Prior Gemini baseline | Notable comparison |
|---|---|---|---|
| ComplexFuncBench Audio | 90.8% | Gemini 2.5 Flash Native Audio 12-2025: 71.5% | Gemini 2.5 Flash Native Audio 09-2025: 66.0% |
| Audio Multi Challenge (Thinking High) | 36.1% | Gemini 2.5 Flash Native Audio 12-2025, Thinking High: 21.5% | GPT-Realtime 1.5: 34.7% |
| Big Bench Audio, Speech Reasoning (Thinking High) | 95.9% | Gemini 2.5 Flash Native Audio 12-2025, Thinking High: 90.7% | Step-Audio R1.1 (Realtime): 97.0% |

The most important number is 90.8% on ComplexFuncBench Audio. Google adapted the benchmark to audio prompts, and the gain over its December 2025 native audio model is large. If your voice agent spends more time invoking tools than chatting, this is the result to pay attention to. It suggests the model upgrade is aimed at operational reliability, not just voice quality.
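Operational reliability in a tool-heavy voice agent mostly comes down to routing model-emitted function calls to real handlers and surfacing failures cleanly. Here is a minimal sketch of that dispatch layer; the payload shape (`{"name": ..., "args": ...}`) and the `lookup_order` tool are illustrative assumptions, not the Live API's exact wire format.

```python
def lookup_order(order_id: str) -> dict:
    # Hypothetical business tool; a real agent would call an internal API.
    return {"order_id": order_id, "status": "shipped"}

# Registry of tools the agent has declared to the model.
TOOLS = {"lookup_order": lookup_order}

def dispatch(call: dict) -> dict:
    """Route a model-emitted function call to a registered handler."""
    handler = TOOLS.get(call["name"])
    if handler is None:
        return {"error": f"unknown tool: {call['name']}"}
    try:
        return handler(**call.get("args", {}))
    except TypeError as exc:  # model supplied bad or missing arguments
        return {"error": str(exc)}

result = dispatch({"name": "lookup_order", "args": {"order_id": "A-1001"}})
print(result)  # {'order_id': 'A-1001', 'status': 'shipped'}
```

The error branches matter as much as the happy path: benchmarks like ComplexFuncBench stress exactly the cases where the model names a wrong tool or malforms an argument, and the agent has to recover mid-conversation.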

The Audio Multi Challenge result is also useful because it shows the effect of reasoning configuration. Gemini 3.1 Flash Live scored 36.1% with thinking set to high and 26.8% with thinking minimal. For production teams, this is the same tradeoff already familiar from text agents: capability rises with more reasoning, and cost and latency usually rise with it. Work on evaluating agents and function calling applies directly here.

API details that affect implementation

The Live API uses a stateful WebSocket connection. It supports audio, image, and text inputs, with audio input in raw 16-bit PCM, 16kHz, little-endian and audio output in raw 16-bit PCM, 24kHz, little-endian.
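In practice, "raw 16-bit PCM, 16kHz, little-endian" means packing your captured float samples into that exact byte layout before streaming them over the WebSocket. A minimal sketch using only the standard library (the 440 Hz test tone is just stand-in data):

```python
import math
import struct

SAMPLE_RATE_IN = 16_000  # Live API audio input: 16-bit PCM, 16 kHz, little-endian

def to_pcm16le(samples: list[float]) -> bytes:
    """Pack float samples in [-1.0, 1.0] as raw 16-bit little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)  # "<" = little-endian, "h" = int16

# 100 ms of a 440 Hz test tone at the required input rate.
n = SAMPLE_RATE_IN // 10
tone = [0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE_IN) for i in range(n)]
chunk = to_pcm16le(tone)
print(len(chunk))  # 3200 bytes: 1600 samples x 2 bytes each
```

Output audio comes back in the same layout at 24kHz, so the unpack direction uses the same `struct` format string against a different sample rate.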

Google lists 70 supported languages for Live API conversations. The platform also supports barge-in, tool use, audio transcriptions, proactive audio, and affective dialog.

Those details matter at the integration boundary. If your existing voice stack is optimized around different sample rates or browser-native formats, you will need a resampling step. If you want direct client connections, Google recommends ephemeral tokens rather than shipping standard API keys in production apps.
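The resampling step itself can be as simple as linear interpolation for a prototype; a sketch under that assumption (a production stack would use a proper windowed-sinc or polyphase resampler to avoid aliasing):

```python
def resample_linear(samples: list[float], src_hz: int, dst_hz: int) -> list[float]:
    """Naive linear-interpolation resampler; fine as a sketch, not
    production-grade (no anti-aliasing filter)."""
    if src_hz == dst_hz:
        return list(samples)
    out_len = int(len(samples) * dst_hz / src_hz)
    out = []
    for i in range(out_len):
        pos = i * src_hz / dst_hz     # fractional position in the source
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + frac * nxt)
    return out

# One second captured at a browser-typical 48 kHz, downsampled to the
# Live API's 16 kHz input rate.
captured = [0.0] * 48_000
mic_16k = resample_linear(captured, 48_000, 16_000)
print(len(mic_16k))  # 16000
```

The same function covers the other direction, upsampling the model's 24kHz output to whatever rate your playback device expects.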

Google is also signaling ecosystem intent through supported integrations including LiveKit, Pipecat by Daily, Fishjam by Software Mansion, Vision Agents by Stream, Voximplant, and the Firebase AI SDK. If you already use one of those frameworks, the path from prototype to production is shorter than building transport and session orchestration yourself. For teams choosing orchestration layers, the tradeoffs look similar to broader agent framework decisions, except real-time media transport becomes part of the architecture.

Pricing and production math

Google published pricing for the preview model immediately:

| Meter | Price |
|---|---|
| Text input | $0.75 / 1M tokens |
| Audio input | $3.00 / 1M tokens or $0.005/min |
| Image/video input | $1.00 / 1M tokens or $0.002/min |
| Text output (including thinking tokens) | $4.50 / 1M tokens |
| Audio output | $12.00 / 1M tokens or $0.018/min |
| Grounding with Google Search | $14 / 1,000 queries after free allotment |

The expensive side of live voice remains output audio. If you run long customer support sessions, outbound speech is where your bill grows fastest. If you build high-volume assistants, this is where techniques from reducing API costs start to matter, especially around session design, turn length, and when you really need spoken output instead of text.
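A quick back-of-the-envelope using the per-minute rates from the table above makes the asymmetry concrete (the 6-minute call profile is an illustrative assumption, and this ignores text/thinking tokens and grounding queries, which bill separately):

```python
AUDIO_IN_PER_MIN = 0.005   # audio input, per-minute rate from the pricing table
AUDIO_OUT_PER_MIN = 0.018  # audio output, per-minute rate

def session_cost(input_min: float, output_min: float) -> float:
    """Audio-only cost of one live session, in dollars."""
    return input_min * AUDIO_IN_PER_MIN + output_min * AUDIO_OUT_PER_MIN

# A 6-minute support call: caller speaks ~3 min, agent speaks ~2 min.
per_call = session_cost(3, 2)
print(round(per_call, 3))        # 0.051 -> output audio is ~70% of the bill
print(round(10_000 * per_call))  # 510 dollars/day at 10k calls
```

Even though the agent talks less than the caller here, output audio dominates at 3.6x the input rate, which is why shorter spoken turns and falling back to text where possible move the bill more than trimming input.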

Product implications

Gemini 3.1 Flash Live is the clearest signal yet that Google wants one voice model family spanning API developers, enterprise CX deployments, and consumer products. The same model family powering Search Live and Gemini Live gives Google a large deployment surface for latency tuning, multilingual behavior, and interruption handling.

Google also says generated audio is watermarked with SynthID. If you ship voice experiences in regulated or public-facing settings, provenance is becoming part of the product decision, not an optional safety add-on.

If you are building voice agents in 2026, test gemini-3.1-flash-live-preview on the tasks that usually fail in production: interrupted turns, noisy speech, multi-step tool calls, and emotional escalation. Those are the areas Google is explicitly targeting, and they are the areas most likely to determine whether your system behaves like a demo or a dependable agent.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
