Gemini 3.1 Flash Live Launches for Real-Time Audio AI
Google launched Gemini 3.1 Flash Live, a low-latency audio-to-audio model for real-time dialogue, voice agents, and Search Live.
Google released Gemini 3.1 Flash Live on March 26 as a preview audio-to-audio model for real-time dialogue, exposed in the Gemini API as gemini-3.1-flash-live-preview. For developers building voice agents, call automation, or multimodal assistants, this launch matters because Google paired the model update with published benchmarks, production pricing, and immediate availability through the Gemini Live API.
Product scope
Gemini 3.1 Flash Live is Google’s latest low-latency voice model for live interaction. It launched in preview for developers in Google AI Studio, in Gemini Enterprise for Customer Experience, and in consumer products including Search Live and Gemini Live, with Search Live expanding to more than 200 countries and territories where AI Mode is available.
Google positions the model around four concrete improvements: lower-latency interaction, better detection of acoustic nuance, stronger reasoning in voice conversations, and better task execution under real conversational conditions. If you build systems that depend on interruption handling, live tool use, or long spoken workflows, those are the parts of the stack that moved.
Benchmark performance
Google published a stronger benchmark case than usual for a voice release, especially around function calling and multi-step execution.
| Benchmark | Gemini 3.1 Flash Live | Prior Gemini baseline | Notable comparison |
|---|---|---|---|
| ComplexFuncBench Audio | 90.8% | Gemini 2.5 Flash Native Audio 12-2025: 71.5% | Gemini 2.5 Flash Native Audio 09-2025: 66.0% |
| Audio Multi Challenge, Thinking High | 36.1% | Gemini 2.5 Flash Native Audio 12-2025, Thinking High: 21.5% | GPT-Realtime 1.5: 34.7% |
| Big Bench Audio, Speech Reasoning, Thinking High | 95.9% | Gemini 2.5 Flash Native Audio 12-2025, Thinking High: 90.7% | Step-Audio R1.1 (Realtime): 97.0% |
The most important number is 90.8% on ComplexFuncBench Audio. Google adapted the benchmark to audio prompts, and the gain over its December 2025 native audio model is large. If your voice agent spends more time invoking tools than chatting, this is the result to pay attention to. It suggests the model upgrade is aimed at operational reliability, not just voice quality.
The Audio Multi Challenge result is also useful because it shows the effect of reasoning configuration. Gemini 3.1 Flash Live scored 36.1% with thinking set to high and 26.8% with thinking set to minimal. For production teams, this is the same tradeoff already familiar from text agents: capability rises with more reasoning, and cost and latency usually rise with it. Work on evaluating agents and function calling applies directly here.
API details that affect implementation
The Live API uses a stateful WebSocket connection. It supports audio, image, and text inputs, with audio input as raw 16-bit, 16 kHz, little-endian PCM and audio output as raw 16-bit, 24 kHz, little-endian PCM.
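As a rough sketch of what that looks like in practice, the snippet below opens a Live API session with the google-genai Python SDK, sends one second of silence as 16 kHz PCM, and collects the 24 kHz PCM the model streams back. Treat the exact client methods and config fields as assumptions; they may vary across SDK versions.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main() -> None:
    # One stateful WebSocket session per conversation.
    async with client.aio.live.connect(
        model="gemini-3.1-flash-live-preview", config=config
    ) as session:
        # Upstream audio is raw 16-bit, 16 kHz, little-endian PCM.
        one_second_of_silence = b"\x00\x00" * 16_000
        await session.send_realtime_input(
            audio=types.Blob(data=one_second_of_silence, mime_type="audio/pcm;rate=16000")
        )
        # The model streams back raw 16-bit, 24 kHz, little-endian PCM chunks.
        audio_out = bytearray()
        async for message in session.receive():
            if message.data:
                audio_out.extend(message.data)

asyncio.run(main())
```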
Google lists 70 supported languages for Live API conversations. The platform also supports barge-in, tool use, audio transcriptions, proactive audio, and affective dialog.
Those details matter at the integration boundary. If your existing voice stack is optimized around different sample rates or browser-native formats, you will need a resampling step. If you want direct client connections, Google recommends ephemeral tokens rather than shipping standard API keys in production apps.
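If you do need that resampling step, here is a minimal sketch, assuming float32 samples in [-1, 1] captured from a browser at 44.1 or 48 kHz; a production pipeline would normally use a proper polyphase resampler instead of linear interpolation.

```python
import numpy as np

def to_live_api_pcm(samples: np.ndarray, source_rate: int, target_rate: int = 16_000) -> bytes:
    """Convert float32 audio in [-1, 1] to raw 16-bit, little-endian PCM at target_rate."""
    if source_rate != target_rate:
        duration = len(samples) / source_rate
        n_out = int(duration * target_rate)
        # Linear interpolation keeps the example dependency-free;
        # swap in scipy.signal.resample_poly for better anti-aliasing.
        src_t = np.linspace(0.0, duration, num=len(samples), endpoint=False)
        dst_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        samples = np.interp(dst_t, src_t, samples)
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()  # little-endian int16
```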
Google is also signaling ecosystem intent through supported integrations including LiveKit, Pipecat by Daily, Fishjam by Software Mansion, Vision Agents by Stream, Voximplant, and the Firebase AI SDK. If you already use one of those frameworks, the path from prototype to production is shorter than building transport and session orchestration yourself. For teams choosing orchestration layers, the tradeoffs look similar to broader agent framework decisions, except real-time media transport becomes part of the architecture.
Pricing and production math
Google published pricing for the preview model immediately:
| Meter | Price |
|---|---|
| Text input | $0.75 / 1M tokens |
| Audio input | $3.00 / 1M tokens or $0.005/min |
| Image/video input | $1.00 / 1M tokens or $0.002/min |
| Text output, including thinking tokens | $4.50 / 1M tokens |
| Audio output | $12.00 / 1M tokens or $0.018/min |
| Grounding with Google Search | $14 / 1,000 queries after free allotment |
The expensive side of live voice remains output audio. If you run long customer support sessions, outbound speech is where your bill grows fastest. If you build high-volume assistants, this is where techniques from reducing API costs start to matter, especially around session design, turn length, and when you really need spoken output instead of text.
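As a back-of-envelope check against the per-minute meters above, here is a small audio-only estimate; it ignores text and thinking tokens, and the speaking fraction is an illustrative assumption.

```python
AUDIO_IN_PER_MIN = 0.005   # $/min, audio input meter from the table above
AUDIO_OUT_PER_MIN = 0.018  # $/min, audio output meter from the table above

def session_audio_cost(minutes: float, model_speaking_fraction: float = 0.5) -> float:
    """Estimate audio metering for one live session.

    Assumes input audio streams for the whole session and the model speaks for
    model_speaking_fraction of it; the 0.5 default is an illustrative assumption.
    """
    return minutes * AUDIO_IN_PER_MIN + minutes * model_speaking_fraction * AUDIO_OUT_PER_MIN

# A 10-minute support call where the agent talks half the time:
# 10 * $0.005 + 5 * $0.018 = $0.14 in audio metering alone.
print(f"${session_audio_cost(10):.2f}")
```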
Product implications
Gemini 3.1 Flash Live is the clearest signal yet that Google wants one voice model family spanning API developers, enterprise CX deployments, and consumer products. The same model family powering Search Live and Gemini Live gives Google a large deployment surface for latency tuning, multilingual behavior, and interruption handling.
Google also says generated audio is watermarked with SynthID. If you ship voice experiences in regulated or public-facing settings, provenance is becoming part of the product decision, not an optional safety add-on.
If you are building voice agents in 2026, test gemini-3.1-flash-live-preview on the tasks that usually fail in production: interrupted turns, noisy speech, multi-step tool calls, and emotional escalation. Those are the areas Google is explicitly targeting, and they are the areas most likely to determine whether your system behaves like a demo or a dependable agent.
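One lightweight way to run those checks repeatably is a small scenario registry you replay against every model revision. The fixture paths, tool names, and the run_scenario hook below are hypothetical placeholders for whatever session harness you already use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceScenario:
    name: str
    audio_fixture: str            # path to a recorded prompt, e.g. with background noise
    expected_tool_calls: list[str]

# The failure modes called out above, expressed as replayable fixtures (hypothetical paths/tools).
SCENARIOS = [
    VoiceScenario("interrupted_turn", "fixtures/barge_in.wav", []),
    VoiceScenario("noisy_speech", "fixtures/cafe_noise.wav", []),
    VoiceScenario("multi_step_tools", "fixtures/book_and_confirm.wav", ["search_flights", "create_booking"]),
    VoiceScenario("emotional_escalation", "fixtures/angry_customer.wav", ["escalate_to_human"]),
]

def evaluate(run_scenario: Callable[[VoiceScenario], list[str]]) -> dict[str, bool]:
    """Replay each fixture and check that the expected tools were invoked, in order."""
    return {s.name: run_scenario(s) == s.expected_tool_calls for s in SCENARIOS}
```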