
Gemini 3.1 Flash-Lite Ships 1M Context at $0.25 Per Million

Google's lowest-latency Gemini model is now generally available, introducing variable thinking levels and a 1M token context window for high-volume routing.

Google has moved Gemini 3.1 Flash-Lite to general availability for developers using Google AI Studio and Vertex AI. The release ends a two-month preview period and positions the model as the most cost-efficient option in the Gemini 3 series. The architecture targets high-frequency orchestration, multimodality, and rapid tool execution.

Performance and Architecture

Gemini 3.1 Flash-Lite maintains a 1,048,576 token input context window and a 65,536 token output limit. It natively processes text, PDFs, images, up to 8.4 hours of audio, and approximately 45 minutes of video.

For speed, the model delivers a 2.5x faster time to first token than Gemini 2.5 Flash, and total output generation is 45% faster. Google reports a p95 latency of 1.8 seconds for full responses, dropping to sub-second latency for classifiers and basic tool calling.

The model introduces selectable thinking levels, allowing developers to toggle between minimal, low, medium, and high reasoning capacity. This provides granular control over latency and compute for specific orchestration tasks.
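As a sketch of how a thinking level might be selected per request, the helper below builds a generateContent-style JSON body. The "thinkingLevel" field name is an assumption inferred from the article's description, not a confirmed API field; check the current API reference before relying on it.

```python
import json

# Thinking levels named in the article.
ALLOWED_LEVELS = {"minimal", "low", "medium", "high"}

def build_request(prompt: str, thinking_level: str = "minimal") -> str:
    """Return a JSON request body for gemini-3.1-flash-lite.

    The thinkingConfig/thinkingLevel field names are assumptions for
    illustration; verify them against the official API reference.
    """
    if thinking_level not in ALLOWED_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(ALLOWED_LEVELS)}")
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
    }
    return json.dumps(body)

print(build_request("Classify: refund request or bug report?", "minimal"))
```

For a simple classifier in a routing layer, "minimal" trades reasoning depth for the lowest latency; a harder planning step can request "high" on the same endpoint.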

Benchmark Results

Despite the lightweight footprint, the model demonstrates strong reasoning capabilities on standard evaluation metrics.

Metric                  Score
Arena.ai Leaderboard    1432 Elo
GPQA Diamond            86.9%

Pricing Structure

The API pricing aggressively targets high-volume generation and AI agents that require continuous background processing.

Token Type     Cost per 1M Tokens
Input          $0.25
Output         $1.50
Cache Write    $1.00
Cache Read     $0.025

Asynchronous workloads can leverage the Batch API for an additional 50% discount on these base rates.
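A back-of-the-envelope calculator using the rates in the table above (USD per 1M tokens), with the 50% Batch API discount applied as a flat multiplier:

```python
# Base rates from the pricing table, in USD per 1M tokens.
RATES = {"input": 0.25, "output": 1.50, "cache_write": 1.00, "cache_read": 0.025}

def estimate_cost(tokens: dict, batch: bool = False) -> float:
    """Estimate cost for a token-count mapping, e.g. {"input": 2_000_000}.

    batch=True applies the Batch API's 50% discount to the base rates.
    """
    cost = sum(RATES[kind] * count / 1_000_000 for kind, count in tokens.items())
    return cost * 0.5 if batch else cost

# 10M input + 1M output tokens, synchronous vs. batched:
print(estimate_cost({"input": 10_000_000, "output": 1_000_000}))              # 4.0
print(estimate_cost({"input": 10_000_000, "output": 1_000_000}, batch=True))  # 2.0
```

At these rates a batched workload of 10M input and 1M output tokens costs about $2.00, which is where the "continuous background processing" use case becomes economical.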

Migration and Availability

Developers currently running the preview model must migrate to the GA endpoint. The gemini-3.1-flash-lite-preview model string is deprecated as of May 11, 2026, with a final shutdown scheduled for May 25, 2026.
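The migration itself is a model-string swap. The preview identifier comes from the article; the GA identifier below is an assumption (the article does not spell it out), so confirm it against the official model list:

```python
# Preview string from the article; GA string is an assumed identifier.
DEPRECATED_MODEL = "gemini-3.1-flash-lite-preview"  # shuts down May 25, 2026
GA_MODEL = "gemini-3.1-flash-lite"                  # assumed GA identifier

def migrate_model_name(name: str) -> str:
    """Map the deprecated preview identifier to the GA one; pass others through."""
    return GA_MODEL if name == DEPRECATED_MODEL else name

print(migrate_model_name("gemini-3.1-flash-lite-preview"))
```

Running a sweep like this over service configs before the cutoff avoids hard failures when the preview endpoint is shut down.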

Early integrations emphasize the model’s utility in live environments. OffDeal uses the model for real-time investment research during live calls, while Astrocade applies it for multimodal safety checks. Krea.ai leverages the endpoint to format structured JSON output for prompt enhancement workflows.
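For the structured-JSON use case above, a request can constrain output to a schema. The responseMimeType and responseSchema field names follow the publicly documented Gemini API; the schema itself is a made-up example, not Krea.ai's actual workflow:

```python
import json

def enhancement_request(prompt: str) -> str:
    """Build a request body that asks for schema-constrained JSON output.

    The schema here (enhanced_prompt, style_tags) is illustrative only.
    """
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseMimeType": "application/json",
            "responseSchema": {
                "type": "object",
                "properties": {
                    "enhanced_prompt": {"type": "string"},
                    "style_tags": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["enhanced_prompt"],
            },
        },
    }
    return json.dumps(body)

print(enhancement_request("a castle at dusk"))
```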

If you maintain high-frequency orchestration layers, update your endpoint references before the May 25 cutoff and test the new minimal thinking level to further reduce latency on simple routing and classification tasks.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
