Gemini 3.1 Flash-Lite Ships 1M Context at $0.25 Per Million
Google's lowest-latency Gemini model is now generally available, introducing variable thinking levels and a 1M token context window for high-volume routing.
Google has moved Gemini 3.1 Flash-Lite to general availability for developers using Google AI Studio and Vertex AI. The release ends a two-month preview period and positions the model as the most cost-efficient option in the Gemini 3 series. It targets high-frequency orchestration, multimodal processing, and rapid tool execution.
Performance and Architecture
Gemini 3.1 Flash-Lite supports a 1,048,576-token input context window and a 65,536-token output limit. It natively processes text, PDFs, images, up to 8.4 hours of audio, and approximately 45 minutes of video.
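For reference, long-form media is typically passed by uploading it once through the Files API and then referencing the uploaded object in the prompt. A minimal sketch, assuming the google-genai Python SDK's documented upload pattern and a GA model identifier of `gemini-3.1-flash-lite` (the identifier and file path are illustrative):

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

# Upload a long recording once; the Files API returns a reference
# that can be reused across prompts.
audio = client.files.upload(file="earnings_call.mp3")  # illustrative path

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents=["List the action items discussed in this call.", audio],
)
print(response.text)
```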
For speed, the model delivers a 2.5x faster time to first token than Gemini 2.5 Flash, and total output generation is 45% faster. Google reports a p95 latency of 1.8 seconds for full responses, dropping to sub-second times for classifiers and basic tool calling.
The model introduces selectable thinking levels, letting developers choose among minimal, low, medium, and high reasoning capacity. This provides granular control over latency and compute for specific orchestration tasks.
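A minimal sketch of selecting a level, assuming the google-genai Python SDK surfaces the feature through its existing thinking configuration; the `minimal` value follows the naming in this release and should be checked against the current SDK enum:

```python
from google import genai
from google.genai import types

client = genai.Client()

# Dial reasoning capacity down for a latency-sensitive routing call.
response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents="Route this ticket: 'My invoice total is wrong.'",
    config=types.GenerateContentConfig(
        # "minimal" is the level named in the release; verify the
        # accepted values against the SDK documentation.
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
    ),
)
print(response.text)
```

Raising the level to high trades latency for deeper reasoning on the same endpoint, so one deployment can serve both cheap classification calls and harder synthesis tasks.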
Benchmark Results
Despite the lightweight footprint, the model demonstrates strong reasoning capabilities on standard evaluation metrics.
| Metric | Score |
|---|---|
| Arena.ai Leaderboard | 1432 Elo |
| GPQA Diamond | 86.9% |
Pricing Structure
The API pricing aggressively targets high-volume generation and AI agents that require continuous background processing.
| Token Type | Cost per 1M Tokens |
|---|---|
| Input | $0.25 |
| Output | $1.50 |
| Cache Write | $1.00 |
| Cache Read | $0.025 |
Asynchronous workloads can leverage the Batch API for an additional 50% discount on these base rates.
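To make the arithmetic concrete, here is a small cost sketch applying the published rates and the batch discount; the token volumes are illustrative:

```python
# Published GA rates in dollars per 1M tokens.
INPUT_RATE, OUTPUT_RATE = 0.25, 1.50
BATCH_DISCOUNT = 0.50  # the Batch API halves the base rates

def estimated_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate spend for a token volume, optionally at batch pricing."""
    multiplier = (1 - BATCH_DISCOUNT) if batch else 1.0
    return multiplier * (
        input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
    )

# Example: 2B input and 500M output tokens per month.
print(estimated_cost(2_000_000_000, 500_000_000))              # 1250.0 at base rates
print(estimated_cost(2_000_000_000, 500_000_000, batch=True))  # 625.0 via Batch API
```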
Migration and Availability
Developers currently running the preview model must migrate to the GA endpoint. The `gemini-3.1-flash-lite-preview` model string will be deprecated on May 11, 2026, with a final shutdown scheduled for May 25, 2026.
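The switch itself should be a one-line identifier change. A sketch, assuming the GA endpoint simply drops the `-preview` suffix (confirm the exact string in the model documentation):

```python
from google import genai

client = genai.Client()

# Before: preview identifier, shut down after May 25, 2026.
# MODEL = "gemini-3.1-flash-lite-preview"

# After: assumed GA identifier; the suffix-free name is an assumption here.
MODEL = "gemini-3.1-flash-lite"

response = client.models.generate_content(
    model=MODEL,
    contents="Summarize this release note in one sentence.",
)
print(response.text)
```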
Early integrations highlight the model's utility in live environments. OffDeal uses the model for real-time investment research during live calls, Astrocade applies it to multimodal safety checks, and Krea.ai uses the endpoint to generate structured JSON output for prompt-enhancement workflows.
If you maintain high-frequency orchestration layers, update your endpoint references before the May 25 cutoff and test the new minimal thinking level to further reduce latency on simple routing and classification tasks.
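Putting those two recommendations together, here is a hedged sketch of a low-latency router that pairs the minimal thinking level with structured JSON output; the `RouteDecision` schema and the level name are illustrative, not official API surface:

```python
from google import genai
from google.genai import types
from pydantic import BaseModel

class RouteDecision(BaseModel):
    """Illustrative schema for a routing classifier."""
    category: str    # e.g. "billing", "technical", "sales"
    confidence: float

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents="Classify: 'I was double-charged for my subscription.'",
    config=types.GenerateContentConfig(
        # Minimal reasoning keeps latency low on simple classification.
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
        # Constrain the response to JSON matching the schema.
        response_mime_type="application/json",
        response_schema=RouteDecision,
    ),
)
print(response.parsed)  # parsed RouteDecision instance
```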