Gemini 3.1 Flash-Lite Ships 1M Context at $0.25 Per Million
Google's lowest-latency Gemini model is now generally available, introducing variable thinking levels and a 1M token context window for high-volume routing.
Google has moved Gemini 3.1 Flash-Lite to general availability for developers using Google AI Studio and Vertex AI. The release ends a two-month preview period and positions the model as the most cost-efficient option in the Gemini 3 series. It targets high-frequency orchestration, multimodal processing, and rapid tool execution.
Performance and Architecture
Gemini 3.1 Flash-Lite supports a 1,048,576-token input context window and a 65,536-token output limit. It natively processes text, PDFs, images, up to 8.4 hours of audio, and approximately 45 minutes of video.
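For reference, long-form media is typically passed by uploading it once through the Files API and then referencing the uploaded object in the prompt. A minimal sketch, assuming the google-genai Python SDK's documented upload pattern and a GA model identifier of `gemini-3.1-flash-lite` (the identifier and file path are illustrative):

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

# Upload a long recording once; the Files API returns a reference
# that can be reused across prompts.
audio = client.files.upload(file="earnings_call.mp3")  # illustrative path

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents=["List the action items discussed in this call.", audio],
)
print(response.text)
```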
For speed, the model delivers a 2.5x faster time to first token than Gemini 2.5 Flash, and total output generation is 45% faster. Google reports a p95 latency of 1.8 seconds for full responses, dropping to sub-second times for classifiers and basic tool calling.
The model introduces selectable thinking levels, letting developers choose among minimal, low, medium, and high reasoning capacity. This provides granular control over latency and compute for specific orchestration tasks.
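A minimal sketch of selecting a level, assuming the google-genai Python SDK surfaces the feature through its existing thinking configuration; the `minimal` value follows the naming in this release and should be checked against the current SDK enum:

```python
from google import genai
from google.genai import types

client = genai.Client()

# Dial reasoning capacity down for a latency-sensitive routing call.
response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents="Route this ticket: 'My invoice total is wrong.'",
    config=types.GenerateContentConfig(
        # "minimal" is the level named in the release; verify the
        # accepted values against the SDK documentation.
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
    ),
)
print(response.text)
```

Raising the level to high trades latency for deeper reasoning on the same endpoint, so one deployment can serve both cheap classification calls and harder synthesis tasks.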
Benchmark Results
Despite the lightweight footprint, the model demonstrates strong reasoning capabilities on standard evaluation metrics.
| Metric | Score |
|---|---|
| Arena.ai Leaderboard | 1432 Elo |
| GPQA Diamond | 86.9% |
Pricing Structure
The API pricing aggressively targets high-volume generation and AI agents that require continuous background processing.
| Token Type | Cost per 1M Tokens |
|---|---|
| Input | $0.25 |
| Output | $1.50 |
| Cache Write | $1.00 |
| Cache Read | $0.025 |
Asynchronous workloads can leverage the Batch API for an additional 50% discount on these base rates.
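To make the arithmetic concrete, here is a small cost sketch applying the published rates and the batch discount; the token volumes are illustrative:

```python
# Published GA rates in dollars per 1M tokens.
INPUT_RATE, OUTPUT_RATE = 0.25, 1.50
BATCH_DISCOUNT = 0.50  # the Batch API halves the base rates

def estimated_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate spend for a token volume, optionally at batch pricing."""
    multiplier = (1 - BATCH_DISCOUNT) if batch else 1.0
    return multiplier * (
        input_tokens / 1e6 * INPUT_RATE + output_tokens / 1e6 * OUTPUT_RATE
    )

# Example: 2B input and 500M output tokens per month.
print(estimated_cost(2_000_000_000, 500_000_000))              # 1250.0 at base rates
print(estimated_cost(2_000_000_000, 500_000_000, batch=True))  # 625.0 via Batch API
```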
Migration and Availability
Developers currently running the preview model must migrate to the GA endpoint. The `gemini-3.1-flash-lite-preview` model string will be deprecated on May 11, 2026, with a final shutdown scheduled for May 25, 2026.
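The switch itself should be a one-line identifier change. A sketch, assuming the GA endpoint simply drops the `-preview` suffix (confirm the exact string in the model documentation):

```python
from google import genai

client = genai.Client()

# Before: preview identifier, shut down after May 25, 2026.
# MODEL = "gemini-3.1-flash-lite-preview"

# After: assumed GA identifier; the suffix-free name is an assumption here.
MODEL = "gemini-3.1-flash-lite"

response = client.models.generate_content(
    model=MODEL,
    contents="Summarize this release note in one sentence.",
)
print(response.text)
```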
Early integrations highlight the model's utility in live environments. OffDeal uses the model for real-time investment research during live calls, Astrocade applies it to multimodal safety checks, and Krea.ai uses the endpoint to generate structured JSON output for prompt-enhancement workflows.
If you maintain high-frequency orchestration layers, update your endpoint references before the May 25 cutoff and test the new minimal thinking level to further reduce latency on simple routing and classification tasks.
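Putting those two recommendations together, here is a hedged sketch of a low-latency router that pairs the minimal thinking level with structured JSON output; the `RouteDecision` schema and the level name are illustrative, not official API surface:

```python
from google import genai
from google.genai import types
from pydantic import BaseModel

class RouteDecision(BaseModel):
    """Illustrative schema for a routing classifier."""
    category: str    # e.g. "billing", "technical", "sales"
    confidence: float

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",  # assumed GA identifier
    contents="Classify: 'I was double-charged for my subscription.'",
    config=types.GenerateContentConfig(
        # Minimal reasoning keeps latency low on simple classification.
        thinking_config=types.ThinkingConfig(thinking_level="minimal"),
        # Constrain the response to JSON matching the schema.
        response_mime_type="application/json",
        response_schema=RouteDecision,
    ),
)
print(response.parsed)  # parsed RouteDecision instance
```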