Google Gemini API Adds Flex and Priority Tiers for Scale
Google launches Flex and Priority inference tiers for the Gemini API, offering developers new ways to optimize costs and reliability for AI workflows.
Google launched two new synchronous inference tiers for the Gemini API to give developers granular control over workload execution. The addition of Flex and Priority inference bridges the gap between standard real-time requests and the 24-hour asynchronous Batch API. You now have distinct routing options for background processing and latency-critical production workloads.
Inference Tiers Compared
The update introduces two routing options at opposite ends of the cost-latency spectrum, complementing the existing standard API tier. Flex Inference targets cost reduction, while Priority Inference trades a price premium for speed and guaranteed execution.
| Tier | Pricing | Latency Target | Reliability |
|---|---|---|---|
| Flex | 50% discount | 1 to 15 minutes | Sheddable (best-effort) |
| Standard | Baseline | Standard | Standard |
| Priority | 75% to 100% premium | Milliseconds to seconds | Non-sheddable |
Flex Tier for Background Workflows
Flex routing relies on opportunistic off-peak compute. Traffic sent to this tier is sheddable, meaning Google will preempt your requests during spikes in standard traffic. This structural tradeoff yields a 50% cost reduction for workloads that can tolerate latency targets of 1 to 15 minutes.
If you evaluate AI output using LLM-as-a-judge patterns, Flex provides a synchronous alternative to managing batch job queues. It handles offline evaluation, background CRM updates, and data enrichment pipelines where real-time responses are unnecessary. Flex requests do not receive extended rate limits and count directly against your general usage quotas.
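The workload split described above can be sketched as a small routing helper. This is illustrative only: the function name and latency thresholds are assumptions based on the latency targets quoted in this article, not part of any Google SDK.

```python
def pick_service_tier(max_latency_tolerance_s: float, latency_critical: bool) -> str:
    """Choose a Gemini service tier following the article's guidance:
    priority for latency-critical paths (millisecond-to-second responses),
    flex for background jobs that can tolerate the full 1-15 minute
    turnaround window, and standard for everything in between.
    Thresholds are illustrative, not official."""
    if latency_critical:
        return "priority"
    if max_latency_tolerance_s >= 15 * 60:  # can absorb the flex window
        return "flex"
    return "standard"

# An offline LLM-as-a-judge evaluation can wait an hour; a support bot cannot.
print(pick_service_tier(3600, latency_critical=False))  # flex
print(pick_service_tier(1, latency_critical=True))      # priority
```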
Priority Tier for Critical Paths
Priority traffic routes directly to high-criticality compute queues. These requests are never preempted. The tier delivers response times in the millisecond to second range for a 75% to 100% pricing premium over standard rates.
Priority limits operate independently from standard quotas, typically capped at 0.3x of the model’s standard rate limit. If your application exceeds these dedicated limits, requests undergo graceful degradation. Instead of returning 429 rate limit errors, the API automatically downgrades excess requests to the standard tier for processing. This fallback mechanism ensures continuity for live content moderation, fraud detection, and real-time customer support bots.
Implementation and New Models
Activating these tiers requires passing the `service_tier` parameter in your API payload, such as `config={"service_tier": "flex"}`. The routing changes arrived alongside the rollout of the Gemini 3.1 model family, which includes Pro Preview and Flash-Lite. Google also released Gemma 4 in 26B and 31B parameter sizes alongside the API updates.
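As a minimal sketch, a request body carrying the tier parameter might be assembled like this. The overall payload shape and the helper name are assumptions modeled on the `config={"service_tier": "flex"}` snippet above, not a confirmed SDK interface.

```python
import json

def build_generate_content_body(prompt: str, service_tier: str = "flex") -> dict:
    """Assemble a generateContent-style request body that includes the
    service_tier routing parameter described in the article. The payload
    structure is an assumption for illustration, not official."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "config": {"service_tier": service_tier},
    }

body = build_generate_content_body("Summarize this support ticket backlog.")
print(json.dumps(body["config"]))  # {"service_tier": "flex"}
```

Switching a workload between tiers then reduces to changing one string in the request, which keeps the routing decision out of your application logic.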
Enforced Billing Controls
These inference options align with a major overhaul of the Gemini API billing infrastructure effective April 1, 2026. Google now mandates prepaid billing for new users. This structure prevents unauthorized charge accumulation, addressing the bug from March 16 that caused unexpected billing spikes for some developers.
The infrastructure includes enforced monthly billing account spend caps across all projects. You can also configure optional project-level limits to pause API requests when specific thresholds are met. If you manage high-volume AI inference across multiple environments, these hard caps prevent runaway costs.
Review your current application workloads to separate immediate user-facing requests from asynchronous tasks. Moving your background data processing to the Flex tier cuts the cost of those requests in half without requiring you to build and maintain separate batch polling infrastructure.