Google Gemini API Adds Flex and Priority Tiers for Scale
Google launches Flex and Priority inference tiers for the Gemini API, offering developers new ways to optimize costs and reliability for AI workflows.
Google launched two new synchronous inference tiers for the Gemini API to give developers granular control over workload execution. The addition of Flex and Priority inference bridges the gap between standard real-time requests and the 24-hour asynchronous Batch API. You now have distinct routing options for background processing and latency-critical production workloads.
Inference Tiers Compared
The update introduces two routing extremes to complement the existing standard API tier. Flex Inference targets cost reduction, while Priority Inference charges a premium for non-sheddable, low-latency execution.
| Tier | Pricing | Latency Target | Reliability |
|---|---|---|---|
| Flex | 50% discount | 1 to 15 minutes | Sheddable (best-effort) |
| Standard | Baseline | Standard | Standard |
| Priority | 75% to 100% premium | Milliseconds to seconds | Non-sheddable |
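To make the pricing column concrete, here is a minimal sketch of the arithmetic. The multipliers come straight from the table; the function name and the idea of expressing each tier as a (low, high) range are illustrative choices, not part of the Gemini API.

```python
# Price multipliers relative to the standard tier, taken from the table above.
# Each entry is a (low, high) range because Priority is quoted as a range.
TIER_MULTIPLIERS = {
    "flex": (0.5, 0.5),       # 50% discount
    "standard": (1.0, 1.0),   # baseline
    "priority": (1.75, 2.0),  # 75% to 100% premium
}


def tier_cost_range(standard_cost: float, tier: str) -> tuple[float, float]:
    """Return the (low, high) cost of a workload on the given tier.

    `standard_cost` is what the same workload would cost on the standard tier.
    """
    low, high = TIER_MULTIPLIERS[tier]
    return (standard_cost * low, standard_cost * high)


# A workload that costs 100.0 on the standard tier:
print(tier_cost_range(100.0, "flex"))      # (50.0, 50.0)
print(tier_cost_range(100.0, "priority"))  # (175.0, 200.0)
```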
Flex Tier for Background Workflows
Flex routing relies on opportunistic off-peak compute. Traffic sent to this tier is sheddable, meaning Google may preempt your requests during spikes in standard traffic. In exchange, you get a 50% cost reduction for workloads that can tolerate latency targets of 1 to 15 minutes.
If you evaluate AI output using LLM-as-a-judge patterns, Flex provides a synchronous alternative to managing batch job queues. It handles offline evaluation, background CRM updates, and data enrichment pipelines where real-time responses are unnecessary. Flex requests do not receive extended rate limits and count directly against your general usage quotas.
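Because Flex traffic is sheddable, background jobs that use it will likely want a retry path. The sketch below shows one way to wrap a request in exponential backoff. It is not tied to any SDK: `RequestShed` is a hypothetical exception standing in for whatever error a shed request actually surfaces as, and `send` is any zero-argument callable you supply.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RequestShed(Exception):
    """Hypothetical error type standing in for a shed Flex request."""


def call_with_retry(
    send: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `send`, retrying with exponential backoff when it is shed.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ... by default).
    The final failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RequestShed:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the wait can be swapped out in tests. Given the 1 to 15 minute latency target, a real pipeline would probably use a longer base delay than the default here.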
Priority Tier for Critical Paths
Priority traffic routes directly to high-criticality compute queues. These requests are never preempted. The tier delivers response times in the millisecond to second range for a 75% to 100% pricing premium over standard rates.
Priority limits operate independently from standard quotas, typically capped at 0.3x of the model’s standard rate limit. If your application exceeds these dedicated limits, requests undergo graceful degradation. Instead of returning 429 rate limit errors, the API automatically downgrades excess requests to the standard tier for processing. This fallback mechanism ensures continuity for live content moderation, fraud detection, and real-time customer support bots.
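The overflow behavior can be modeled with a few lines of arithmetic. This is a simplified sketch of the rule described above, not Google's implementation: it assumes the cap is exactly 0.3x of the standard limit (the article says "typically"), and it does not model what happens once the standard quota is also exhausted. The function names are illustrative.

```python
def priority_cap(standard_limit: int, ratio: float = 0.3) -> int:
    """Priority rate limit derived from a model's standard rate limit."""
    return round(standard_limit * ratio)


def route_requests(requested: int, standard_limit: int) -> dict[str, int]:
    """Split Priority requests in one rate-limit window by serving tier.

    Requests up to the Priority cap stay on Priority; the excess is
    downgraded to the standard tier instead of being rejected with a 429.
    """
    cap = priority_cap(standard_limit)
    served_priority = min(requested, cap)
    downgraded = requested - served_priority
    return {"priority": served_priority, "standard": downgraded}


# With a standard limit of 1000 requests per window, the Priority cap is 300.
print(route_requests(450, 1000))  # {'priority': 300, 'standard': 150}
```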
Implementation and New Models
Activating these tiers requires passing the service_tier parameter in your API payload, such as config={"service_tier": "flex"}. The routing changes arrived alongside the rollout of the Gemini 3.1 model family, which includes Pro Preview and Flash-Lite. Google also released Gemma 4 in 26B and 31B parameter sizes at the same time.
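One way to keep tier selection out of individual call sites is a small helper that maps a workload category to a config fragment. The sketch below only builds the dictionary and makes no API call. The "flex" value comes from the example above; the "priority" value and the assumption that omitting service_tier selects the standard tier are guesses by analogy, so check them against the API reference. The workload category names are made up for illustration.

```python
def config_for(workload: str) -> dict[str, str]:
    """Return a request config fragment for a workload category.

    Categories (hypothetical names):
      "background"  -> Flex, for latency-tolerant jobs
      "critical"    -> Priority, for latency-critical paths
      "interactive" -> no service_tier key (assumed to mean standard)
    """
    if workload == "background":
        return {"service_tier": "flex"}
    if workload == "critical":
        # Assumed value; only "flex" appears in the article's example.
        return {"service_tier": "priority"}
    if workload == "interactive":
        return {}
    raise ValueError(f"unknown workload category: {workload!r}")


print(config_for("background"))  # {'service_tier': 'flex'}
```

The returned dictionary would be merged into whatever config object your client passes with the request.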
Enforced Billing Controls
These inference options arrive alongside a major overhaul of the Gemini API billing infrastructure, effective April 1, 2026. Google now mandates prepaid billing for new users. This structure limits unexpected charge accumulation and responds to the March 16 bug that caused billing spikes for some developers.
The infrastructure includes enforced monthly billing account spend caps across all projects. You can also configure optional project-level limits to pause API requests when specific thresholds are met. If you manage high-volume AI inference across multiple environments, these hard caps prevent runaway costs.
Review your current application workloads to separate immediate user-facing requests from latency-tolerant tasks. Moving background data processing to the Flex tier cuts the cost of those requests by 50% without requiring you to build and maintain separate batch polling infrastructure.