Google Gemini API Adds Flex and Priority Tiers for Scale

Google launched two new synchronous inference tiers for the Gemini API to give developers granular control over workload execution. The addition of Flex and Priority inference bridges the gap between standard real-time requests and the 24-hour asynchronous Batch API. You now have distinct routing options for background processing and latency-critical production workloads.

Inference Tiers Compared

The update introduces two routing extremes to complement the existing standard API tier. Flex Inference targets cost reduction, while Priority Inference targets reliability and low latency at a higher price.

Tier	Pricing	Latency Target	Reliability
Flex	50% discount	1 to 15 minutes	Sheddable (best-effort)
Standard	Baseline	Standard	Standard
Priority	75% to 100% premium	Milliseconds to seconds	Non-sheddable
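To make the pricing column concrete, here is a minimal sketch that converts a standard-tier cost into the equivalent cost on each tier. The multipliers come straight from the table; the function name and the $100 figure in the usage note are illustrative assumptions, not part of the Gemini API.

```python
# Price multipliers relative to the standard tier, taken from the table above.
# Priority is quoted as a range (75% to 100% premium), so each entry is (low, high).
TIER_MULTIPLIERS = {
    "flex": (0.5, 0.5),
    "standard": (1.0, 1.0),
    "priority": (1.75, 2.0),
}


def tier_cost_range(standard_cost: float, tier: str) -> tuple[float, float]:
    """Return the (low, high) cost of a workload if it ran on the given tier.

    standard_cost is what the same workload would cost at standard rates.
    """
    low, high = TIER_MULTIPLIERS[tier]
    return (standard_cost * low, standard_cost * high)
```

For a hypothetical workload that costs $100 at standard rates, tier_cost_range(100.0, "flex") gives (50.0, 50.0) and tier_cost_range(100.0, "priority") gives (175.0, 200.0).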

Flex Tier for Background Workflows

Flex routing relies on opportunistic off-peak compute. Traffic sent to this tier is sheddable, meaning Google may preempt your requests during spikes in standard traffic. In exchange, you get a 50% cost reduction for workloads that can tolerate latency targets of 1 to 15 minutes.

If you evaluate AI output using LLM-as-a-judge patterns, Flex provides a synchronous alternative to managing batch job queues. It handles offline evaluation, background CRM updates, and data enrichment pipelines where real-time responses are unnecessary. Flex requests do not receive extended rate limits and count directly against your general usage quotas.
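Because Flex traffic is sheddable, a background job probably needs some way to retry a request that was preempted. The sketch below shows one common approach, exponential backoff. It does not call the Gemini API: RequestShed is a hypothetical exception standing in for whatever error the API actually returns when a request is shed, and send is any callable you supply that performs the request.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RequestShed(Exception):
    """Hypothetical placeholder for the error raised when a Flex request is shed."""


def call_with_backoff(
    send: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying with exponential backoff each time it is shed."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RequestShed:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the wrapper can be tested without waiting. Since Flex requests count against your general quota, keep max_attempts small enough that retries do not crowd out other traffic.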

Priority Tier for Critical Paths

Priority traffic routes directly to high-criticality compute queues. These requests are never preempted. The tier delivers response times in the millisecond to second range for a 75% to 100% pricing premium over standard rates.

Priority limits operate independently from standard quotas, typically capped at 0.3x of the model’s standard rate limit. If your application exceeds these dedicated limits, requests undergo graceful degradation. Instead of returning 429 rate limit errors, the API automatically downgrades excess requests to the standard tier for processing. This fallback mechanism ensures continuity for live content moderation, fraud detection, and real-time customer support bots.
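The downgrade behavior is easier to reason about with numbers. This sketch estimates how one minute of Priority traffic would split between the Priority tier and the standard-tier fallback, assuming the typical 0.3x cap described above. The function name and the rate-limit figures are hypothetical; check your own model's limits before relying on the result.

```python
def split_priority_traffic(
    requests_per_minute: int,
    standard_rate_limit: int,
    priority_percent: int = 30,
) -> tuple[int, int]:
    """Return (served_at_priority, downgraded_to_standard) for one minute.

    priority_percent=30 encodes the typical 0.3x cap; integer math avoids
    floating-point rounding when computing the limit.
    """
    priority_limit = standard_rate_limit * priority_percent // 100
    served = min(requests_per_minute, priority_limit)
    return served, requests_per_minute - served
```

With a hypothetical standard limit of 1,000 requests per minute, sending 400 Priority requests yields (300, 100): 300 run at Priority and 100 fall back to the standard tier.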

Implementation and New Models

Activating these tiers requires passing the service_tier parameter in your API payload, such as config={"service_tier": "flex"}. The routing changes arrived alongside the rollout of the Gemini 3.1 model family, which includes Pro Preview and Flash-Lite. Google also released Gemma 4 in 26B and 31B parameter sizes at the same time.
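One way to keep tier selection out of individual call sites is a small helper that maps a workload category to the config fragment. The sketch below only builds the dictionary and does not call any SDK. The value "flex" comes from the example above; the value "priority" and the idea that omitting service_tier selects the standard tier are assumptions, and the Workload enum is purely illustrative.

```python
from enum import Enum


class Workload(Enum):
    BACKGROUND = "background"    # evaluations, data enrichment, CRM updates
    INTERACTIVE = "interactive"  # ordinary user-facing requests
    CRITICAL = "critical"        # moderation, fraud detection, live support


# "flex" is the value shown in the article; "priority" is an assumed spelling.
_TIER_FOR_WORKLOAD = {
    Workload.BACKGROUND: "flex",
    Workload.CRITICAL: "priority",
}


def build_config(workload: Workload) -> dict:
    """Return a config dict, setting service_tier only for non-standard tiers.

    Assumes that leaving service_tier out routes the request to the standard tier.
    """
    tier = _TIER_FOR_WORKLOAD.get(workload)
    return {"service_tier": tier} if tier else {}
```

build_config(Workload.BACKGROUND) returns {"service_tier": "flex"}, which you would pass as the config portion of the request payload.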

Enforced Billing Controls

These inference options align with a major overhaul of the Gemini API billing infrastructure effective April 1, 2026. Google now mandates prepaid billing for new users. This structure prevents unauthorized charge accumulation, addressing the bug from March 16 that caused unexpected billing spikes for some developers.

The infrastructure includes enforced monthly billing account spend caps across all projects. You can also configure optional project-level limits to pause API requests when specific thresholds are met. If you manage high-volume AI inference across multiple environments, these hard caps prevent runaway costs.

Review your current application workloads to separate immediate user-facing requests from asynchronous tasks. Moving your background data processing to the Flex tier cuts the cost of those requests by 50% without requiring you to build and maintain separate batch polling infrastructure.
