AI Engineering

Google Gemini API Adds Flex and Priority Tiers for Scale

Google launches Flex and Priority inference tiers for the Gemini API, offering developers new ways to optimize costs and reliability for AI workflows.

Google launched two new synchronous inference tiers for the Gemini API to give developers granular control over workload execution. The addition of Flex and Priority inference bridges the gap between standard real-time requests and the 24-hour asynchronous Batch API. You now have distinct routing options for background processing and latency-critical production workloads.

Inference Tiers Compared

The update introduces two routing extremes to complement the existing standard API tier. Flex Inference targets cost reduction, while Priority Inference guarantees uptime and execution speed.

| Tier | Pricing | Latency Target | Reliability |
|----------|---------------------|-------------------------|-------------------------|
| Flex | 50% discount | 1 to 15 minutes | Sheddable (best-effort) |
| Standard | Baseline | Standard | Standard |
| Priority | 75% to 100% premium | Milliseconds to seconds | Non-sheddable |
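To make the multipliers concrete, here is the arithmetic behind the table using an illustrative baseline price. The $2.00-per-million-token figure is an assumption for the example, not a published Google rate:

```python
# Illustrative cost math for the tier multipliers above.
# The baseline price is an assumption, not Google's actual rate card.
BASELINE_PER_M_TOKENS = 2.00  # assumed $ per 1M tokens on the standard tier

flex_price = BASELINE_PER_M_TOKENS * 0.50      # 50% discount  -> $1.00
priority_low = BASELINE_PER_M_TOKENS * 1.75    # 75% premium   -> $3.50
priority_high = BASELINE_PER_M_TOKENS * 2.00   # 100% premium  -> $4.00

for label, price in [("flex", flex_price),
                     ("standard", BASELINE_PER_M_TOKENS),
                     ("priority (low)", priority_low),
                     ("priority (high)", priority_high)]:
    print(f"{label:>15}: ${price:.2f} per 1M tokens")
```

At these assumed rates, a workload moved from Priority to Flex costs between 3.5x and 4x less per token, which is why the routing decision matters at scale.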

Flex Tier for Background Workflows

Flex routing relies on opportunistic off-peak compute. Traffic sent to this tier is sheddable, meaning Google will preempt your requests during spikes in standard traffic. This structural tradeoff yields a 50% cost reduction for workloads that can tolerate latency targets of 1 to 15 minutes.

If you evaluate AI output using LLM-as-a-judge patterns, Flex provides a synchronous alternative to managing batch job queues. It handles offline evaluation, background CRM updates, and data enrichment pipelines where real-time responses are unnecessary. Flex requests do not receive extended rate limits and count directly against your general usage quotas.

Priority Tier for Critical Paths

Priority traffic routes directly to high-criticality compute queues. These requests are never preempted. The tier delivers response times in the millisecond to second range for a 75% to 100% pricing premium over standard rates.

Priority traffic draws from a dedicated quota that operates independently of standard limits, typically capped at 0.3x of the model's standard rate limit. If your application exceeds this dedicated quota, requests degrade gracefully: instead of returning 429 rate limit errors, the API automatically downgrades excess requests to the standard tier for processing. This fallback mechanism ensures continuity for live content moderation, fraud detection, and real-time customer support bots.
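The fallback behavior described above can be sketched client-side. This is a minimal simulation of the documented policy, not the API's real internals; the rate-limit numbers and all names here are illustrative:

```python
# Sketch of the documented Priority fallback: once the dedicated Priority
# quota (0.3x of a hypothetical standard limit) is exhausted, excess requests
# are served on the standard tier rather than rejected with a 429.
STANDARD_RPM_LIMIT = 1000
PRIORITY_RPM_LIMIT = int(STANDARD_RPM_LIMIT * 0.3)  # dedicated 0.3x quota

priority_used = 0

def route_request(requested_tier: str) -> str:
    """Return the tier a request is actually served on."""
    global priority_used
    if requested_tier == "priority":
        if priority_used < PRIORITY_RPM_LIMIT:
            priority_used += 1
            return "priority"
        return "standard"  # graceful degradation instead of a 429
    return requested_tier

# Send 5 more priority requests than the dedicated quota allows.
served = [route_request("priority") for _ in range(PRIORITY_RPM_LIMIT + 5)]
print(served.count("priority"), served.count("standard"))  # -> 300 5
```

The practical upshot: bursts above the Priority cap still complete, they just lose the latency guarantee for the overflow.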

Implementation and New Models

Activating these tiers requires passing the `service_tier` parameter in your API payload, such as `config={"service_tier": "flex"}`. The routing changes arrived alongside the rollout of the Gemini 3.1 model family, which includes Pro Preview and Flash-Lite. Google also released Gemma 4 in 26B and 31B parameter sizes alongside the API updates.
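A request might look like the following sketch. It assumes the google-genai Python SDK accepts a `service_tier` key in the generation config, per the snippet above; the model name and helper function are placeholders, not part of any published API:

```python
# Sketch of routing a request to a specific inference tier.
# Assumes the SDK accepts "service_tier" in the generation config (per the
# article); model name and the tier_config helper are illustrative.
import os

def tier_config(tier: str, **extra) -> dict:
    """Build a generation config routed to the given service tier."""
    if tier not in ("flex", "standard", "priority"):
        raise ValueError(f"unknown service tier: {tier}")
    return {"service_tier": tier, **extra}

if os.environ.get("GEMINI_API_KEY"):
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-flash-latest",              # placeholder model name
        contents="Summarize this support ticket thread.",
        config=tier_config("flex"),               # discounted, sheddable tier
    )
    print(response.text)
```

Because the tier is a per-request setting, the same client can send evaluation jobs through Flex and user-facing calls through Priority without separate infrastructure.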

Enforced Billing Controls

These inference options align with a major overhaul of the Gemini API billing infrastructure effective April 1, 2026. Google now mandates prepaid billing for new users. This structure prevents unauthorized charge accumulation, addressing the bug from March 16 that caused unexpected billing spikes for some developers.

The infrastructure includes enforced monthly billing account spend caps across all projects. You can also configure optional project-level limits to pause API requests when specific thresholds are met. If you manage high-volume AI inference across multiple environments, these hard caps prevent runaway costs.

Review your current application workloads to separate immediate user-facing requests from asynchronous tasks. Moving background data processing to the Flex tier cuts your API spend on those workloads in half, without requiring you to build and maintain separate batch polling infrastructure.
