
Frozen MTP Drafters Yield 3x Gemini Nano Speedup on Pixel 10

Google has introduced frozen Multi-Token Prediction for Gemini Nano, utilizing lightweight drafter models to triple on-device inference speeds.

Google Research has implemented a new technique called frozen Multi-Token Prediction (frozen MTP) to accelerate Gemini Nano models directly on Pixel hardware. By offloading token drafting to a smaller model without retraining the base weights, the architecture works around the memory-bandwidth bottleneck of standard auto-regressive decoding. This approach yields up to a 3x speedup in token generation for edge deployments. If you build mobile applications requiring low-latency AI inference, this changes what you can expect from local model execution.

Decoupling Prediction from Base Weights

Standard Multi-Token Prediction requires joint training of the primary model and its prediction heads. Google’s “frozen” implementation removes this requirement. The base Gemini Nano model remains unchanged. Instead, a lightweight MTP drafter predicts multiple future tokens simultaneously. The larger target model then verifies these predictions in a single forward pass.
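The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's implementation: `speculative_step`, `generate`, and the toy models are hypothetical names, decoding is greedy, and the target is called once per position for readability where a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interfaces, chosen for illustration only.
Target = Callable[[List[int]], int]            # context -> greedy next token
Drafter = Callable[[List[int], int], List[int]]  # context, k -> k drafted tokens


def speculative_step(target: Target, drafter: Drafter,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    draft = drafter(context, k)  # the drafter proposes k tokens at once
    accepted: List[int] = []
    for i in range(k):
        # A real runtime scores every position in one target forward pass;
        # this loop calls the target per position to keep the sketch simple.
        expected = target(context + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(draft[i])
    # Every drafted token matched, so the same pass yields one extra token.
    accepted.append(target(context + draft))
    return accepted


def generate(target: Target, drafter: Drafter, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate up to max_new_tokens by repeating draft-then-verify rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new_tokens]


# Toy models: the target counts upward modulo 10.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


# The toy drafter makes the same guess, except it is wrong right after a 4.
def toy_drafter(seq: List[int], k: int) -> List[int]:
    draft, last = [], seq[-1]
    for _ in range(k):
        last = 0 if last == 4 else (last + 1) % 10
        draft.append(last)
    return draft
```

Because every emitted token is either confirmed or supplied by the target, the sketch produces the same sequence the target would produce by greedy decoding alone. The drafter only changes how many tokens each round yields.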

This architecture directly addresses the memory-bandwidth constraint inherent in auto-regressive generation: each new token requires moving the model’s parameters from memory to the compute units, so bandwidth rather than arithmetic limits speed. Because the drafter runs in parallel, the target model’s verification pass becomes the only significant serial step. When the drafter predicts a sequence correctly, the system outputs multiple tokens in the time it usually takes to generate one.
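To get a feel for how drafter accuracy translates into throughput, a rough back-of-the-envelope model helps. The sketch below uses the standard speculative-decoding estimate, which assumes each drafted token is accepted independently with probability `alpha`. The function names and the `draft_cost` parameter (the drafter's cost relative to one target pass) are illustrative assumptions, not figures or APIs published by Google.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    alpha: assumed per-token acceptance probability (independent across tokens).
    k: number of tokens the drafter proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float = 0.1) -> float:
    """Rough speedup over one-token-per-pass decoding.

    draft_cost is an assumed ratio: drafting time for one round divided by
    the time of one target forward pass.
    """
    return expected_tokens_per_round(alpha, k) / (1.0 + draft_cost)
```

Under these assumptions, an acceptance rate of 0.8 with four drafted tokens gives about 3.36 tokens per round before drafting overhead. These are illustrative inputs, not measured Gemini Nano figures.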

Hardware Integration on Tensor G5

The optimizations are specifically tuned for the Tensor G5 chip powering the Pixel 10, Pixel 10 Pro, Pixel 10 Pro XL, and Pixel 10 Pro Fold. The system leverages idle compute cycles on the TPU to run the MTP drafter continuously alongside the verification passes.

Google notes that the frozen nature of the technique allows for easier backporting to older devices. However, hardware limitations of the Tensor G3 and G4 will likely keep Pixel 8 and Pixel 9 phones below the full 3x speedup. This acceleration directly benefits new on-device features powered by Gemini Nano v3, including the Live Translate and Camera Coach applications. Reports also indicate this optimization supports Nano Banana 3.0, a specialized model variant for on-device 3D and video creation rolling out in the June 2026 feature drop.

Developer Availability and SDK Access

Third-party integration relies on existing Google mobile infrastructure rather than requiring custom tensor implementations. The frozen MTP capabilities are being embedded directly into LiteRT-LM (the LLM runtime built on LiteRT, formerly TensorFlow Lite) and the Firebase Android SDK.

This release builds on Google’s rollout of MTP drafters for the Gemma 4 family earlier in May. Because the base model remains unmodified, developers integrating Gemini Nano via LiteRT-LM can utilize the hardware acceleration without managing newly trained weights or adjusting their existing prompt tokenization logic.

Update your on-device benchmarking strategies to account for the Tensor G5 TPU optimizations. When executing models locally, you should evaluate whether your application’s token generation latency is constrained by the base model size or the availability of idle compute cycles required for the drafting phase.
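One way to structure such a benchmark is a small harness that reports median tokens per second over several runs, so you can compare the same prompt with and without drafting. This is a generic sketch: `generate_fn` is a hypothetical stand-in for whatever call your app uses to run the model, and nothing here is part of the LiteRT-LM or Firebase APIs.

```python
import statistics
import time
from typing import Callable, List


def tokens_per_second(generate_fn: Callable[[str], List[int]],
                      prompt: str,
                      runs: int = 5,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the median generation throughput over several runs.

    generate_fn: hypothetical callable that runs the model and returns the
    generated tokens for a prompt.
    clock: injectable timer, which makes the harness easy to test.
    """
    rates = []
    for _ in range(runs):
        start = clock()
        tokens = generate_fn(prompt)
        elapsed = clock() - start
        if elapsed > 0:  # skip runs too fast for the clock to resolve
            rates.append(len(tokens) / elapsed)
    if not rates:
        raise RuntimeError("no measurable runs")
    return statistics.median(rates)
```

Using the median rather than the mean reduces the influence of a single slow run, for example one that coincides with thermal throttling or background work competing for the same compute.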
