Frozen MTP Drafters Yield 3x Gemini Nano Speedup on Pixel 10
Google has introduced frozen Multi-Token Prediction for Gemini Nano, utilizing lightweight drafter models to triple on-device inference speeds.
Google Research has implemented a technique called frozen Multi-Token Prediction (frozen MTP) to accelerate Gemini Nano models directly on Pixel hardware. A smaller drafting model proposes upcoming tokens while the base weights stay untouched, which eases the memory-bandwidth bottleneck of standard auto-regressive decoding. The approach yields up to a 3x speedup in token generation for edge deployments. If you build mobile applications that need low-latency AI inference, this raises the performance you can expect from local model execution.
Decoupling Prediction from Base Weights
Standard Multi-Token Prediction requires joint training of the primary model and its prediction heads. Google’s “frozen” implementation removes this requirement: the base Gemini Nano model remains unchanged, and a separate lightweight MTP drafter predicts multiple future tokens simultaneously. The larger target model then verifies these predictions in a single forward pass.
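The draft-then-verify cycle can be illustrated with a minimal Python sketch. This is not Google's implementation: the toy models, the function names (`speculative_step`, `generate`, `toy_draft_k`, `toy_target_next`), and the greedy accept-or-correct rule are all assumptions chosen for illustration, and the per-position loop stands in for what a real runtime would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]   # proposes k tokens at once
TargetFn = Callable[[List[Token]], Token]             # greedy next token


def speculative_step(context: List[Token], draft_k: DraftFn,
                     target_next: TargetFn, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it emits."""
    draft = draft_k(context, k)
    emitted: List[Token] = []
    # A real runtime scores all k positions in a single target forward pass;
    # this loop only models the outcome of that pass.
    for guess in draft:
        expected = target_next(context + emitted)
        if guess != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(guess)
    # Every draft token matched, so the target's pass also yields one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def generate(prompt: List[Token], n_new: int, k: int, draft_k: DraftFn,
             target_next: TargetFn) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return the number of verification rounds."""
    context = list(prompt)
    rounds = 0
    while len(context) - len(prompt) < n_new:
        context.extend(speculative_step(context, draft_k, target_next, k))
        rounds += 1
    return context[len(prompt):len(prompt) + n_new], rounds


def toy_target_next(context: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_k(context: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: mimics the target but always gets token 5 wrong."""
    out: List[Token] = []
    last = context[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:
            nxt = 0  # deliberate drafting error
        out.append(nxt)
        last = nxt
    return out


tokens, rounds = generate([0], 12, 3, toy_draft_k, toy_target_next)
print(tokens, rounds)
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone. The drafter only changes how many verification rounds are needed.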
This architecture directly addresses the memory-bandwidth constraint inherent in auto-regressive generation, where the model's parameters must be moved from memory to compute units for every generated token. Because the lightweight drafter runs in parallel with the target model, its cost is largely hidden, and the target model's verification pass becomes the only significant serial step. Each verification pass loads the weights once but can confirm several tokens. When the drafter correctly predicts a sequence, the system outputs multiple tokens in the time it usually takes to generate one.
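A rough way to reason about the achievable speedup is the standard speculative-decoding estimate, sketched below in Python. It assumes each draft token is accepted independently with the same probability, and that each verification round emits the accepted draft prefix plus one token from the target model. The function names and the `draft_cost_ratio` parameter (drafting cost per round relative to one target pass) are hypothetical, and the example values are illustrative rather than measurements from any device.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Sums accept_rate**i for i in 0..draft_len, which has the closed form
    (1 - a**(k + 1)) / (1 - a) when a < 1.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the relative cost of one round.

    draft_cost_ratio approaches 0 when drafting is fully hidden behind
    the target model's verification pass.
    """
    return expected_tokens_per_pass(accept_rate, draft_len) / (1.0 + draft_cost_ratio)


# Illustrative values only: 80% acceptance, 4 draft tokens, 10% drafting overhead.
print(estimated_speedup(accept_rate=0.8, draft_len=4, draft_cost_ratio=0.1))
```

The estimate shows why acceptance rate matters more than draft length. With a low acceptance rate, extra draft tokens are rarely reached, so lengthening the draft adds cost without adding output.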
Hardware Integration on Tensor G5
The optimizations are specifically tuned for the Tensor G5 chip powering the Pixel 10, Pixel 10 Pro, Pixel 10 Pro XL, and Pixel 10 Pro Fold. The system leverages idle compute cycles on the TPU to run the MTP drafter continuously alongside the verification passes.
Google notes that the frozen nature of the technique allows for easier backporting to older devices. However, hardware limitations of the Tensor G3 and G4 will likely keep Pixel 8 and Pixel 9 phones below the full 3x speedup. The acceleration directly benefits new on-device features powered by Gemini Nano v3, including Live Translate and Camera Coach. Reports also indicate that the optimization supports Nano Banana 3.0, a specialized model variant for on-device 3D and video creation rolling out in the June 2026 feature drop.
Developer Availability and SDK Access
Third-party integration relies on existing Google mobile infrastructure rather than requiring custom tensor implementations. The frozen MTP capabilities are being embedded directly into LiteRT-LM (the LLM runtime built on LiteRT, formerly TensorFlow Lite) and the Firebase Android SDK.
This release builds on Google’s rollout of MTP drafters for the Gemma 4 family earlier in May. Because the base model remains unmodified, developers integrating Gemini Nano via LiteRT-LM can utilize the hardware acceleration without managing newly trained weights or adjusting their existing prompt tokenization logic.
Update your on-device benchmarking strategies to account for the Tensor G5 TPU optimizations. When executing models locally, you should evaluate whether your application’s token generation latency is constrained by the base model size or the availability of idle compute cycles required for the drafting phase.
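One way to structure such a benchmark is to separate time to first token from steady-state decode throughput. The Python sketch below is runtime-agnostic and makes no assumptions about the LiteRT-LM or Firebase APIs: `measure_stream` accepts any iterable of tokens, and `fake_stream` is a hypothetical stand-in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, NamedTuple


class DecodeStats(NamedTuple):
    tokens: int
    time_to_first_token_s: float
    decode_tokens_per_s: float


def measure_stream(stream: Iterable[str]) -> DecodeStats:
    """Consume a token stream and report latency and throughput."""
    start = time.perf_counter()
    first = None
    last = start
    count = 0
    for _ in stream:
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if first is None:
        return DecodeStats(0, float("nan"), 0.0)
    decode_time = last - first
    # Throughput excludes the first token, which is dominated by prompt processing.
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return DecodeStats(count, first - start, rate)


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical token source that emits one token every delay_s seconds."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(20, 0.001))
print(stats)
```

With draft-and-verify decoding, tokens tend to arrive in bursts, so throughput averaged over a long generation is more informative than the gap between any two consecutive tokens.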