Google Graduates LiteRT NPU Acceleration to Production
Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.
Google has promoted its NPU acceleration capabilities within LiteRT to full production. Released on April 23, 2026, the updated framework provides a unified API to abstract vendor-specific NPU SDKs. Developers can now deploy high-performance on-device models across Qualcomm, MediaTek, and Google Tensor hardware using a single workflow.
Compilation Strategies
LiteRT introduces two primary compilation paths for NPU execution. The choice depends on your target deployment environment and whether you prioritize initialization speed or cross-device compatibility. The Google AI Edge Portal provides a benchmarking service covering over 100 popular mobile phones to help you evaluate these options against real-world hardware.
| Strategy | Best For | Technical Advantage |
|---|---|---|
| Offline (AOT) | Known target SoC | Reduces initialization costs and memory footprint |
| Online (JIT) | Platform-agnostic distribution | Minimizes latency via caching on subsequent runs |
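The table can be read as a simple decision rule. The snippet below is a minimal sketch of that rule only; `Strategy` and `choose_strategy` are hypothetical names for illustration and are not part of the LiteRT API.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    """Hypothetical labels for the two compilation paths in the table."""
    AOT = "offline"  # compiled ahead of time for a known target SoC
    JIT = "online"   # compiled on the device, then cached


def choose_strategy(target_soc: Optional[str]) -> Strategy:
    """Return AOT when a target SoC is known, otherwise fall back to JIT."""
    return Strategy.AOT if target_soc else Strategy.JIT
```

For example, `choose_strategy("Snapdragon 8 Elite")` selects the offline path, while `choose_strategy(None)` selects the online path for platform-agnostic distribution.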
Online compilation caches the resulting execution graph. It triggers a re-compilation only when the underlying vendor plugins or the Android device fingerprint change. This makes JIT a practical default when distributing applications through standard app stores.
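The invalidation rule can be illustrated with a small cache keyed on the inputs that force a re-compile. This is a sketch under stated assumptions, not LiteRT's implementation: `cache_key`, `CompilationCache`, and `compile_fn` are hypothetical, and including the model bytes in the key is an assumption added here so that different models do not collide.

```python
import hashlib
from typing import Callable, Dict


def cache_key(model_bytes: bytes, plugin_version: str, device_fingerprint: str) -> str:
    """Derive a key that changes when the model, plugin, or fingerprint changes."""
    digest = hashlib.sha256()
    for part in (model_bytes, plugin_version.encode(), device_fingerprint.encode()):
        # Length-prefix each part so adjacent fields cannot blur together.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class CompilationCache:
    """Hypothetical in-memory cache of compiled graphs."""

    def __init__(self, compile_fn: Callable[[bytes], bytes]) -> None:
        self._compile_fn = compile_fn  # stand-in for the real compiler
        self._store: Dict[str, bytes] = {}
        self.compile_count = 0

    def get(self, model_bytes: bytes, plugin_version: str, device_fingerprint: str) -> bytes:
        key = cache_key(model_bytes, plugin_version, device_fingerprint)
        if key not in self._store:
            # Cache miss: compile once and remember the result.
            self.compile_count += 1
            self._store[key] = self._compile_fn(model_bytes)
        return self._store[key]
```

Repeated calls with the same plugin version and fingerprint reuse the stored graph; changing either value produces a new key and therefore a fresh compilation.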
Zero-Copy Memory Management
Hardware buffer interoperability is the primary mechanism for reducing latency. LiteRT uses AHardwareBuffer to implement zero-copy execution: the NPU reads and writes data directly in the shared buffer instead of working on a copy. This avoids expensive data round-trips through CPU memory during inference, freeing the CPU for concurrent application logic.
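The difference between copying a buffer and sharing it can be shown with a language-level analogy. The sketch below uses Python's built-in `memoryview` and does not touch AHardwareBuffer or any LiteRT API; the `frame` buffer is a hypothetical stand-in for a shared hardware buffer.

```python
# A 16-byte buffer standing in for a shared hardware buffer (assumption).
frame = bytearray(16)

# Copy path: a second allocation that holds a snapshot of the data.
copied = bytes(frame)

# Zero-copy path: a view over the same underlying memory, no new allocation.
view = memoryview(frame)

# The producer writes a new value into the original buffer.
frame[0] = 255

# The snapshot is stale, while the view observes the write immediately.
snapshot_value = copied[0]
shared_value = view[0]
```

After the write, `snapshot_value` still holds the old byte while `shared_value` reflects the update, which is the property a zero-copy consumer relies on.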
Hardware Expansion and Benchmarks
The framework extends beyond mobile devices into AI PCs and industrial edge hardware. LiteRT integrates with OpenVINO for Intel Core Ultra series 2 and 3 processors. For industrial IoT applications, it supports the Qualcomm Dragonwing IQ8 Series, which powers robotics hardware like the Arduino VENTUNO Q.
Google recorded substantial performance gains across specific on-device workloads. Gemma 3 1B achieves a 3x performance gain over GPU execution during the prefill stage on a Samsung Galaxy S25 Ultra. On the Qualcomm Snapdragon 8 Elite, supported operations see up to a 100x speedup over CPU execution and a 10x speedup compared to the GPU.
Application developers are using this headroom to deploy heavier models without melting the host device. Google Meet deployed an Ultra-HD segmentation model that is 25x larger than its predecessor while maintaining the thermal headroom required for 30-minute video sessions. Epic Games uses the NPU in its Live Link Face Android app to hit 30 FPS for real-time computational facial solving.
Review the LiteRT repository for API integration patterns. When preparing to run LLMs locally, profile your models in the Google AI Edge Portal first to determine the most effective compilation strategy for your target device matrix.