Google Graduates LiteRT NPU Acceleration to Production
Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.
Google has promoted its NPU acceleration capabilities within LiteRT to full production. Released on April 23, 2026, the updated framework provides a unified API to abstract vendor-specific NPU SDKs. Developers can now deploy high-performance on-device models across Qualcomm, MediaTek, and Google Tensor hardware using a single workflow.
Compilation Strategies
LiteRT introduces two primary compilation paths for NPU execution. The choice depends on your target deployment environment and whether you prioritize initialization speed or cross-device compatibility. The Google AI Edge Portal provides a benchmarking service covering over 100 popular mobile phones to help you evaluate these options against real-world hardware.
| Strategy | Best For | Technical Advantage |
|---|---|---|
| Offline (AOT) | Known target SoC | Reduces initialization costs and memory footprint |
| Online (JIT) | Platform-agnostic distribution | Minimizes latency via caching on subsequent runs |
Online compilation caches the resulting execution graph. It triggers recompilation only when the underlying vendor plugins or the Android device fingerprint changes. This makes JIT a practical default when distributing applications through standard app stores.
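The invalidation rule above can be sketched as a cache key derived from the inputs that matter. This is a hypothetical illustration in Python, not LiteRT's internal cache implementation; the function and parameter names are invented, but the logic mirrors the stated behavior: same model, plugin version, and device fingerprint means the cached graph is reused.

```python
import hashlib

def cache_key(model_id: str, plugin_version: str, device_fingerprint: str) -> str:
    # Derive a stable key for a JIT-compiled execution graph from the
    # only inputs the article says can invalidate it: the vendor plugin
    # version and the Android device fingerprint.
    raw = f"{model_id}|{plugin_version}|{device_fingerprint}"
    return hashlib.sha256(raw.encode()).hexdigest()

def needs_recompile(cached_key: str, model_id: str,
                    plugin_version: str, device_fingerprint: str) -> bool:
    # Recompile only when any key component has changed since caching.
    return cached_key != cache_key(model_id, plugin_version, device_fingerprint)

key = cache_key("segmenter-v2", "vendor-plugin-2.27", "samsung/s25u:15/...")
# Unchanged inputs reuse the cached graph; a plugin update forces a rebuild.
reuse = not needs_recompile(key, "segmenter-v2", "vendor-plugin-2.27", "samsung/s25u:15/...")
rebuild = needs_recompile(key, "segmenter-v2", "vendor-plugin-2.28", "samsung/s25u:15/...")
```

In practice this is why JIT's first-run cost amortizes: an app store build hits the compile path once per device, then every launch after that loads the cached graph.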
Zero-Copy Memory Management
Hardware buffer interoperability is the primary mechanism for reducing latency. LiteRT uses AHardwareBuffer to implement zero-copy execution: the NPU reads and writes tensor data directly in shared hardware buffers rather than through copies staged in CPU memory. Avoiding those round-trips cuts inference latency and frees the CPU for concurrent application logic.
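The copy-versus-share distinction can be shown in miniature with Python's `memoryview`. This is a conceptual sketch of zero-copy semantics only, not the AHardwareBuffer NDK API: a view aliases the producer's storage, so the consumer sees writes immediately, while a copy is a stale snapshot that had to be transferred.

```python
# Pretend this bytearray is a camera frame living in shared memory.
frame = bytearray(16)

# Copy path: the consumer works on a duplicate, paying for the transfer.
copied = bytes(frame)

# Zero-copy path: a memoryview aliases the same storage, so a write by
# the "accelerator" side is immediately visible to the "CPU" side.
shared = memoryview(frame)
shared[0] = 255

print(frame[0])   # the write landed in the original buffer
print(copied[0])  # the copy still holds the old value
```

The same trade-off drives the AHardwareBuffer design: at camera-frame sizes and 30 FPS, per-frame copies between CPU and NPU memory would dominate the inference budget.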
Hardware Expansion and Benchmarks
The framework extends beyond mobile devices into AI PCs and industrial edge hardware. LiteRT integrates with OpenVINO for Intel Core Ultra series 2 and 3 processors. For industrial IoT applications, it supports the Qualcomm Dragonwing IQ8 Series, which powers robotics hardware like the Arduino VENTUNO Q.
Google recorded substantial performance gains across specific on-device workloads. Gemma 3 1B achieves a 3x performance gain over GPU execution during the prefill stage on a Samsung Galaxy S25 Ultra. On the Qualcomm Snapdragon 8 Elite, supported operations see up to a 100x speedup over CPU execution and a 10x speedup compared to the GPU.
Application developers are spending these latency margins on heavier models without exceeding device thermal limits. Google Meet deployed an Ultra-HD segmentation model 25x larger than its predecessor while maintaining the thermal headroom required for 30-minute video sessions. Epic Games uses the NPU in its Live Link Face Android app to sustain 30 FPS real-time computational facial solving.
Review the LiteRT repository for API integration patterns. When preparing to run LLMs locally, profile your models in the Google AI Edge Portal first to determine the most effective compilation strategy for your target device matrix.