Google Graduates LiteRT NPU Acceleration to Production

Google has promoted LiteRT's NPU acceleration to full production. Released on April 23, 2026, the updated framework provides a unified API that abstracts vendor-specific NPU SDKs. Developers can now deploy high-performance on-device models across Qualcomm, MediaTek, and Google Tensor hardware using a single workflow.

Compilation Strategies

LiteRT introduces two primary compilation paths for NPU execution. The choice depends on your target deployment environment and whether you prioritize initialization speed or cross-device compatibility. The Google AI Edge Portal provides a benchmarking service covering over 100 popular mobile phones to help you evaluate these options against real-world hardware.

Strategy	Best For	Technical Advantage
Offline (AOT)	Known target SoC	Reduces initialization costs and memory footprint
Online (JIT)	Platform-agnostic distribution	Minimizes latency via caching on subsequent runs

Online compilation caches the resulting execution graph and triggers recompilation only when the underlying vendor plugins or the Android device fingerprint changes. This makes JIT a practical default when distributing applications through standard app stores.
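The invalidation rule described above can be illustrated with a minimal sketch. This is not LiteRT's implementation or API: the class and function names are hypothetical, and including the model bytes in the cache key is an assumption added for illustration. The only behavior taken from the article is that a change in the vendor plugin or the device fingerprint forces a recompile.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileContext:
    """Inputs that decide whether a cached compilation is still valid."""
    model_bytes: bytes        # assumption: the model is part of the key too
    plugin_version: str       # hypothetical vendor plugin version string
    device_fingerprint: str   # e.g. the value of android.os.Build.FINGERPRINT


def cache_key(ctx: CompileContext) -> str:
    """Hash every input; any change produces a different key."""
    digest = hashlib.sha256()
    parts = (
        ctx.model_bytes,
        ctx.plugin_version.encode("utf-8"),
        ctx.device_fingerprint.encode("utf-8"),
    )
    for part in parts:
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class CompilationCache:
    """In-memory stand-in for an on-disk cache of compiled graphs."""

    def __init__(self):
        self._store = {}
        self.compile_count = 0  # how many times the slow path actually ran

    def get_or_compile(self, ctx, compile_fn):
        key = cache_key(ctx)
        if key not in self._store:
            # Cache miss: first run, or plugin/fingerprint changed.
            self._store[key] = compile_fn(ctx)
            self.compile_count += 1
        return self._store[key]
```

In this sketch, a second call with an identical context returns the cached graph, while an OS update that changes the fingerprint yields a new key and a single recompile.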

Zero-Copy Memory Management

Hardware buffer interoperability is the primary mechanism for reducing latency. LiteRT uses AHardwareBuffer to implement zero-copy execution: the NPU accesses the data in place, through a buffer mapped into its own memory space, rather than working on a copy. This architectural decision avoids expensive data round-trips to CPU memory during AI inference, freeing the CPU for concurrent application logic.
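The difference between sharing a buffer and copying it can be shown with a small analogy in plain Python. This sketch does not use AHardwareBuffer or any LiteRT API. It only contrasts a view onto existing memory with an independent copy, and the function names are hypothetical.

```python
def make_shared_buffer(size: int) -> bytearray:
    """Stand-in for a buffer that a producer and a consumer both access."""
    return bytearray(size)


def zero_copy_view(buf: bytearray) -> memoryview:
    """Return a view onto the same memory; no bytes are duplicated."""
    return memoryview(buf)


def copied_snapshot(buf: bytearray) -> bytes:
    """Return an independent copy; later writes to buf are not visible."""
    return bytes(buf)


if __name__ == "__main__":
    frame = make_shared_buffer(4)
    view = zero_copy_view(frame)
    snapshot = copied_snapshot(frame)

    frame[0] = 255  # the producer writes new data into the buffer

    # The view reflects the write; the snapshot still holds the old value.
    print(view[0], snapshot[0])
```

The view sees the producer's write immediately because both names refer to the same memory, which is the property zero-copy execution relies on. The snapshot shows the cost of the alternative: an extra allocation and a stale copy.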

Hardware Expansion and Benchmarks

The framework extends beyond mobile devices into AI PCs and industrial edge hardware. LiteRT integrates with OpenVINO for Intel Core Ultra series 2 and 3 processors. For industrial IoT applications, it supports the Qualcomm Dragonwing IQ8 Series, which powers robotics hardware like the Arduino VENTUNO Q.

Google recorded substantial performance gains across specific on-device workloads. Gemma 3 1B achieves a 3x performance gain over GPU execution during the prefill stage on a Samsung Galaxy S25 Ultra. On the Qualcomm Snapdragon 8 Elite, supported operations see up to a 100x speedup over CPU execution and a 10x speedup compared to the GPU.

Application developers are using these latency margins to deploy heavier models without overheating the host device. Google Meet deployed an Ultra-HD segmentation model that is 25x larger than its predecessor while maintaining the thermal headroom required for 30-minute video sessions. Epic Games uses the NPU in its Live Link Face Android app to hit 30 FPS for real-time computational facial solving.

Review the LiteRT repository for API integration patterns. When preparing to run LLMs locally, profile your models in the Google AI Edge Portal first to determine the most effective compilation strategy for your target device matrix.
