
Google Graduates LiteRT NPU Acceleration to Production

Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.

Google has promoted the NPU acceleration capabilities in LiteRT to full production. Released on April 23, 2026, the updated framework provides a unified API that abstracts vendor-specific NPU SDKs. Developers can now deploy high-performance on-device models across Qualcomm, MediaTek, and Google Tensor hardware through a single workflow.
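To try that workflow, you first add the LiteRT runtime to an Android project. A minimal Gradle sketch, assuming the `com.google.ai.edge.litert` Maven coordinates; the version below is a placeholder, so check the LiteRT repository for the current release:

```kotlin
// build.gradle.kts (app module)
dependencies {
    // Assumed artifact coordinates; the version number is a placeholder.
    implementation("com.google.ai.edge.litert:litert:2.0.0")
}
```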

Compilation Strategies

LiteRT introduces two primary compilation paths for NPU execution. The choice depends on your target deployment environment and whether you prioritize initialization speed or cross-device compatibility. The Google AI Edge Portal provides a benchmarking service covering over 100 popular mobile phones to help you evaluate these options against real-world hardware.

| Strategy | Best For | Technical Advantage |
| --- | --- | --- |
| Offline (AOT) | Known target SoC | Reduces initialization costs and memory footprint |
| Online (JIT) | Platform-agnostic distribution | Minimizes latency via caching on subsequent runs |

Online compilation caches the resulting execution graph and triggers recompilation only when the underlying vendor plugins or the Android device fingerprint change. This makes JIT a practical default for applications distributed through standard app stores.
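As a sketch of the single workflow in practice, the Kotlin `CompiledModel` API lets you request the NPU the same way you would request the GPU or CPU. The class and package names below follow Google's published LiteRT Kotlin samples and may differ in your release; the asset name is hypothetical:

```kotlin
import android.content.Context
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

fun runOnNpu(context: Context, inputTensor: FloatArray): FloatArray {
    // JIT path: the model is compiled for this device's NPU at load time.
    // LiteRT caches the compiled graph, so later launches skip this step
    // unless the vendor plugin or device fingerprint changes.
    val model = CompiledModel.create(
        context.assets,
        "segmentation.tflite",                  // hypothetical asset name
        CompiledModel.Options(Accelerator.NPU), // request NPU execution
    )

    // Allocate I/O buffers once; in real code, reuse them across frames.
    val inputs = model.createInputBuffers()
    val outputs = model.createOutputBuffers()

    inputs[0].writeFloat(inputTensor)
    model.run(inputs, outputs)
    return outputs[0].readFloat()
}
```

Reusing the same input and output buffers across frames also sets up the zero-copy path described next.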

Zero-Copy Memory Management

Hardware buffer interoperability is the primary mechanism for reducing latency. LiteRT uses AHardwareBuffer to implement zero-copy execution: the NPU reads and writes shared hardware buffers in place rather than copying tensors through CPU memory. This architectural decision avoids expensive data round-trips during AI inference, freeing the CPU for concurrent application logic.
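A rough sketch of the buffer handoff, assuming a camera-style pipeline. `android.hardware.HardwareBuffer` is the real SDK wrapper over the NDK's `AHardwareBuffer`; the LiteRT binding in the trailing comments is hypothetical and shown only to illustrate where the zero-copy handoff would occur:

```kotlin
import android.hardware.HardwareBuffer

// Allocate a buffer that the camera pipeline and the NPU can both map.
// HardwareBuffer (API 26+) is the SDK wrapper around AHardwareBuffer.
val frame: HardwareBuffer = HardwareBuffer.create(
    256, 256,                 // width x height of the input frame
    HardwareBuffer.RGBA_8888, // pixel format
    1,                        // layer count
    HardwareBuffer.USAGE_CPU_WRITE_OFTEN or
        HardwareBuffer.USAGE_GPU_SAMPLED_IMAGE,
)

// Hypothetical handoff, for illustration only: LiteRT wraps the buffer
// as a tensor and the NPU reads it in place, so the frame never
// round-trips through CPU-side tensors.
// val input = TensorBuffer.fromHardwareBuffer(frame)
// model.run(listOf(input), outputs)
```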

Hardware Expansion and Benchmarks

The framework extends beyond mobile devices into AI PCs and industrial edge hardware. LiteRT integrates with OpenVINO for Intel Core Ultra series 2 and 3 processors. For industrial IoT applications, it supports the Qualcomm Dragonwing IQ8 Series, which powers robotics hardware like the Arduino VENTUNO Q.

Google recorded substantial performance gains across specific on-device workloads. Gemma 3 1B achieves a 3x performance gain over GPU execution during the prefill stage on a Samsung Galaxy S25 Ultra. On the Qualcomm Snapdragon 8 Elite, supported operations see up to a 100x speedup over CPU execution and a 10x speedup compared to the GPU.

Application developers are using these performance margins to deploy heavier models without melting the host device. Google Meet shipped an Ultra-HD segmentation model that is 25x larger than its predecessor while maintaining the thermal headroom required for 30-minute video sessions. Epic Games uses the NPU in its Live Link Face Android app to hit 30 FPS for real-time facial solving.

Review the LiteRT repository for API integration patterns. When preparing to run LLMs locally, profile your models in the Google AI Edge Portal first to determine the most effective compilation strategy for your target device matrix.
