Google AI Edge Taps Arm SME2 for 5x Faster CPU Inference
Google and Arm have integrated SME2 micro-kernels into LiteRT, accelerating on-device generative AI workloads by up to 5x without custom assembly code.
On May 14, 2026, Google and Arm detailed a major optimization to the Google AI Edge software stack that turns standard device CPUs into high-performance AI accelerators. By leveraging the Arm Scalable Matrix Extension 2 (SME2), the updated stack delivers up to 5x faster inference for matrix-heavy generative AI workloads. For developers building on-device applications, this update allows complex models to run entirely on the CPU, leaving the device's NPU and GPU free for other concurrent workloads.
Arm KleidiAI and XNNPACK Integration
The architectural breakthrough comes from integrating Arm KleidiAI micro-kernels directly into XNNPACK, the underlying execution engine for LiteRT (formerly TensorFlow Lite). On the hardware side, SME2 adds a dedicated matrix-compute unit to the CPU cluster and exposes it through Matrix Outer Product Accumulate (MOPA) instructions, which execute matrix multiplications significantly faster than standard NEON SIMD code.
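To make the outer-product idea concrete, the NumPy sketch below shows how a matrix product tile can be built as a sum of rank-1 updates, which is the decomposition an outer-product-accumulate unit exploits. This is purely a conceptual illustration, not SME2 or KleidiAI code, and the tile sizes are arbitrary.

```python
import numpy as np

# Illustrative only: building a GEMM tile as a sum of rank-1 (outer-product)
# updates, the same decomposition an outer-product-accumulate unit uses.
M, K, N = 8, 32, 8                         # arbitrary tile sizes for the sketch
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

acc = np.zeros((M, N), dtype=np.float32)   # accumulator tile
for k in range(K):
    # One outer product per step: column k of A times row k of B,
    # accumulated into the tile.
    acc += np.outer(A[:, k], B[k, :])

assert np.allclose(acc, A @ B, atol=1e-4)  # matches a standard matrix multiply
```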
At the software layer, LiteRT automatically identifies math-intensive kernels, specifically iGeMM and GeMM, and delegates them to the SME2 hardware via KleidiAI. This abstraction means developers building AI inference pipelines do not need to write custom assembly code to access the hardware acceleration.
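Because the delegation happens inside XNNPACK, a plain CPU inference call is all that is required on the developer's side. The following is a minimal sketch assuming the LiteRT Python bindings (`ai_edge_litert`) and a placeholder model file name; the classic `tf.lite.Interpreter` API works the same way if that package is unavailable.

```python
import numpy as np
# Assumes the LiteRT Python bindings; tf.lite.Interpreter exposes the same API.
from ai_edge_litert.interpreter import Interpreter

# Standard CPU inference: XNNPACK (with KleidiAI micro-kernels on SME2 silicon)
# is used automatically, so no delegate setup or assembly code is needed.
interpreter = Interpreter(model_path="model.tflite", num_threads=1)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
```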
Stable Audio Open Small Benchmarks
Google and Arm benchmarked the optimization using Stability AI’s stable-audio-open-small, a 341-million parameter model. The deployment follows a strict “Convert, Optimize, and Deploy” pipeline. PyTorch models are first converted to the LiteRT format using the LiteRT-torch tool. Next, the AI Edge Quantizer applies mixed-precision (FP16/Int8) transformations to compress the model.
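A rough sketch of the conversion step is shown below, assuming the `ai-edge-torch` Python package as the PyTorch-to-LiteRT converter and a toy stand-in module; the package name, call signature, and quantizer invocation are assumptions and may differ from the exact LiteRT-torch tooling named above.

```python
import torch
import ai_edge_torch  # PyTorch -> LiteRT converter (package name assumed)

# Toy module standing in for stable-audio-open-small's network components.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 64),
).eval()

sample_input = (torch.randn(1, 64),)

# Step 1: convert the eager PyTorch model to the LiteRT flatbuffer format.
edge_model = ai_edge_torch.convert(model, sample_input)
edge_model.export("model.tflite")

# Step 2 (not shown): run the exported file through the AI Edge Quantizer
# to apply the mixed-precision FP16/Int8 recipe described above.
```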
The quantization step achieved a 4x reduction in memory usage, enabling the high-quality audio model to load entirely into constrained mobile RAM. During generation benchmarks, the SME2 optimizations cut audio generation times by more than half on both tested architectures.
| Hardware Architecture | Baseline Inference Time | SME2 Optimized Time |
|---|---|---|
| Apple MacBook M4 | 10.0 seconds | 4.3 seconds |
| Android SME2 (Exynos 2600) | 14.0 seconds | 6.6 seconds |
Running on a single thread, the optimized Android device generated 11 seconds of audio in under 8 seconds, a real-time factor of roughly 1.4x, crossing the threshold required for real-time interactive audio generation.
Hardware Scope and Gemma 4 Optimization
The optimization stack is currently live for developers targeting Apple M4-based devices and the iPhone 16 series. Broader Android availability will follow the rollout of upcoming SME2-enabled silicon, specifically targeting the Samsung Exynos 2600 with C1-Ultra and C1-Pro cores.
Google is already expanding the footprint of this CPU-based execution model. The company is actively optimizing the Gemma 4 family for the AI Edge stack to support 128,000-token context windows entirely offline on mobile devices.
If you deploy generative models to edge devices, the LiteRT and SME2 integration changes your application's resource budget. Profiling your matrix-heavy workloads on the CPU with KleidiAI enabled will show how much NPU and GPU capacity you can reclaim for other application features.
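As a starting point, a simple single-threaded timing harness along the following lines can quantify the before/after on your own models. The model path, run count, and shapes are placeholders, and the `ai_edge_litert` import is an assumption (swap in `tf.lite.Interpreter` if needed).

```python
import time
import numpy as np
from ai_edge_litert.interpreter import Interpreter  # or tf.lite.Interpreter

def profile(model_path: str, runs: int = 20, num_threads: int = 1) -> float:
    """Return the average invoke() latency in seconds for a LiteRT model on CPU."""
    interpreter = Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

    # Warm-up run so one-time allocation and kernel selection are excluded.
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

print(f"avg latency: {profile('model.tflite'):.3f} s")
```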