Google AI Edge Taps Arm SME2 for 5x Faster CPU Inference
Google and Arm have integrated SME2 micro-kernels into LiteRT, accelerating on-device generative AI workloads by up to 5x without custom assembly code.
On May 14, 2026, Google and Arm detailed a major optimization to the Google AI Edge software stack that turns standard device CPUs into high-performance AI accelerators. By leveraging the Arm Scalable Matrix Extension 2 (SME2), the updated stack delivers up to 5x faster inference for matrix-heavy generative AI workloads. For developers building on-device applications, this update allows complex models to run entirely on the CPU, leaving the device's NPU and GPU free for other concurrent workloads.
Arm KleidiAI and XNNPACK Integration
The architectural breakthrough comes from integrating Arm KleidiAI micro-kernels directly into XNNPACK, the underlying execution engine for LiteRT (formerly TensorFlow Lite). On the hardware side, SME2 adds a dedicated matrix-compute unit to the CPU cluster and exposes it through Matrix Outer Product Accumulate (MOPA) instructions, which execute matrix multiplications significantly faster than standard NEON SIMD code.
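To make the outer-product idea concrete, the NumPy sketch below shows how a matrix product tile can be built as a sum of rank-1 updates, which is the decomposition an outer-product-accumulate unit exploits. This is purely a conceptual illustration, not SME2 or KleidiAI code, and the tile sizes are arbitrary.

```python
import numpy as np

# Illustrative only: building a GEMM tile as a sum of rank-1 (outer-product)
# updates, the same decomposition an outer-product-accumulate unit uses.
M, K, N = 8, 32, 8                         # arbitrary tile sizes for the sketch
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

acc = np.zeros((M, N), dtype=np.float32)   # accumulator tile
for k in range(K):
    # One outer product per step: column k of A times row k of B,
    # accumulated into the tile.
    acc += np.outer(A[:, k], B[k, :])

assert np.allclose(acc, A @ B, atol=1e-4)  # matches a standard matrix multiply
```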
At the software layer, LiteRT automatically identifies math-intensive kernels, specifically iGeMM and GeMM, and delegates them to the SME2 hardware via KleidiAI. This abstraction means developers building AI inference pipelines do not need to write custom assembly code to access the hardware acceleration.
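Because the delegation happens inside XNNPACK, a plain CPU inference call is all that is required on the developer's side. The following is a minimal sketch assuming the LiteRT Python bindings (`ai_edge_litert`) and a placeholder model file name; the classic `tf.lite.Interpreter` API works the same way if that package is unavailable.

```python
import numpy as np
# Assumes the LiteRT Python bindings; tf.lite.Interpreter exposes the same API.
from ai_edge_litert.interpreter import Interpreter

# Standard CPU inference: XNNPACK (with KleidiAI micro-kernels on SME2 silicon)
# is used automatically, so no delegate setup or assembly code is needed.
interpreter = Interpreter(model_path="model.tflite", num_threads=1)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
```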
Stable Audio Open Small Benchmarks
Google and Arm benchmarked the optimization using Stability AI’s stable-audio-open-small, a 341-million parameter model. The deployment follows a strict “Convert, Optimize, and Deploy” pipeline. PyTorch models are first converted to the LiteRT format using the LiteRT-torch tool. Next, the AI Edge Quantizer applies mixed-precision (FP16/Int8) transformations to compress the model.
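A rough sketch of the conversion step is shown below, assuming the `ai-edge-torch` Python package as the PyTorch-to-LiteRT converter and a toy stand-in module; the package name, call signature, and quantizer invocation are assumptions and may differ from the exact LiteRT-torch tooling named above.

```python
import torch
import ai_edge_torch  # PyTorch -> LiteRT converter (package name assumed)

# Toy module standing in for stable-audio-open-small's network components.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 64),
).eval()

sample_input = (torch.randn(1, 64),)

# Step 1: convert the eager PyTorch model to the LiteRT flatbuffer format.
edge_model = ai_edge_torch.convert(model, sample_input)
edge_model.export("model.tflite")

# Step 2 (not shown): run the exported file through the AI Edge Quantizer
# to apply the mixed-precision FP16/Int8 recipe described above.
```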
The quantization step achieved a 4x reduction in memory usage, enabling the high-quality audio model to load entirely into constrained mobile RAM. During generation benchmarks, the SME2 optimizations cut audio generation times by more than half on both tested architectures.
| Hardware Architecture | Baseline Inference Time | SME2 Optimized Time |
|---|---|---|
| Apple MacBook M4 | 10.0 seconds | 4.3 seconds |
| Android SME2 (Exynos 2600) | 14.0 seconds | 6.6 seconds |
Running on a single thread, the optimized Android device generated 11 seconds of audio in under 8 seconds, a real-time factor of roughly 1.4x, crossing the threshold required for real-time interactive audio generation.
Hardware Scope and Gemma 4 Optimization
The optimization stack is currently live for developers targeting Apple M4-based devices and the iPhone 16 series. Broader Android availability will follow the rollout of upcoming SME2-enabled silicon, specifically targeting the Samsung Exynos 2600 with C1-Ultra and C1-Pro cores.
Google is already expanding the footprint of this CPU-based execution model. The company is actively optimizing the Gemma 4 family for the AI Edge stack to support 128,000-token context windows entirely offline on mobile devices.
If you deploy generative models to edge devices, the LiteRT and SME2 integration changes your application's resource budget. Profiling your matrix-heavy workloads on the CPU with KleidiAI enabled will show how much NPU and GPU capacity you can reclaim for other application features.
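As a starting point, a simple single-threaded timing harness along the following lines can quantify the before/after on your own models. The model path, run count, and shapes are placeholders, and the `ai_edge_litert` import is an assumption (swap in `tf.lite.Interpreter` if needed).

```python
import time
import numpy as np
from ai_edge_litert.interpreter import Interpreter  # or tf.lite.Interpreter

def profile(model_path: str, runs: int = 20, num_threads: int = 1) -> float:
    """Return the average invoke() latency in seconds for a LiteRT model on CPU."""
    interpreter = Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

    # Warm-up run so one-time allocation and kernel selection are excluded.
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

print(f"avg latency: {profile('model.tflite'):.3f} s")
```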