How to Run Gemma 4 On-Device with LiteRT-LM

Google’s production-ready LiteRT-LM orchestration stack lets you run the Gemma 4 model family directly on mobile devices, MacBooks, and in the browser without a server. By combining Multi-Token Prediction with specialized memory management, developers can run complex reasoning tasks on-device; on the mobile targets benchmarked below, the E2B model's peak memory stays well under 1GB. LiteRT-LM also removes the need to build custom wrappers for tokenization, chat templating, and session management when deploying generative models to edge devices.

Architecture and Memory Management

LiteRT-LM operates as an orchestration layer positioned directly above the core LiteRT runtime, handling the conversational scaffolding that raw inference engines lack. Prior to this release, developers deploying large models to edge environments had to manually implement BPE and SentencePiece tokenizers, map chat templates, and manage context windows. LiteRT-LM integrates these natively into the runtime stack.

The framework achieves high memory efficiency through XNNPACK’s weight caching and per-layer embeddings. When executing the 2.58GB Gemma 4 E2B model, the runtime holds the physical memory footprint to 607MB on the iPhone 17 Pro CPU. This reduction helps background applications run large models locally without triggering system out-of-memory terminations.
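As a quick check on those figures, the share of the model that stays resident in memory can be worked out directly. This is plain arithmetic on the numbers quoted above, and it assumes 1GB = 1000MB (the source does not say whether the sizes are decimal or binary).

```python
# Figures quoted in the text for Gemma 4 E2B on the iOS CPU backend.
MODEL_SIZE_MB = 2.58 * 1000  # 2.58GB model, assuming 1GB = 1000MB
PEAK_MEMORY_MB = 607         # reported physical memory footprint

resident_fraction = PEAK_MEMORY_MB / MODEL_SIZE_MB
reduction_factor = MODEL_SIZE_MB / PEAK_MEMORY_MB

print(f"Resident fraction: {resident_fraction:.1%}")  # Resident fraction: 23.5%
print(f"Reduction factor: {reduction_factor:.2f}x")   # Reduction factor: 4.25x
```

In other words, under that assumption roughly a quarter of the model's nominal size is resident at peak.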

Selecting the Right Gemma 4 Model

The Gemma 4 family, released under the Apache 2.0 license, is structured to support distinct edge computing targets. Selecting the correct variant is critical for balancing latency and reasoning capabilities within your target device constraints.

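Gemma 4 E2B (2B parameters): Designed for IoT hardware and mid-tier mobile devices. It is described as requiring less than 1.5GB of memory, although the benchmarks below report peak usage slightly above that on the Raspberry Pi 5 CPU and MacBook Pro M4 GPU backends.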
Gemma 4 E4B (4B parameters): Built for premium mobile architectures and Apple Silicon, offering higher reasoning accuracy for desktop applications.
Gemma 4 26B MoE: A Mixture-of-Experts architecture utilizing only 4B active parameters per forward pass. It delivers the knowledge breadth of a 26B parameter model while operating at the inference speed of a standard 4B model.

Hardware Performance and Benchmarks

Inference speed scales according to the target hardware and the compute backend utilized. The following table details the official prefill and decode metrics for the Gemma 4 E2B model across standard deployment environments.

Platform / Device	Backend	Prefill (tokens/s)	Decode (tokens/s)	Peak Memory (MB)
Android (S26 Ultra)	GPU	3808	52	676
iOS (iPhone 17 Pro)	CPU	532	25	607
MacBook Pro M4	GPU	7835	160	1623
Raspberry Pi 5	CPU	133	7.8	1546
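
To turn these throughput figures into a rough response time, divide the prompt length by the prefill rate and the output length by the decode rate. The sketch below applies that to the table; the 1024-token prompt and 256-token response are illustrative assumptions, and the estimate ignores model load time, thermal throttling, and any speculative decoding gains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    prefill_tps: float  # prompt tokens processed per second
    decode_tps: float   # output tokens generated per second


# Gemma 4 E2B figures copied from the table above.
BENCHMARKS = {
    "Android (S26 Ultra), GPU": Benchmark(3808, 52),
    "iOS (iPhone 17 Pro), CPU": Benchmark(532, 25),
    "MacBook Pro M4, GPU": Benchmark(7835, 160),
    "Raspberry Pi 5, CPU": Benchmark(133, 7.8),
}


def estimate_latency_s(bench: Benchmark, prompt_tokens: int, output_tokens: int) -> float:
    """Back-of-envelope latency: prefill time plus decode time."""
    return prompt_tokens / bench.prefill_tps + output_tokens / bench.decode_tps


for name, bench in BENCHMARKS.items():
    total = estimate_latency_s(bench, prompt_tokens=1024, output_tokens=256)
    print(f"{name}: {total:.1f} s")
```

For typical chat workloads the decode term dominates, which is why the decode column matters more than the headline prefill numbers.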

Advanced Orchestration Capabilities

LiteRT-LM includes built-in orchestration tools that manage output generation and continuity for sophisticated agentic workflows operating without cloud connectivity.

Multi-Token Prediction and Speed

The runtime integrates Multi-Token Prediction (MTP) and speculative decoding. Instead of generating a single token per pass, the model predicts several upcoming tokens at once, speeding up on-device inference by up to 2.2x. This requires configuring the runtime to support larger speculative batch sizes, which marginally increases peak memory consumption.
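
The general draft-and-verify idea behind speculative decoding can be sketched with toy models. The code below is not LiteRT-LM's implementation or API: `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap drafter, and the verification loop simulates what a real runtime would do in a single batched forward pass.

```python
def target_next(context):
    """Toy 'large model': the next token is a deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy drafter: agrees with the target unless the context length is a multiple of 4."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt, num_tokens):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, num_tokens, k=3):
    """Greedy speculative decoding: draft k tokens, verify them, keep the matching prefix."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The drafter proposes k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. One (simulated) target pass checks every drafted position.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target pass yields one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes


tokens, passes = speculative_decode([1, 2, 3], num_tokens=12)
print(tokens == greedy_decode([1, 2, 3], 12), passes)
```

Because every accepted token is one the target model would have chosen anyway, the output matches plain greedy decoding while needing fewer target passes. The achievable speedup depends on how often the drafter agrees with the target.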

Constrained Decoding and Thinking Mode

Applications requiring strict programmatic integration can enable Constrained Decoding to guarantee that output conforms to a required JSON structure. This prevents schema drift during long-context generation. The framework also supports a Thinking Mode, which spends additional compute on step-by-step reasoning before generating a final response, mimicking the behavior of server-side inference engines.
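
Constrained decoding works by masking out, at every step, any token that would make the output invalid. The sketch below shows the principle at character level with a tiny enumerated set of valid outputs. It is a simplification, not the LiteRT-LM API: production decoders typically compile a schema into a grammar or state machine over the model's real vocabulary, and `toy_scores` is a hypothetical stand-in for model logits.

```python
import json

# Hypothetical schema: an object with one enum-valued field.
ALLOWED_OUTPUTS = ['{"sentiment": "positive"}', '{"sentiment": "negative"}']
VOCAB = 'abcdefghijklmnopqrstuvwxyz{}": '


def allowed_next_chars(prefix, allowed_outputs):
    """Characters that keep the text a prefix of at least one valid output."""
    return {
        s[len(prefix)]
        for s in allowed_outputs
        if s.startswith(prefix) and len(s) > len(prefix)
    }


def toy_scores(prefix):
    """Stand-in for model logits: arbitrary scores that ignore JSON syntax."""
    return {c: (ord(c) * 7 + len(prefix)) % 13 for c in VOCAB}


def constrained_decode(allowed_outputs):
    text = ""
    while text not in allowed_outputs:
        allowed = allowed_next_chars(text, allowed_outputs)
        scores = toy_scores(text)
        # Pick the best-scoring character among the allowed ones only.
        text += max(sorted(allowed), key=lambda c: scores[c])
    return text


result = constrained_decode(ALLOWED_OUTPUTS)
print(json.loads(result))
```

Whatever the toy scores prefer, the result always parses as JSON and always carries one of the permitted values, which is the guarantee the feature is meant to provide.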

Session Save and Restore

For continuous applications, LiteRT-LM provides Session Save and Restore functionality. You can serialize the KV cache state to local storage and resume long-context conversations later without needing to reprocess the entire prompt history. This drastically reduces battery consumption and prefill times for intermittent background agents.
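
The saving comes from skipping prefill for tokens that were already processed. The toy session below illustrates this with a JSON file standing in for a serialized KV cache. `ToySession` and its methods are hypothetical and only model the bookkeeping; a real cache holds key and value tensors, and the actual LiteRT-LM calls are described in its documentation.

```python
import json
import os
import tempfile


class ToySession:
    def __init__(self):
        self.kv_cache = []      # one entry per processed token (stand-in for K/V tensors)
        self.prefill_count = 0  # tokens this object actually ran through the "model"

    def feed(self, tokens):
        for tok in tokens:
            self.kv_cache.append({"token": tok, "position": len(self.kv_cache)})
            self.prefill_count += 1

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.kv_cache, f)

    @classmethod
    def restore(cls, path):
        session = cls()
        with open(path, encoding="utf-8") as f:
            session.kv_cache = json.load(f)  # cache is reloaded, not recomputed
        return session


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.json")

    first = ToySession()
    first.feed(["Summarize", "my", "unread", "messages"])
    first.save(path)

    resumed = ToySession.restore(path)
    resumed.feed(["Any", "updates", "?"])

print(len(resumed.kv_cache), resumed.prefill_count)  # 7 3
```

The resumed session holds all seven tokens of context but only processed the three new ones.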

Platform APIs and Deployment

LiteRT-LM expands deployment far beyond Android environments. Apple developers can implement the engine using native Swift APIs, integrating directly with iOS and macOS application lifecycles. Web developers can access the runtime through WebGPU-accelerated JavaScript APIs, allowing serverless browser inference with near-native performance.

For API parameters and backend initialization flags, refer to the LiteRT-LM documentation. Additionally, developers working with the Pixel 10 can use the Google Tensor SDK Beta to offload specific compute graphs directly to the device’s TPU, unlocking real-time inference for high-bandwidth tasks.

Tradeoffs and Limitations

Deploying large models to edge hardware requires strict power and thermal management. While the CPU backend on iOS achieves a 607MB memory footprint, its decode speed tops out at 25 tokens per second. The GPU backends in the benchmarks deliver much faster prefill but report higher peak memory, up to 1623MB on the MacBook Pro M4. You must benchmark your specific application logic against your target device’s thermal throttling limits, as sustained generation can degrade performance over time.
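
One way to spot throttling is to record a timestamp for each generated token and compare throughput across consecutive time windows. The helper below is a minimal sketch of that analysis. The timestamps are synthetic, and on a device they would come from something like `time.monotonic()` inside your own token callback (an assumption about how your app is wired, not a LiteRT-LM API).

```python
def windowed_throughput(timestamps, window_s):
    """Tokens per second in consecutive windows of window_s seconds.

    The final window may be only partly covered, so its rate can read low.
    """
    if not timestamps:
        return []
    start = timestamps[0]
    num_windows = int((timestamps[-1] - start) // window_s) + 1
    counts = [0] * num_windows
    for t in timestamps:
        counts[int((t - start) // window_s)] += 1
    return [c / window_s for c in counts]


# Synthetic run: 10 s at 25 tokens/s, then 10 s at 15 tokens/s after throttling.
timestamps = [i / 25 for i in range(250)] + [10 + i / 15 for i in range(150)]

rates = windowed_throughput(timestamps, window_s=10)
print(rates)                                         # [25.0, 15.0]
print(f"Retained throughput: {rates[-1] / rates[0]:.0%}")  # Retained throughput: 60%
```

Running the same analysis on a multi-minute generation on real hardware shows whether the benchmark figures hold up under sustained load.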

To begin deployment, download the Gemma 4 weights from the Google AI Edge Gallery and configure your application build scripts to target the LiteRT-LM runtime layer.
