How to Run Gemma 4 On-Device with LiteRT-LM
Learn how to configure LiteRT-LM to deploy the Gemma 4 model family locally across mobile, desktop, and edge environments with constrained JSON decoding.
Google’s production-ready LiteRT-LM orchestration stack enables you to run the Gemma 4 model family directly on local mobile devices, MacBooks, and serverless browser environments. By combining Multi-Token Prediction with specialized memory management, developers can execute complex reasoning tasks on-device while maintaining a physical memory footprint well under 1GB. This removes the need to build custom wrappers for tokenization, chat templating, and session management when deploying generative models to edge devices.
Architecture and Memory Management
LiteRT-LM operates as an orchestration layer positioned directly above the core LiteRT runtime, handling the conversational scaffolding that raw inference engines lack. Prior to this release, developers deploying large models to edge environments had to manually implement BPE and SentencePiece tokenizers, map chat templates, and manage context windows. LiteRT-LM integrates these natively into the runtime stack.
The framework achieves high memory efficiency by utilizing XNNPACK’s weight caching and per-layer embeddings. When executing the 2.58GB Gemma 4 E2B model, the runtime compresses the physical memory footprint down to 607MB on Apple mobile CPUs. This aggressive compression allows background applications to run large models locally without triggering system out-of-memory terminations.
Selecting the Right Gemma 4 Model
The Gemma 4 family, released under the Apache 2.0 license, is structured to support distinct edge computing targets. Selecting the correct variant is critical for balancing latency and reasoning capabilities within your target device constraints.
- Gemma 4 E2B (2B parameters): Designed for IoT hardware and mid-tier mobile devices. It requires less than 1.5GB of total system RAM.
- Gemma 4 E4B (4B parameters): Built for premium mobile architectures and Apple Silicon, offering higher reasoning accuracy for desktop applications.
- Gemma 4 26B MoE: A Mixture-of-Experts architecture utilizing only 4B active parameters per forward pass. It delivers the knowledge breadth of a 26B parameter model while operating at the inference speed of a standard 4B model.
Hardware Performance and Benchmarks
Inference speed scales according to the target hardware and the compute backend utilized. The following table details the official prefill and decode metrics for the Gemma 4 E2B model across standard deployment environments.
| Platform / Device | Backend | Prefill (tk/s) | Decode (tk/s) | Peak Memory (MB) |
|---|---|---|---|---|
| Android (S26 Ultra) | GPU | 3808 | 52 | 676 |
| iOS (iPhone 17 Pro) | CPU | 532 | 25 | 607 |
| MacBook Pro M4 | GPU | 7835 | 160 | 1623 |
| Raspberry Pi 5 | CPU | 133 | 7.8 | 1546 |
Advanced Orchestration Capabilities
LiteRT-LM includes built-in orchestration tools that manage output generation and continuity for sophisticated agentic workflows operating without cloud connectivity.
Multi-Token Prediction and Speed
The runtime integrates Multi-Token Prediction (MTP) and speculative decoding. Instead of generating a single token per pass, the model projects multiple subsequent tokens simultaneously, achieving up to a 2.2x speedup in on-device inference latency. This requires configuring the runtime to support larger speculative batch sizes, which marginally increases peak memory consumption.
Constrained Decoding and Thinking Mode
Applications requiring strict programmatic integration can enable Constrained Decoding to enforce guaranteed constrained JSON output. This prevents schema drift during long context generation. The framework also supports a Thinking Mode, which allocates additional compute cycles for step-by-step reasoning before generating a final response, mimicking the operation of server-side inference engines.
Session Save and Restore
For continuous applications, LiteRT-LM provides Session Save and Restore functionality. You can serialize the KV cache state to local storage and resume long-context conversations later without needing to reprocess the entire prompt history. This drastically reduces battery consumption and prefill times for intermittent background agents.
Platform APIs and Deployment
LiteRT-LM expands deployment far beyond Android environments. Apple developers can implement the engine using native Swift APIs, integrating directly with iOS and macOS application lifecycles. Web developers can access the runtime through WebGPU-accelerated JavaScript APIs, allowing serverless browser inference with near-native performance.
For specific implementation parameters, parameter lists, and backend initialization flags, refer to the LiteRT-LM documentation. Additionally, developers working with the Pixel 10 can utilize the Google Tensor SDK Beta to offload specific compute graphs directly to the device’s TPU, unlocking real-time inference for high-bandwidth tasks.
Tradeoffs and Limitations
Deploying large models to edge hardware requires strict power and thermal management. While the CPU backend on iOS achieves an impressive 607MB memory footprint, decode speeds plateau at 25 tokens per second. Relying entirely on GPU backends yields significantly faster prefill rates but increases total memory allocation, as seen on the MacBook Pro M4 pulling 1623MB. You must benchmark your specific application logic against your target device’s thermal throttling limits, as sustained generation will degrade performance over time.
To begin deployment, download the Gemma 4 weights from the Google AI Edge Gallery and configure your application build scripts to target the LiteRT-LM runtime layer.
Get Insanely Good at AI
The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
Keep Reading
AI Edge Gallery for Android Gains On-Device MCP and Gemma 4
Google updated the AI Edge Gallery Android app with experimental Model Context Protocol support, enabling on-device Gemma 4 models to use external web tools.
Google Graduates LiteRT NPU Acceleration to Production
Learn how to configure LiteRT for hardware-accelerated on-device AI inference using Google's production-ready NPU capabilities.
Google AI Edge Eloquent brings free offline dictation to iOS
Google's new AI Edge Eloquent app uses Gemma 4 models to offer high-quality, offline-first transcription and text polishing for free on iPhone.
Gemini Intelligence System Debuts With Googlebooks Platform
Google introduced the Gemini Intelligence system, a unified Android and ChromeOS core powering a new laptop hardware category called Googlebooks.
Native iOS 27 Workloads Can Now Route to Claude and Gemini
Apple's Extensions framework for iOS 27 allows developers to integrate third-party AI models directly into native Siri and Writing Tools workflows.