Ai Engineering 5 min read

How to Run Gemma 4 On-Device with LiteRT-LM

Learn how to configure LiteRT-LM to deploy the Gemma 4 model family locally across mobile, desktop, and edge environments with constrained JSON decoding.

Google’s production-ready LiteRT-LM orchestration stack enables you to run the Gemma 4 model family directly on local mobile devices, MacBooks, and serverless browser environments. By combining Multi-Token Prediction with specialized memory management, developers can execute complex reasoning tasks on-device while maintaining a physical memory footprint well under 1GB. This removes the need to build custom wrappers for tokenization, chat templating, and session management when deploying generative models to edge devices.

Architecture and Memory Management

LiteRT-LM operates as an orchestration layer positioned directly above the core LiteRT runtime, handling the conversational scaffolding that raw inference engines lack. Prior to this release, developers deploying large models to edge environments had to manually implement BPE and SentencePiece tokenizers, map chat templates, and manage context windows. LiteRT-LM integrates these natively into the runtime stack.

The framework achieves high memory efficiency by utilizing XNNPACK’s weight caching and per-layer embeddings. When executing the 2.58GB Gemma 4 E2B model, the runtime compresses the physical memory footprint down to 607MB on Apple mobile CPUs. This aggressive compression allows background applications to run large models locally without triggering system out-of-memory terminations.

Selecting the Right Gemma 4 Model

The Gemma 4 family, released under the Apache 2.0 license, is structured to support distinct edge computing targets. Selecting the correct variant is critical for balancing latency and reasoning capabilities within your target device constraints.

  • Gemma 4 E2B (2B parameters): Designed for IoT hardware and mid-tier mobile devices. It requires less than 1.5GB of total system RAM.
  • Gemma 4 E4B (4B parameters): Built for premium mobile architectures and Apple Silicon, offering higher reasoning accuracy for desktop applications.
  • Gemma 4 26B MoE: A Mixture-of-Experts architecture utilizing only 4B active parameters per forward pass. It delivers the knowledge breadth of a 26B parameter model while operating at the inference speed of a standard 4B model.

Hardware Performance and Benchmarks

Inference speed scales according to the target hardware and the compute backend utilized. The following table details the official prefill and decode metrics for the Gemma 4 E2B model across standard deployment environments.

Platform / DeviceBackendPrefill (tk/s)Decode (tk/s)Peak Memory (MB)
Android (S26 Ultra)GPU380852676
iOS (iPhone 17 Pro)CPU53225607
MacBook Pro M4GPU78351601623
Raspberry Pi 5CPU1337.81546

Advanced Orchestration Capabilities

LiteRT-LM includes built-in orchestration tools that manage output generation and continuity for sophisticated agentic workflows operating without cloud connectivity.

Multi-Token Prediction and Speed

The runtime integrates Multi-Token Prediction (MTP) and speculative decoding. Instead of generating a single token per pass, the model projects multiple subsequent tokens simultaneously, achieving up to a 2.2x speedup in on-device inference latency. This requires configuring the runtime to support larger speculative batch sizes, which marginally increases peak memory consumption.

Constrained Decoding and Thinking Mode

Applications requiring strict programmatic integration can enable Constrained Decoding to enforce guaranteed constrained JSON output. This prevents schema drift during long context generation. The framework also supports a Thinking Mode, which allocates additional compute cycles for step-by-step reasoning before generating a final response, mimicking the operation of server-side inference engines.

Session Save and Restore

For continuous applications, LiteRT-LM provides Session Save and Restore functionality. You can serialize the KV cache state to local storage and resume long-context conversations later without needing to reprocess the entire prompt history. This drastically reduces battery consumption and prefill times for intermittent background agents.

Platform APIs and Deployment

LiteRT-LM expands deployment far beyond Android environments. Apple developers can implement the engine using native Swift APIs, integrating directly with iOS and macOS application lifecycles. Web developers can access the runtime through WebGPU-accelerated JavaScript APIs, allowing serverless browser inference with near-native performance.

For specific implementation parameters, parameter lists, and backend initialization flags, refer to the LiteRT-LM documentation. Additionally, developers working with the Pixel 10 can utilize the Google Tensor SDK Beta to offload specific compute graphs directly to the device’s TPU, unlocking real-time inference for high-bandwidth tasks.

Tradeoffs and Limitations

Deploying large models to edge hardware requires strict power and thermal management. While the CPU backend on iOS achieves an impressive 607MB memory footprint, decode speeds plateau at 25 tokens per second. Relying entirely on GPU backends yields significantly faster prefill rates but increases total memory allocation, as seen on the MacBook Pro M4 pulling 1623MB. You must benchmark your specific application logic against your target device’s thermal throttling limits, as sustained generation will degrade performance over time.

To begin deployment, download the Gemma 4 weights from the Google AI Edge Gallery and configure your application build scripts to target the LiteRT-LM runtime layer.

Get Insanely Good at AI

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.

Keep Reading