
AI Edge Gallery for Android Gains On-Device MCP and Gemma 4

Google updated the AI Edge Gallery Android app with experimental Model Context Protocol support, enabling on-device Gemma 4 models to use external web tools.

On May 19, Google updated the Google AI Edge Gallery Android application with experimental Model Context Protocol support. Unveiled at Google I/O 2026, the release transforms the reference application from a standard chat interface into a testing environment for connected capabilities powered by the Gemma 4 model family.

Model Context Protocol Integration

The Android application now implements experimental support for the Model Context Protocol (MCP) over Streamable HTTP, with iOS support pending. Developers can register an MCP server URL, allowing the application to dynamically load tool definitions and resource schemas directly into the system prompt.
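To make the tool-loading step concrete, here is a minimal Python sketch of the idea rather than the gallery's actual code. It builds the JSON-RPC body for MCP's tools/list method and renders a server's reply into a system prompt. The method name and the name/description/inputSchema fields come from the MCP specification; the prompt layout, the helper names, and the sample fetch_url tool are assumptions made for illustration.

```python
import json


def build_tools_list_request(request_id: int) -> dict:
    """Build the JSON-RPC 2.0 body for MCP's tools/list method.

    Over Streamable HTTP this body would be POSTed to the registered MCP
    server URL after the initialize handshake (omitted in this sketch).
    """
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def render_tools_into_prompt(base_prompt: str, tools_result: dict) -> str:
    """Append each advertised tool to the system prompt.

    The prompt layout is an assumption; the app's real template may differ.
    """
    lines = [base_prompt, "", "Available tools:"]
    for tool in tools_result.get("tools", []):
        schema = json.dumps(tool.get("inputSchema", {}), sort_keys=True)
        description = tool.get("description", "")
        lines.append(f"- {tool['name']}: {description} | input schema: {schema}")
    return "\n".join(lines)


# Hypothetical tools/list result, hard-coded so the sketch runs offline.
SAMPLE_RESULT = {
    "tools": [
        {
            "name": "fetch_url",
            "description": "Fetch the text content of a URL.",
            "inputSchema": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(build_tools_list_request(1)))
    print(render_tools_into_prompt("You are a helpful assistant.", SAMPLE_RESULT))
```

Because the tool list is fetched at registration time, pointing the app at a different server changes the prompt without any change to the application code.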

This architecture draws a strict boundary around on-device reasoning. The tools themselves execute on external servers or cloud endpoints, while the decision to call a tool and the generation of the tool call happen entirely on the device using Gemma 4. The initial verified implementations include a Google Workspace connection for querying Gmail and Calendar, a Google Maps integration for natural language location searches, and a web fetch tool for retrieving live URL content. This setup keeps tool execution separate from the model's reasoning, establishing a clear pattern for how the Model Context Protocol can operate on mobile hardware.
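The split between local decision-making and remote execution can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's implementation: the JSON shape the model emits (a "tool" key and an "arguments" key) and both helper names are hypothetical, while the tools/call method and its name/arguments parameters follow the MCP specification.

```python
import json
from typing import Optional


def parse_tool_call(model_output: str) -> Optional[dict]:
    """Interpret on-device model output as a tool call, if it is one.

    Assumes the model emits {"tool": "<name>", "arguments": {...}} when it
    wants a tool; anything else is treated as a plain-text answer.
    """
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # ordinary chat reply, nothing to send to a server
    if not isinstance(data, dict) or not isinstance(data.get("tool"), str):
        return None
    arguments = data.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return {"name": data["tool"], "arguments": arguments}


def build_tools_call_request(request_id: int, call: dict) -> dict:
    """Wrap the locally generated call in an MCP tools/call JSON-RPC body.

    Only this request leaves the device; the reasoning that produced it
    stayed local.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": call["name"], "arguments": call["arguments"]},
    }


if __name__ == "__main__":
    output = '{"tool": "fetch_url", "arguments": {"url": "https://example.com"}}'
    call = parse_tool_call(output)
    if call is not None:
        print(json.dumps(build_tools_call_request(2, call)))
```

In this sketch the server only ever sees the finished tool call. The conversation and the reasoning that led to it never leave the device.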

Proactive Notifications and Deep Linking

The gallery introduces a “Schedule Notification” capability, shifting the interface from strictly reactive chat to proactive execution. Users can instruct the model to create routines, such as logging a mood every night at 10 PM. The application generates a local notification that, when tapped, deep-links into an active Gemma 4 session preloaded with the relevant task context.
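The scheduling logic behind such a routine can be approximated in a few lines of Python. This is a sketch of the general pattern only: the edgegallery:// URI scheme, the query parameter names, and both function names are hypothetical, and on Android the actual trigger would be handed to a system scheduler rather than computed by hand.

```python
from datetime import datetime, time, timedelta
from urllib.parse import urlencode


def next_daily_trigger(now: datetime, at: time) -> datetime:
    """Return the next occurrence of the wall-clock time `at` after `now`."""
    candidate = datetime.combine(now.date(), at, tzinfo=now.tzinfo)
    if candidate <= now:
        # Today's slot has already passed, so fire tomorrow instead.
        candidate += timedelta(days=1)
    return candidate


def build_deep_link(session_id: str, task: str) -> str:
    """Encode the task context into a deep link (hypothetical URI scheme)."""
    return "edgegallery://chat?" + urlencode({"session": session_id, "task": task})


if __name__ == "__main__":
    now = datetime(2026, 5, 19, 21, 0)
    print(next_daily_trigger(now, time(22, 0)))
    print(build_deep_link("abc", "log mood"))
```

Carrying the task in the link itself means the notification can reopen the right session without the app having to keep a separate lookup table.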

Google positions the repository as an open-source reference for developers building mobile AI experiences. The platform includes a framework for community-contributed agent skills that extend the underlying model. This modularity allows developers to plug in custom tools without rewriting the core application logic.

LiteRT-LM Performance Upgrades

Maintaining long-running workflows requires rapid state restoration. The application uses the updated LiteRT-LM engine to achieve prefill speeds exceeding 3,000 tokens per second on the GPUs of current phones, such as the Samsung S26 Ultra. At that rate, a chat context of several thousand tokens can be rebuilt within a few seconds when a user resumes a backgrounded session.

The system uses the Gemma 4 E2B (Effective 2 Billion) and E4B (Effective 4 Billion) variants. These models are optimized specifically for mobile reasoning and structured JSON outputs. The deployment also incorporates Multi-Token Prediction (MTP), delivering up to a 2.2x decode speedup on mobile hardware to make complex planning tasks feel responsive. Developers can adjust the model's settings and edit custom system prompts directly in the chat settings to test output constraints.
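A quick back-of-the-envelope calculation shows what the throughput figures in this section mean for latency. The Python sketch below uses the 3,000 tokens-per-second prefill rate and the 2.2x MTP speedup quoted above. The baseline decode rate of 20 tokens per second and the token counts are illustrative assumptions, not measurements from the article.

```python
def prefill_seconds(context_tokens: int, prefill_tps: float = 3000.0) -> float:
    """Time to re-ingest a saved context at a given prefill throughput."""
    return context_tokens / prefill_tps


def decode_seconds(output_tokens: int, base_decode_tps: float,
                   mtp_speedup: float = 1.0) -> float:
    """Time to generate a reply, optionally scaled by an MTP speedup factor."""
    return output_tokens / (base_decode_tps * mtp_speedup)


if __name__ == "__main__":
    # Restoring an assumed 12,000-token session at the quoted prefill floor.
    print(f"prefill: {prefill_seconds(12_000):.1f} s")
    # Assumed baseline of 20 tokens/s for a 220-token reply.
    print(f"decode without MTP: {decode_seconds(220, 20.0):.1f} s")
    print(f"decode with 2.2x MTP: {decode_seconds(220, 20.0, 2.2):.1f} s")
```

Under these assumptions, restoring the context takes about four seconds and MTP cuts the reply time from about eleven seconds to about five. Because the 2.2x figure is stated as an upper bound, real gains may be smaller.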

If you build local AI applications, clone the updated GitHub repository to review how Google handles tool calling within a mobile environment. The codebase serves as a direct implementation guide for routing MCP traffic through an Android application layer without relying on cloud-hosted models.
