Google Drops Vision Encoders in Gemma 4 12B Multimodal Release

Google DeepMind’s Gemma 4 12B release introduces a dense multimodal model that discards traditional separate vision and audio encoders. By projecting sensory data directly into the LLM backbone, the 11.95-billion parameter model allows developers to run high-performance multimodal reasoning on standard 16GB consumer hardware.

Unified Transformer Architecture

Traditional multimodal systems rely on large, frozen encoders to process inputs before reaching the text model. Gemma 4 12B uses a Unified Transformer design that eliminates this overhead. It replaces the standard 550-million parameter vision encoder with a lightweight 35-million parameter embedding module. This module projects raw 48x48 pixel patches straight into the LLM’s hidden dimension using a single matrix multiplication and factorized coordinate lookups for spatial positioning.

The audio integration follows the same pipeline. The model ingests raw audio signals and maps them into the exact same dimensional space as text tokens. This makes it the first medium-sized Gemma variant with native audio support, removing the need to bolt on a secondary speech-to-text model.

Component	Traditional Architectures	Gemma 4 12B Unified
Vision Pipeline	550M parameter encoder	35M parameter embedder
Audio Pipeline	300M parameter encoder	Native token projection
Parameters	Variable	11.95 billion
Context Window	Variable	256K tokens

Hardware Constraints and Capabilities

The architecture explicitly targets laptops with 16GB of unified memory or VRAM, including Apple Silicon Macs and enterprise Windows machines. To fit this memory footprint while maintaining its 256K context window, the model requires 8-bit or 4-bit quantization.

Despite the smaller parameter count and lack of heavy encoders, Gemma 4 12B outperforms the previous generation Gemma 3 27B on GPQA and coding benchmarks. The model natively supports function calling and a thinking mode for step-by-step reasoning. To address latency, Google released the model alongside a dedicated Multi-Token Prediction (MTP) drafter model designed to accelerate local inference speeds.

Distribution and Tooling

Gemma 4 12B is licensed under Apache 2.0. The weights are available on Hugging Face under the repository google/gemma-4-12B-it and on Kaggle.

For developers aiming to run Gemma 4 on-device, Google provides support through llama.cpp, Ollama, vLLM, MLX, and LM Studio. The LiteRT-LM CLI introduces a new litert-lm serve command, allowing you to expose an OpenAI-compatible local endpoint immediately. Google also published macOS desktop applications through the Google AI Edge Gallery to handle offline voice dictation and visual analysis without cloud APIs.

If you build local assistants, the unified architecture drastically changes your memory budget. By dropping heavy multi-stage encoders, you reclaim VRAM for broader context windows and native audio processing directly on standard laptops.

Google Drops Vision Encoders in Gemma 4 12B Multimodal Release

Unified Transformer Architecture

Hardware Constraints and Capabilities

Distribution and Tooling

Keep Reading

How to Run Gemma 4 On-Device with LiteRT-LM

AI Edge Gallery for Android Gains On-Device MCP and Gemma 4

Google Graduates LiteRT NPU Acceleration to Production

Google AI Edge Eloquent brings free offline dictation to iOS

Gemma 4 Arrives With Full Apache 2.0 License