
Gemma 4 Arrives With Full Apache 2.0 License

Google releases Gemma 4, a new generation of open models optimized for advanced reasoning, agentic workflows, and high-performance edge deployment.

Google DeepMind’s Gemma 4 release moves the model family to the Apache 2.0 license. The release introduces four variants designed for advanced reasoning and autonomous execution. By dropping the restrictive custom terms of previous generations, Google lets developers deploy, modify, and redistribute these models commercially under the standard Apache 2.0 conditions.

Model Variants and Tiers

The Gemma 4 lineup scales from edge devices to enterprise hardware across two tiers: a workstation tier (31B Dense and 26B A4B) and an edge tier (E4B and E2B). The models are derived directly from the proprietary Gemini 3 architecture.

| Model | Architecture | Context Window | Target Hardware |
| --- | --- | --- | --- |
| 31B Dense | Standard Dense | 256K | Workstation / Cloud |
| 26B A4B | Mixture-of-Experts (~4B active) | 256K | Workstation / Low Latency |
| Effective 4B (E4B) | Compact Dense | 128K | Laptops / High-end Mobile |
| Effective 2B (E2B) | Compact Dense | 128K | Smartphones / IoT |

The flagship 31B Dense model prioritizes raw reasoning capability and ranked third globally on the Arena AI text leaderboard at launch. The 26B A4B variant uses a sparse architecture, activating approximately 4 billion parameters per token to reduce inference latency while maintaining high output quality. The edge tier models, E4B and E2B, target consumer hardware and embedded systems like the Raspberry Pi and Jetson Nano.
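To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing, a common way mixture-of-experts layers pick which parameters run for each token. The article does not describe Gemma 4's actual router, so the function name, the four-expert setup, and k=2 are illustrative assumptions; only the 4B and 26B figures come from the text above.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Generic top-k gating sketch; NOT confirmed to match Gemma 4's router.
    """
    # Softmax over all experts (shifted by the max for numerical stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the rest are skipped for this token.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# Hypothetical router scores for one token over four experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)

# Share of parameters touched per token, using the article's ~4B of 26B.
active_share = 4 / 26
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.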

Reasoning and Multimodal Architecture

The 31B Dense model achieves 89.2% on the AIME 2026 math benchmark. This performance relies on a new thinking mode that uses a dedicated <|channel>thought\n tag to output reasoning traces before generating a final response. For developers building systems that require autonomous execution, the models include native support for the system role and robust function calling capabilities.
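If you consume raw model output, you will probably want to separate the reasoning trace from the final answer. The sketch below does that for the opening tag quoted above. The article does not say how a trace is terminated, so the closing marker is passed in by the caller, and `<END_THOUGHT>` in the example is a hypothetical placeholder, not a real Gemma 4 token. Check the model's chat template for the actual delimiter.

```python
THOUGHT_OPEN = "<|channel>thought\n"  # opening tag as quoted in the article

def split_thought(text, close_marker):
    """Split raw output into (reasoning_trace, final_answer).

    close_marker is supplied by the caller because the real closing
    delimiter is not given in the article.
    """
    start = text.find(THOUGHT_OPEN)
    if start == -1:
        # No thinking block present: everything is the answer.
        return "", text.strip()
    body_start = start + len(THOUGHT_OPEN)
    end = text.find(close_marker, body_start)
    if end == -1:
        # Unterminated trace: treat the remainder as reasoning.
        return text[body_start:].strip(), text[:start].strip()
    thought = text[body_start:end]
    answer = text[:start] + text[end + len(close_marker):]
    return thought.strip(), answer.strip()

# "<END_THOUGHT>" is a made-up placeholder used only for this example.
raw = "<|channel>thought\n2 + 2 is 4.<END_THOUGHT>The answer is 4."
thought, answer = split_thought(raw, "<END_THOUGHT>")
```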

Vision processing across all models relies on 2D spatial RoPE, which encodes each image patch's position as an (x, y) coordinate pair. Text generation uses a hybrid architecture that interleaves sliding-window attention layers with full attention layers at a 5:1 ratio. This design allows the workstation tier to maintain its 256K context window while limiting memory consumption.
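A rough way to see why the hybrid layout saves memory is to count how many token positions each layer must keep in its key-value cache. The sketch below assumes five sliding-window layers for every full-attention layer, which is one reading of the 5:1 ratio. The 12-layer depth and 1,024-token window are hypothetical values chosen for illustration, not published Gemma 4 settings.

```python
def layer_pattern(num_layers, sliding_per_full=5):
    """Label each layer; every (sliding_per_full + 1)-th layer uses full attention."""
    period = sliding_per_full + 1
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]

def kv_cache_tokens(pattern, context_len, window):
    """Total cached token positions, summed over all layers."""
    return sum(
        context_len if kind == "full" else min(window, context_len)
        for kind in pattern
    )

CONTEXT = 256 * 1024   # 256K context, from the article
WINDOW = 1024          # hypothetical sliding-window size
LAYERS = 12            # hypothetical layer count

pattern = layer_pattern(LAYERS)
hybrid = kv_cache_tokens(pattern, CONTEXT, WINDOW)
all_full = kv_cache_tokens(["full"] * LAYERS, CONTEXT, WINDOW)
ratio = hybrid / all_full  # fraction of the all-full-attention cache
```

Under these assumptions the hybrid stack caches roughly a sixth of the positions that an all-full-attention stack would, because only the full-attention layers scale with the whole context.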

The E4B and E2B models also process native audio through a conformer-based architecture. This allows the smaller models to perform automatic speech recognition (ASR) and translation offline, directly on the device, without routing audio through external speech-to-text APIs.

Framework Support and Ecosystem

Day-one support is available across major inference frameworks, including transformers, llama.cpp, MLX, and Unsloth. The community has already published 4-bit quantized versions (Q4_K_M) on Hugging Face.
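For a quick sense of what 4-bit quantization buys you, the back-of-the-envelope calculation below estimates weight storage from parameter count and bits per weight. Treat the result as a lower bound: Q4_K_M files store scaling metadata and keep some tensors at higher precision, so real downloads are larger, and the KV cache and activations need memory on top. The helper name is illustrative.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# 31B Dense at 16-bit versus an idealized flat 4 bits per weight.
fp16_gb = weight_gb(31, 16)
int4_gb = weight_gb(31, 4)
```

At a flat 4 bits per weight the 31B model needs about a quarter of its 16-bit footprint, which is the difference between multi-GPU hardware and a single high-memory consumer card or workstation.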

The models are optimized for hardware ranging from the NVIDIA RTX 5090 and DGX Spark to the Apple Mac M3 Ultra and mobile platforms from Qualcomm and MediaTek. With day-one framework support, you can run them locally right away using standard open-source tooling.

Early benchmark comparisons indicate the 31B model competes closely with Alibaba’s Qwen 3.5 27B on specific logic tasks. The E4B variant is positioned around intelligence per parameter, aiming to bring frontier-level performance to consumer laptops.

If you build embedded AI applications or local desktop agents, the shift to Apache 2.0 simplifies your compliance requirements. You can now package and distribute the E2B and E4B models directly inside commercial mobile applications without relying on specialized enterprise licensing agreements.
