Gemma 4 Arrives With Full Apache 2.0 License
Google releases Gemma 4, a new generation of open models optimized for advanced reasoning, agentic workflows, and high-performance edge deployment.
Google DeepMind’s Gemma 4 release shifts the model family to a fully open Apache 2.0 license. The release introduces four variants designed for advanced reasoning and autonomous execution. By dropping the restrictive custom terms of previous generations, Google allows developers to deploy, modify, and redistribute these models commercially, subject only to Apache 2.0's standard license and notice requirements.
Model Variants and Tiers
The Gemma 4 lineup scales from edge devices to enterprise hardware across two tiers: a Workstation tier (31B Dense and 26B A4B) and an edge tier (E4B and E2B). The models are derived directly from the proprietary Gemini 3 architecture.
| Model | Architecture | Context Window | Target Hardware |
|---|---|---|---|
| 31B Dense | Standard Dense | 256K | Workstation / Cloud |
| 26B A4B | Mixture-of-Experts (~4B active) | 256K | Workstation / Low Latency |
| Effective 4B (E4B) | Compact Dense | 128K | Laptops / High-end Mobile |
| Effective 2B (E2B) | Compact Dense | 128K | Smartphones / IoT |
The flagship 31B Dense model prioritizes raw reasoning capability and ranked third globally on the Arena AI text leaderboard at launch. The 26B A4B variant uses a sparse architecture, activating approximately 4 billion parameters per token to reduce inference latency while maintaining high output quality. The edge tier models, E4B and E2B, target consumer hardware and embedded systems like the Raspberry Pi and Jetson Nano.
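To make the sparse activation concrete, here is a minimal sketch of top-k expert routing, the general technique behind Mixture-of-Experts layers. It is a toy illustration, not Gemma 4's actual router: the expert count, the value of `k`, and the names `top_k_route` and `moe_forward` are all assumptions, and the "experts" are plain scalar functions.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Hypothetical toy setup: four experts, two active per token.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
logits = [0.0, 2.0, 2.0, -1.0]
output = moe_forward(1.0, experts, logits, k=2)

# Share of weights touched per token, using the article's approximate figures.
active_fraction = 4 / 26
```

Only the experts chosen by the router run for a given token, which is why per-token compute tracks the roughly 4 billion active parameters rather than the full 26 billion. The full set of weights still has to be resident in memory.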
Reasoning and Multimodal Architecture
The 31B Dense model achieves 89.2% on the AIME 2026 math benchmark. This performance relies on a new thinking mode that uses a dedicated <|channel>thought\n tag to output reasoning traces before generating a final response. For developers building systems that require autonomous execution, the models include native support for the system role and robust function calling capabilities.
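An application that consumes thinking-mode output usually needs to separate the reasoning trace from the final answer. The sketch below shows one way to do that with plain string handling. The opening tag is the one quoted above; the closing marker is not given here, so `close_tag` is a required argument and the `<END_THOUGHT>` value in the usage lines is a made-up placeholder, as is the function name `split_thought`.

```python
def split_thought(text, *, close_tag, open_tag="<|channel>thought\n"):
    """Split raw model output into (reasoning_trace, final_answer).

    `close_tag` must be supplied by the caller: the real closing marker
    is an assumption here and should come from the model's chat template.
    """
    start = text.find(open_tag)
    if start == -1:
        # No reasoning trace present: everything is the answer.
        return "", text.strip()
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Trace never closed (for example, truncated generation).
        return text[body_start:].strip(), ""
    thought = text[body_start:end].strip()
    answer = (text[:start] + text[end + len(close_tag):]).strip()
    return thought, answer

# Hypothetical raw output using a placeholder closing marker.
raw = "<|channel>thought\n2 + 2 is 4.<END_THOUGHT>The answer is 4."
thought, answer = split_thought(raw, close_tag="<END_THOUGHT>")
```

Keeping the trace separate lets an agent log or discard the reasoning while passing only the final answer to downstream tools.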
Vision processing across all models relies on 2D spatial RoPE, which encodes each image patch position as a pair of x and y coordinates. Text generation uses a hybrid architecture that interleaves sliding-window attention with full attention at a 5:1 ratio. Because only the full-attention portion has to attend over the entire sequence, this design allows the Workstation tier to maintain its 256K context window while reducing memory consumption.
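The memory effect of the 5:1 pattern can be estimated with some simple arithmetic. The sketch below assumes the ratio means five sliding-window layers followed by one full-attention layer, repeated through the stack. The layer count and window size are hypothetical values chosen for illustration, not published Gemma 4 figures, and the function names are invented for this example.

```python
def layer_schedule(num_layers, sliding_per_full=5):
    """Label each layer, placing one full-attention layer after every
    `sliding_per_full` sliding-window layers."""
    period = sliding_per_full + 1
    return ["full" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]

def cached_tokens(schedule, context_len, window):
    """Total key/value positions kept across all layers for one sequence."""
    return sum(context_len if kind == "full" else min(context_len, window)
               for kind in schedule)

# Hypothetical sizes for illustration only.
NUM_LAYERS = 60
WINDOW = 1024
CONTEXT = 256 * 1024  # 256K tokens

schedule = layer_schedule(NUM_LAYERS)
hybrid = cached_tokens(schedule, CONTEXT, WINDOW)
all_full = NUM_LAYERS * CONTEXT

print(schedule[:6])
print(f"hybrid cache is {hybrid / all_full:.1%} of an all-full-attention cache")
```

Under these assumed numbers the hybrid stack keeps about 17% of the key/value positions that an all-full-attention stack would, because sliding-window layers never cache more than their window.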
The E4B and E2B models also process native audio through a conformer-based architecture. This allows the smaller models to perform offline automatic speech recognition (ASR) and translation directly on the device without routing through external speech-to-text APIs.
Framework Support and Ecosystem
Day-one support is available across major inference frameworks, including transformers, llama.cpp, MLX, and Unsloth. The community has already published 4-bit quantized versions (Q4_K_M) on Hugging Face.
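A quick back-of-the-envelope calculation shows why 4-bit quantization matters for local use. The helper below estimates weight storage only, from parameter count and bits per weight. It ignores the KV cache and runtime overhead, and it treats Q4_K_M as exactly 4 bits per weight, which is a simplification: that format mixes precisions and averages somewhat more than 4 bits, so real files will be larger. The function name is an assumption made for this sketch.

```python
def weight_gigabytes(params_billions, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only).

    One billion parameters at 8 bits each is one gigabyte, so the
    result is simply params_billions * bits_per_weight / 8.
    """
    return params_billions * bits_per_weight / 8

# Parameter counts taken from the model names in the table above.
dense_16bit = weight_gigabytes(31, 16)  # 31B Dense at 16-bit precision
dense_4bit = weight_gigabytes(31, 4)    # 31B Dense at a nominal 4 bits
moe_4bit = weight_gigabytes(26, 4)      # 26B A4B: all experts stay in memory

print(dense_16bit, dense_4bit, moe_4bit)
```

Note that the 26B A4B estimate uses the full 26 billion parameters. The sparse design reduces compute per token, but every expert still has to be loaded.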
The models are optimized for hardware ranging from the NVIDIA RTX 5090 and DGX Spark to the Apple Mac M3 Ultra and mobile platforms from Qualcomm and MediaTek, and they can be run locally today using standard open-source infrastructure.
Early benchmark comparisons indicate the 31B model competes closely with Alibaba’s Qwen 3.5 27B on specific logic tasks. The E4B variant delivers high intelligence per parameter, bringing frontier-level performance to consumer laptops.
If you build embedded AI applications or local desktop agents, the shift to Apache 2.0 simplifies your compliance requirements. You can now package and distribute the E2B and E4B models directly inside commercial mobile applications without relying on specialized enterprise licensing agreements.