Gemma 4 Arrives With Full Apache 2.0 License
Google releases Gemma 4, a new generation of open models optimized for advanced reasoning, agentic workflows, and high-performance edge deployment.
Google DeepMind’s Gemma 4 release moves the model family to the Apache 2.0 license. The release introduces four variants designed for advanced reasoning and autonomous execution. By dropping the restrictive custom terms of previous generations, Google allows developers to deploy, modify, and redistribute these models commercially, subject only to Apache 2.0's standard attribution and notice requirements.
Model Variants and Tiers
The Gemma 4 lineup scales from edge devices to enterprise hardware across two distinct tiers. The models are derived directly from the proprietary Gemini 3 architecture.
| Model | Architecture | Context Window | Target Hardware |
|---|---|---|---|
| 31B Dense | Standard Dense | 256K | Workstation / Cloud |
| 26B A4B | Mixture-of-Experts (~4B active) | 256K | Workstation / Low Latency |
| Effective 4B (E4B) | Compact Dense | 128K | Laptops / High-end Mobile |
| Effective 2B (E2B) | Compact Dense | 128K | Smartphones / IoT |
The flagship 31B Dense model prioritizes raw reasoning capability and ranked third globally on the Arena AI text leaderboard at launch. The 26B A4B variant uses a sparse architecture, activating approximately 4 billion parameters per token to reduce inference latency while maintaining high output quality. The edge tier models, E4B and E2B, target consumer hardware and embedded systems like the Raspberry Pi and Jetson Nano.
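To make the "approximately 4 billion active parameters" idea concrete, here is a minimal sketch of top-k expert routing, the general technique behind sparse Mixture-of-Experts layers. It is not Gemma 4's actual router: the expert count (8), the number of active experts (2), and the helper names `top_k_route` and `moe_forward` are all hypothetical and chosen only for illustration.

```python
import math

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected logits so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k):
    """Combine the outputs of the selected experts, weighted by the router."""
    weights = top_k_route(gate_logits, k)
    # Only the selected experts run; the others cost no compute for this token.
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy setup (NOT Gemma 4's configuration): 8 scalar "experts", 2 active.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

weights = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)

# Rough share of weights used per token, from the article's ~4B of 26B figures.
active_fraction = 4 / 26
print(f"selected experts: {sorted(weights)}, active share: {active_fraction:.0%}")
```

The point of the design is that per-token compute scales with the active experts rather than the full parameter count, which is why a 26B model can respond with latency closer to that of a much smaller dense model.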
Reasoning and Multimodal Architecture
The 31B Dense model achieves 89.2% on the AIME 2026 math benchmark. This performance relies on a new thinking mode that uses a dedicated <|channel>thought\n tag to output reasoning traces before generating a final response. For developers building systems that require autonomous execution, the models include native support for the system role and robust function calling capabilities.
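An application that consumes thinking-mode output usually needs to separate the reasoning trace from the final answer. The sketch below shows one way to do that on raw generated text. The opening tag is the one quoted above; the closing marker `<channel|>` is an assumption, since the article does not state how a trace ends, so it is exposed as a parameter. `split_thinking` is a hypothetical helper, not part of any library.

```python
THOUGHT_OPEN = "<|channel>thought\n"  # opening tag as quoted in the article
THOUGHT_CLOSE = "<channel|>"          # ASSUMPTION: closing marker is not given in the article

def split_thinking(text, open_tag=THOUGHT_OPEN, close_tag=THOUGHT_CLOSE):
    """Return (reasoning, answer); reasoning is None when no trace is present."""
    start = text.find(open_tag)
    if start == -1:
        return None, text.strip()
    body_start = start + len(open_tag)
    end = text.find(close_tag, body_start)
    if end == -1:
        # Unterminated trace (for example, a truncated generation):
        # treat everything after the opening tag as reasoning.
        return text[body_start:].strip(), ""
    reasoning = text[body_start:end].strip()
    # The answer is whatever surrounds the trace.
    answer = (text[:start] + text[end + len(close_tag):]).strip()
    return reasoning, answer

raw = "<|channel>thought\n2 + 2 is 4.<channel|>The answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

In practice, inference frameworks and chat templates may already strip or expose the trace for you, so check the tokenizer's template before relying on string parsing like this.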
Vision processing across all models relies on 2D spatial RoPE (rotary position embeddings), which encodes each image patch's position by its x and y coordinates. Text generation uses a hybrid attention stack that interleaves sliding-window layers with full-attention layers at a 5:1 ratio. Because only the full-attention layers need to cache the entire sequence, this design lets the Workstation tier sustain its 256K context window with lower memory consumption.
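The memory effect of the 5:1 interleaving can be estimated with simple arithmetic. The sketch below labels layers as sliding-window ("local") or full-attention ("global") and counts how many token positions each layer keeps in its key-value cache. The layer count (48), the window size (1,024 tokens), and the choice to place the global layer last in each group of six are assumptions made for illustration; the article specifies only the ratio and the 256K context.

```python
def layer_pattern(num_layers, local_per_global=5):
    """Label each layer 'local' (sliding window) or 'global' (full attention)."""
    period = local_per_global + 1
    # ASSUMPTION: the last layer of each group of `period` layers is global.
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]

def cached_positions(num_layers, context_len, window, local_per_global=5):
    """Count token positions held in the KV cache, summed over all layers."""
    total = 0
    for kind in layer_pattern(num_layers, local_per_global):
        # Global layers cache the whole sequence; local layers cache one window.
        total += context_len if kind == "global" else min(window, context_len)
    return total

# Hypothetical sizes: 48 layers and a 1,024-token window are NOT from the article.
layers, context, window = 48, 256 * 1024, 1024

hybrid = cached_positions(layers, context, window)
full = layers * context  # every layer using full attention
savings = 1 - hybrid / full

print(f"hybrid cache positions: {hybrid:,}")
print(f"full-attention cache positions: {full:,}")
print(f"reduction: {savings:.1%}")
```

Under these assumed sizes the cache is dominated by the few global layers, so the reduction approaches the share of local layers (five in six) as the context grows.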
The E4B and E2B models also process native audio through a conformer-based architecture. This allows the smaller models to perform automatic speech recognition (ASR) and translation offline, directly on the device, without routing audio through external speech-to-text APIs.
Framework Support and Ecosystem
Day-one support is available across major inference frameworks, including transformers, llama.cpp, MLX, and Unsloth. The community has already published 4-bit quantized versions (Q4_K_M) on Hugging Face.
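A quick back-of-the-envelope calculation shows why 4-bit quantization matters for local deployment. The sketch below estimates weight storage only, ignoring the KV cache and runtime overhead. The parameter counts are read off the model names; the figure of about 4.8 bits per weight for Q4_K_M is an assumed rough average, and actual file sizes vary by model.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Parameter counts taken from the model names. E2B and E4B are omitted because
# their "effective" sizes may not match the number of stored parameters.
models = {"31B Dense": 31e9, "26B A4B": 26e9}

# ASSUMPTION: ~4.8 bits per weight as a rough average for Q4_K_M.
Q4_K_M_BITS = 4.8

estimates = {
    name: {
        "bf16_gb": weight_memory_gb(n, 16),
        "q4_k_m_gb": weight_memory_gb(n, Q4_K_M_BITS),
    }
    for name, n in models.items()
}

for name, est in estimates.items():
    print(f"{name}: bf16 ~{est['bf16_gb']:.0f} GB, Q4_K_M ~{est['q4_k_m_gb']:.1f} GB")
```

Note that the 26B A4B model still needs all of its weights resident in memory: sparse activation reduces compute per token, not the storage footprint.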
The models are optimized for hardware ranging from the NVIDIA RTX 5090 and DGX Spark to the Apple Mac M3 Ultra and mobile platforms from Qualcomm and MediaTek, so you can run them locally with standard open-source tooling. Early benchmark comparisons indicate the 31B model competes closely with Alibaba’s Qwen 3.5 27B on specific logic tasks. The E4B variant is positioned around capability per parameter, aiming to bring strong reasoning performance to consumer laptops.
If you build embedded AI applications or local desktop agents, the shift to Apache 2.0 simplifies your compliance requirements. You can now package and distribute the E2B and E4B models directly inside commercial mobile applications without relying on specialized enterprise licensing agreements.