DiffusionGemma Shifts 26B Local Inference to Parallel Decoding
Google's 26B Mixture-of-Experts model abandons autoregressive generation for parallel text diffusion to hit 700 tokens per second on consumer GPUs.
Google DeepMind released DiffusionGemma, an experimental 26B model that replaces standard token-by-token generation with discrete text diffusion. Built on the Gemma 4 backbone, the open-weight release drafts and refines text in parallel blocks to bypass the memory bandwidth bottlenecks that typically constrain local throughput.
The architecture is a 26B parameter Mixture-of-Experts design with 3.8B parameters active during execution. Unlike autoregressive models, which predict one token at a time, DiffusionGemma uses Uniform State Diffusion to fill a 256-token canvas in parallel, refining every position over multiple denoising steps.
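The control flow of block-parallel denoising can be illustrated with a toy loop. This is only a sketch of the idea, not DiffusionGemma's sampler: the `toy_denoiser` function, the 1,000-entry vocabulary, the step count of 8, and the linear commit schedule are all hypothetical stand-ins chosen so the example runs without a model.

```python
import random

CANVAS_SIZE = 256          # block length stated in the article
VOCAB = list(range(1000))  # hypothetical toy vocabulary


def toy_denoiser(canvas, target):
    """Hypothetical stand-in for the model.

    A real denoiser would condition on the prompt and the current canvas.
    This one simply returns the known target, one proposal per position,
    which is enough to show the shape of the loop.
    """
    return list(target)


def denoise(target, steps, rng):
    # Uniform-state start: every position is drawn uniformly from the vocabulary.
    canvas = [rng.choice(VOCAB) for _ in range(len(target))]
    passes = 0
    for step in range(steps):
        proposal = toy_denoiser(canvas, target)  # one pass covers all positions
        passes += 1
        # Assumed schedule: the chance of accepting a proposal rises to 1.0
        # on the final step, so any position can still change along the way.
        accept_prob = (step + 1) / steps
        canvas = [p if rng.random() < accept_prob else c
                  for c, p in zip(canvas, proposal)]
    return canvas, passes


rng = random.Random(0)
target = [rng.choice(VOCAB) for _ in range(CANVAS_SIZE)]
canvas, passes = denoise(target, steps=8, rng=rng)
print(f"{passes} model passes for {len(canvas)} tokens")
```

The point of the sketch is the pass count: the loop calls the denoiser once per step regardless of block length, whereas token-by-token decoding would need one call per token.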
Dual Attention and Multimodal Context
The model separates its attention mechanisms into two distinct phases. During the prefill stage, it uses standard causal attention to ingest prompts and write the key-value caches. Once generation begins, the model switches to bidirectional attention over the generation canvas. This allows every token in the 256-token block to attend to all other tokens simultaneously.
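The two attention regimes can be pictured as a single boolean mask. The sketch below is an illustration under assumptions, not the model's implementation: the function name is hypothetical, and it assumes canvas tokens can also read the cached prompt positions, which the article implies but does not state outright.

```python
def build_attention_mask(prompt_len, canvas_len):
    """Return mask[q][k], where True means query q may attend to key k.

    Prompt rows are causal (each position sees itself and earlier positions).
    Canvas rows are bidirectional over the canvas and, by assumption,
    can also see every cached prompt position.
    """
    total = prompt_len + canvas_len
    mask = []
    for q in range(total):
        if q < prompt_len:
            # Prefill: standard causal attention over the prompt only.
            row = [k <= q for k in range(total)]
        else:
            # Generation: full visibility across prompt and canvas.
            row = [True] * total
        mask.append(row)
    return mask


mask = build_attention_mask(prompt_len=3, canvas_len=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a 3-token prompt and a 4-token canvas, the first three rows form the familiar lower triangle and the last four rows are fully filled, which is the property that lets any canvas token influence any other during denoising.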
The system supports a 256K token context window. It natively accepts interleaved text, images of variable aspect ratios, and video inputs to generate text outputs. Audio processing is not supported in this release.
Hardware Benchmarks
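DiffusionGemma is optimized specifically for single-user local inference workloads, where decoding is typically limited by memory bandwidth rather than compute. Because each forward pass refines many canvas tokens at once, the cost of reading the model weights is spread across the whole block instead of being paid once per token.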
| Hardware Setup | Performance Metric |
|---|---|
| NVIDIA H100 (FP8) | >1,000 tokens per second |
| NVIDIA GeForce RTX 5090 | >700 tokens per second |
| Desktop RTX 6000 Pro | Up to 6.7x faster than Gemma 4 |
When quantized using NVFP4 or GGUF formats, the model fits within an 18GB to 24GB VRAM footprint. This memory profile makes the 26B parameter model runnable on consumer hardware like the RTX 4090 and RTX 5090.
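A back-of-the-envelope calculation shows where that footprint comes from. The sketch assumes all 26B parameters stay resident in VRAM (as is usual for Mixture-of-Experts inference without offloading) and counts weights only. The function name is hypothetical, and the flat bits-per-parameter figures are simplifications, since real quantization formats store extra scaling data.

```python
TOTAL_PARAMS = 26e9  # total parameter count stated in the article


def weight_gigabytes(bits_per_param):
    """Weight-only storage in decimal gigabytes, ignoring format overhead."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(bits):.1f} GB")
```

At 4 bits the weights alone come to about 13 GB. The distance between that figure and the quoted 18GB to 24GB range would plausibly be taken up by quantization scales, any layers kept at higher precision, key-value caches, and activations, though the article does not break the number down.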
Generation Dynamics and Tradeoffs
The shift to diffusion changes both the generation experience and the characteristics of the output. Bidirectional attention enables Self-Correction, allowing the model to detect and fix logical inconsistencies across the entire text block during the denoising process. The release also includes a built-in Thinking Mode for step-by-step reasoning prior to final output.
Because the model emits text in complete 256-token blocks rather than streaming individual tokens, developers must account for the “Polaroid effect”: nothing is visible until a block has finished denoising, and then the whole block appears at once. This results in higher first-token latency than autoregressive models, even though overall throughput is up to four times higher. Early benchmarks also indicate a slight degradation in reasoning quality compared to the standard Gemma 4 models.
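The wait before the first text appears can be estimated from the figures quoted in this article. This is a rough sketch: it assumes the reported tokens-per-second rates apply to a single block and that prefill time is negligible, neither of which the article confirms. The helper names are hypothetical.

```python
BLOCK_TOKENS = 256  # block size stated in the article


def seconds_per_block(tokens_per_second):
    """Time before a full block becomes visible at a given throughput."""
    return BLOCK_TOKENS / tokens_per_second


def blocks_needed(response_tokens):
    """Whole blocks required to cover a response (ceiling division)."""
    return -(-response_tokens // BLOCK_TOKENS)


for label, rate in (("RTX 5090", 700), ("H100 FP8", 1000)):
    print(f"{label}: ~{seconds_per_block(rate):.3f} s before the first block")

print("Blocks for a 600-token answer:", blocks_needed(600))
```

Under these assumptions, a user on the 700 tokens-per-second setup would wait roughly a third of a second before seeing anything, then receive 256 tokens at once, which is the pattern an interface has to be designed around.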
Google launched the model under an Apache 2.0 license with immediate ecosystem integration. The release includes day-zero support for Hugging Face Transformers, MLX, NVIDIA NeMo, and vLLM via a new ModelState abstraction, with llama.cpp support currently in development.
If you build applications that depend on real-time streaming text, the block-generation latency makes this model a poor fit for your workload. For offline batch processing, background agent tasks, or document summarization on local hardware, however, DiffusionGemma trades initial response time for sustained throughput and higher GPU utilization.