DiffusionGemma Shifts 26B Local Inference to Parallel Decoding

Google DeepMind released DiffusionGemma, an experimental 26B model that replaces standard token-by-token generation with discrete text diffusion. Built on the Gemma 4 backbone, the open-weight release drafts and refines text in parallel blocks to bypass the memory bandwidth bottlenecks that typically constrain local throughput.

The architecture is a 26B-parameter Mixture-of-Experts design with 3.8B parameters active during execution. Unlike autoregressive models, which predict one token at a time, DiffusionGemma uses Uniform State Diffusion to fill a 256-token canvas in parallel over multiple denoising steps.
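To make the contrast with token-by-token decoding concrete, here is a toy sketch of a block-parallel denoising loop. It assumes that "Uniform State Diffusion" means the canvas starts as uniformly random token ids and is refined in place; the denoiser is a hypothetical stand-in that simply returns a known target, and the vocabulary size and step count are illustrative values, not figures from the release.

```python
import random

VOCAB_SIZE = 1000   # hypothetical toy vocabulary, not the real tokenizer size
CANVAS_LEN = 256    # block size stated in the article


def toy_denoiser(canvas, target):
    """Stand-in for the real network: predicts every canvas position in one call."""
    return list(target)


def generate_block(target, num_steps=8, seed=0):
    """Refine a canvas of uniform noise into `target`, counting forward passes."""
    rng = random.Random(seed)
    # Start from pure uniform noise: every position holds a random token id.
    canvas = [rng.randrange(VOCAB_SIZE) for _ in range(len(target))]
    forward_passes = 0
    for step in range(1, num_steps + 1):
        prediction = toy_denoiser(canvas, target)  # one pass covers the whole canvas
        forward_passes += 1
        keep_prob = step / num_steps  # reaches 1.0 on the final step
        # Keep the prediction with probability keep_prob, otherwise re-noise the slot.
        canvas = [
            pred if rng.random() < keep_prob else rng.randrange(VOCAB_SIZE)
            for pred in prediction
        ]
    return canvas, forward_passes


target = [i % VOCAB_SIZE for i in range(CANVAS_LEN)]
canvas, passes = generate_block(target)
print(canvas == target, passes)
```

In this toy setup the 256-token block is finished after 8 passes, where a token-by-token decoder would need 256. The real step count and noise schedule are not stated in the article.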

Dual Attention and Multimodal Context

The model separates its attention mechanisms into two distinct phases. During the prefill stage, it uses standard causal attention to ingest prompts and write the key-value caches. Once generation begins, the model switches to bidirectional attention over the generation canvas. This allows every token in the 256-token block to attend to all other tokens simultaneously.
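The two phases can be illustrated by the attention masks they imply. This is a minimal sketch in plain Python: the function name and phase labels are hypothetical, and letting canvas queries attend to every cached prompt key is an assumption, since the article only states that attention over the canvas itself is bidirectional.

```python
def build_mask(prompt_len, canvas_len, phase):
    """Return mask[q][k], True when query position q may attend to key position k."""
    if phase == "prefill":
        # Causal: each prompt token sees itself and earlier prompt tokens only.
        return [[k <= q for k in range(prompt_len)] for q in range(prompt_len)]
    if phase == "generate":
        # Assumption: each canvas query sees the cached prompt keys plus
        # every canvas position, including later ones (bidirectional).
        total_keys = prompt_len + canvas_len
        return [[True] * total_keys for _ in range(canvas_len)]
    raise ValueError(f"unknown phase: {phase}")


prefill_mask = build_mask(prompt_len=3, canvas_len=4, phase="prefill")
generate_mask = build_mask(prompt_len=3, canvas_len=4, phase="generate")
for row in prefill_mask:
    print(row)
```

The prefill mask is lower-triangular, while the generation mask has no masked entries, which is what lets a token early in the block be revised in light of tokens later in the block.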

The system supports a 256K token context window. It natively accepts interleaved text, images of variable aspect ratios, and video inputs to generate text outputs. Audio processing is not supported in this release.

Hardware Benchmarks

DiffusionGemma is optimized for single-user local inference workloads. Because each denoising pass updates a whole block of tokens, the model needs fewer memory accesses per generated token than a decoder that runs one pass per token.

Hardware Setup	Performance Metric
NVIDIA H100 (FP8)	>1,000 tokens per second
NVIDIA GeForce RTX 5090	>700 tokens per second
Desktop RTX 6000 Pro	Up to 6.7x faster than Gemma 4

When quantized using NVFP4 or GGUF formats, the model fits within an 18GB to 24GB VRAM footprint. This memory profile makes the 26B-parameter model runnable on consumer hardware such as the RTX 4090 and RTX 5090.
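A back-of-the-envelope calculation shows why quantization matters here. The sketch below counts weight storage only, assuming every one of the 26B parameters must be resident in VRAM (in a Mixture-of-Experts model all experts are normally loaded, even though only 3.8B parameters are active at a time). The helper name is hypothetical, and the bit widths are round illustrative values rather than the exact layouts of NVFP4 or any GGUF variant.

```python
TOTAL_PARAMS = 26e9  # total parameter count stated in the article


def weight_gigabytes(num_params, bits_per_param):
    """Weight storage in decimal gigabytes, ignoring scales, KV cache and activations."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

At 16 bits the weights alone come to 52 GB, at 8 bits 26 GB, and at 4 bits 13 GB. The gap between 13 GB and the quoted 18GB to 24GB footprint is plausibly taken up by the key-value cache, activations, quantization scales, and any tensors kept at higher precision, though the article does not break this down.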

Generation Dynamics and Tradeoffs

The shift to diffusion changes both the user experience and the character of the output. Bidirectional attention enables Self-Correction, allowing the model to evaluate and fix logical inconsistencies across the entire text block during the denoising process. The release also includes a built-in Thinking Mode for step-by-step reasoning prior to final output.

Because the model outputs text in complete 256-token blocks rather than streaming individual words, developers must account for the “Polaroid effect.” This execution pattern results in higher first-token latency compared to autoregressive models, even though overall throughput is up to four times faster. Early benchmarks also indicate a slight degradation in reasoning quality compared to the standard Gemma 4 models.
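The latency and throughput tradeoff can be put into rough numbers. The sketch below is illustrative arithmetic, not a benchmark: it takes the 700 tokens per second figure quoted for the RTX 5090 as the diffusion rate, assumes an autoregressive baseline running four times slower (from the "up to four times faster" claim), and ignores prefill time entirely. The function and constant names are hypothetical.

```python
BLOCK_TOKENS = 256              # block size stated in the article
DIFFUSION_TPS = 700.0           # tokens/s, RTX 5090 figure from the article
AUTOREGRESSIVE_TPS = 700.0 / 4  # assumed baseline, four times slower


def ms_until_visible(tokens_per_second, tokens_needed):
    """Milliseconds until `tokens_needed` tokens have been produced at a steady rate."""
    return tokens_needed / tokens_per_second * 1000


# A streaming decoder can show text after one token; a block decoder
# shows nothing until the whole block has been denoised.
stream_first = ms_until_visible(AUTOREGRESSIVE_TPS, 1)
block_first = ms_until_visible(DIFFUSION_TPS, BLOCK_TOKENS)
stream_full = ms_until_visible(AUTOREGRESSIVE_TPS, BLOCK_TOKENS)

print(f"first visible text: streaming {stream_first:.1f} ms, block {block_first:.1f} ms")
print(f"all 256 tokens:     streaming {stream_full:.1f} ms, block {block_first:.1f} ms")
```

Under these assumptions the streaming decoder shows its first token after about 6 ms, while the block decoder shows nothing for about 366 ms. The block decoder then delivers all 256 tokens at once, well ahead of the roughly 1,463 ms the streaming decoder needs to reach the same point.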

Google launched the model under an Apache 2.0 license with immediate ecosystem integration. The release includes day-zero support for Hugging Face Transformers, MLX, NVIDIA NeMo, and vLLM via a new ModelState abstraction, with llama.cpp support currently in development.

If you build applications requiring real-time streaming text, the block-generation latency makes this model unsuitable for your workload. However, for offline batch processing, background agent tasks, or document summarization on local hardware, DiffusionGemma maximizes GPU utilization by trading initial response time for sustained throughput.
