TPU v5p Inference Speeds Triple With DFlash Block-Diffusion
Google and UCSD researchers released DFlash, a block-diffusion speculative decoding method that achieves a 3.13x average inference speedup on TPU v5p hardware.
Google and researchers from the University of California, San Diego have released DFlash, a block-diffusion speculative decoding method optimized for Google TPU v5p hardware. The implementation achieves an average speedup of 3.13x over standard decoding, with peak gains approaching 6x on complex reasoning benchmarks. If you manage large-scale AI inference deployments, this architecture significantly lowers the latency floor for production environments.
Block-Diffusion Architecture
Standard speculative decoding methods like EAGLE-3 predict future tokens sequentially, resulting in O(K) computational complexity for a K-token draft. DFlash instead generates an entire block of up to 16 candidate tokens in a single parallel forward pass, shifting the drafting cost to O(1).
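To make the complexity difference concrete, here is a minimal Python sketch contrasting the two drafting loops; `draft_step`, `draft_block`, and `MASK` are illustrative stand-ins rather than DFlash's actual API.

```python
MASK = -1  # placeholder id for a not-yet-drafted position

def sequential_draft(draft_step, prefix, k):
    """EAGLE-style drafting: k dependent forward passes, so latency grows with k (O(K))."""
    tokens = list(prefix)
    for _ in range(k):
        tokens.append(draft_step(tokens))  # each call must wait for the previous token
    return tokens[len(prefix):]

def block_draft(draft_block, prefix, k):
    """Block-diffusion drafting: one parallel pass fills all k masked slots at once (O(1))."""
    masked_block = [MASK] * k              # up to 16 candidate positions generated together
    return draft_block(prefix, masked_block)
```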
The lightweight diffusion-style draft model relies on a conditioning mechanism to maintain accuracy: a diffusion adapter injects hidden features from the larger, frozen target model directly into the draft model’s KV cache. This lets the smaller drafter reuse the target model’s existing reasoning capacity rather than generating logic from scratch.
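A rough sketch of that conditioning path, assuming hypothetical adapter projections `w_k` and `w_v`; the real adapter's structure is not detailed in the release.

```python
import jax.numpy as jnp

def condition_draft_kv(target_hidden, w_k, w_v):
    """Project the frozen target model's hidden states into draft-model K/V entries.

    target_hidden: [seq_len, d_target] features from the target's forward pass.
    w_k, w_v:      [d_target, d_draft] adapter weights (the only trained pieces here).
    The returned tensors are written into the draft model's KV cache, so the small
    drafter attends over the target's representations instead of re-deriving them.
    """
    keys = target_hidden @ w_k
    values = target_hidden @ w_v
    return keys, values
```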
To handle the inherent incompatibility between non-causal block diffusion and standard paged attention, the engineering team built a dual-cache architecture. The target model utilizes a paged KV cache powered by Pallas kernels, while the draft model relies on static on-device JAX arrays.
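A minimal JAX sketch of how the two caches might be laid out, with illustrative shapes only; the shipped Pallas kernels and the actual cache geometry are not described in the announcement.

```python
import jax.numpy as jnp

# Target model: paged KV cache. Physical memory is a pool of fixed-size pages,
# and a per-sequence block table maps logical token positions to pages; Pallas
# kernels gather from this pool during verification.
num_pages, page_size, n_kv_heads, d_head = 256, 16, 8, 128
max_seqs, max_pages_per_seq = 8, 64
paged_kv = jnp.zeros((2, num_pages, page_size, n_kv_heads, d_head))         # K and V pools
block_table = jnp.full((max_seqs, max_pages_per_seq), -1, dtype=jnp.int32)  # -1 = unassigned page

# Draft model: per the article it keeps static on-device JAX arrays instead of
# paged memory. A fixed-shape preallocation sized to the 16-token draft block
# avoids any paging indirection inside the non-causal drafting pass.
draft_block_len = 16
draft_kv = jnp.zeros((2, max_seqs, draft_block_len, n_kv_heads, d_head))
```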
Performance Benchmarks
The research team integrated DFlash into the vLLM TPU Inference framework and tested it against previous state-of-the-art methods on TPU v5p accelerators. The UCSD team collaborated with Google Cloud engineers to optimize the underlying infrastructure. These adjustments minimize “K-Flat” verification costs, ensuring that memory bandwidth and TPU Matrix Multiplication Units (MXUs) remain fully saturated during the parallel drafting process.
| Task / Metric | DFlash Speedup | Previous SOTA (EAGLE-3) | Per-Token Latency (baseline → DFlash) |
|---|---|---|---|
| Average Speedup | 3.13x | 1.30x | - |
| Math (math500) | ~5.7x | ~2.0x | 8.02ms → 1.40ms per token |
| Coding (humaneval) | >3.5x | - | - |
| Coding (mbpp) | 2.83x | - | 9.81ms → 3.48ms per token |
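As a quick sanity check, the per-token latencies in the table imply the same ratios as the reported speedups:

```python
# Per-token latencies (ms) from the table above, baseline vs. DFlash.
math500 = 8.02 / 1.40   # ~5.7x, matching the math500 row
mbpp = 9.81 / 3.48      # ~2.8x, matching the 2.83x mbpp row
print(f"math500: {math500:.2f}x  mbpp: {mbpp:.2f}x")
```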
Ecosystem Availability
DFlash is now officially available within the open-source vLLM TPU ecosystem and the SGLang runtime. The UCSD researchers published over 14 pre-trained draft models on Hugging Face. The release includes support for popular architectures such as Qwen3 (8B, 30B), LLaMA-3.1, and Kimi K2.5.
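For teams who want to try it, draft models in vLLM are typically attached through the `speculative_config` argument. The method string and draft-model repository below are placeholders rather than confirmed identifiers from the DFlash release; consult the vLLM TPU documentation for the exact configuration.

```python
from vllm import LLM, SamplingParams

# Hypothetical configuration: the method name and draft-model repo are placeholders.
llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "dflash",                    # placeholder; see vLLM TPU docs
        "model": "<dflash-draft-model-repo>",  # one of the published Hugging Face draft models
        "num_speculative_tokens": 16,          # block width described in the article
    },
)
print(llm.generate(["Explain speculative decoding."], SamplingParams(max_tokens=64)))
```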
Future updates will focus on scaling to wider draft blocks using the TPU RL Stack (Tunix) and MaxText. Engineers running high-throughput pipelines should monitor the team’s ongoing work on Speculative Speculative Decoding (SSD), which will introduce speculation caches to reduce LLM API costs in production even further.