TPU v5p Inference Speeds Triple With DFlash Block-Diffusion
Google and UCSD researchers released DFlash, a block-diffusion speculative decoding method that achieves a 3.13x average inference speedup on TPU v5p hardware.
Google and researchers from the University of California, San Diego have released DFlash, a block-diffusion speculative decoding method optimized for Google TPU v5p hardware. The implementation achieves an average speedup of 3.13x over standard decoding, with peak gains approaching 6x on complex reasoning benchmarks. If you manage large-scale AI inference deployments, this architecture significantly lowers the latency floor for production environments.
Block-Diffusion Architecture
Standard speculative decoding methods like EAGLE-3 predict future tokens sequentially, resulting in O(K) computational complexity for a K-token draft. DFlash instead generates an entire block of up to 16 candidate tokens in a single parallel forward pass, shifting the drafting cost to O(1).
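To make the complexity difference concrete, here is a minimal Python sketch contrasting the two drafting loops; `draft_step`, `draft_block`, and `MASK` are illustrative stand-ins rather than DFlash's actual API.

```python
MASK = -1  # placeholder id for a not-yet-drafted position

def sequential_draft(draft_step, prefix, k):
    """EAGLE-style drafting: k dependent forward passes, so latency grows with k (O(K))."""
    tokens = list(prefix)
    for _ in range(k):
        tokens.append(draft_step(tokens))  # each call must wait for the previous token
    return tokens[len(prefix):]

def block_draft(draft_block, prefix, k):
    """Block-diffusion drafting: one parallel pass fills all k masked slots at once (O(1))."""
    masked_block = [MASK] * k              # up to 16 candidate positions generated together
    return draft_block(prefix, masked_block)
```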
The lightweight diffusion-style draft model relies on a conditioning mechanism to maintain accuracy: a diffusion adapter injects hidden features from the larger, frozen target model directly into the draft model’s KV cache. This lets the smaller drafter reuse the target model’s existing reasoning capacity rather than generating logic from scratch.
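A rough sketch of that conditioning path, assuming hypothetical adapter projections `w_k` and `w_v`; the real adapter's structure is not detailed in the release.

```python
import jax.numpy as jnp

def condition_draft_kv(target_hidden, w_k, w_v):
    """Project the frozen target model's hidden states into draft-model K/V entries.

    target_hidden: [seq_len, d_target] features from the target's forward pass.
    w_k, w_v:      [d_target, d_draft] adapter weights (the only trained pieces here).
    The returned tensors are written into the draft model's KV cache, so the small
    drafter attends over the target's representations instead of re-deriving them.
    """
    keys = target_hidden @ w_k
    values = target_hidden @ w_v
    return keys, values
```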
To handle the inherent incompatibility between non-causal block diffusion and standard paged attention, the engineering team built a dual-cache architecture. The target model utilizes a paged KV cache powered by Pallas kernels, while the draft model relies on static on-device JAX arrays.
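A minimal JAX sketch of how the two caches might be laid out, with illustrative shapes only; the shipped Pallas kernels and the actual cache geometry are not described in the announcement.

```python
import jax.numpy as jnp

# Target model: paged KV cache. Physical memory is a pool of fixed-size pages,
# and a per-sequence block table maps logical token positions to pages; Pallas
# kernels gather from this pool during verification.
num_pages, page_size, n_kv_heads, d_head = 256, 16, 8, 128
max_seqs, max_pages_per_seq = 8, 64
paged_kv = jnp.zeros((2, num_pages, page_size, n_kv_heads, d_head))         # K and V pools
block_table = jnp.full((max_seqs, max_pages_per_seq), -1, dtype=jnp.int32)  # -1 = unassigned page

# Draft model: per the article it keeps static on-device JAX arrays instead of
# paged memory. A fixed-shape preallocation sized to the 16-token draft block
# avoids any paging indirection inside the non-causal drafting pass.
draft_block_len = 16
draft_kv = jnp.zeros((2, max_seqs, draft_block_len, n_kv_heads, d_head))
```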
Performance Benchmarks
The research team integrated DFlash into the vLLM TPU Inference framework and tested it against previous state-of-the-art methods on TPU v5p accelerators. The UCSD team collaborated with Google Cloud engineers to optimize the underlying infrastructure. These adjustments minimize “K-Flat” verification costs, ensuring that memory bandwidth and TPU Matrix Multiplication Units (MXUs) remain fully saturated during the parallel drafting process.
| Task / Metric | DFlash Speedup | Previous SOTA (EAGLE-3) | Per-Token Latency (baseline → DFlash) |
|---|---|---|---|
| Average Speedup | 3.13x | 1.30x | - |
| Math (math500) | ~5.7x | ~2.0x | 8.02ms → 1.40ms per token |
| Coding (humaneval) | >3.5x | - | - |
| Coding (mbpp) | 2.83x | - | 9.81ms → 3.48ms per token |
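As a quick sanity check, the per-token latencies in the table imply the same ratios as the reported speedups:

```python
# Per-token latencies (ms) from the table above, baseline vs. DFlash.
math500 = 8.02 / 1.40   # ~5.7x, matching the math500 row
mbpp = 9.81 / 3.48      # ~2.8x, matching the 2.83x mbpp row
print(f"math500: {math500:.2f}x  mbpp: {mbpp:.2f}x")
```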
Ecosystem Availability
DFlash is now officially available within the open-source vLLM TPU ecosystem and the SGLang runtime. The UCSD researchers published over 14 pre-trained draft models on Hugging Face. The release includes support for popular architectures such as Qwen3 (8B, 30B), LLaMA-3.1, and Kimi K2.5.
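For teams who want to try it, draft models in vLLM are typically attached through the `speculative_config` argument. The method string and draft-model repository below are placeholders rather than confirmed identifiers from the DFlash release; consult the vLLM TPU documentation for the exact configuration.

```python
from vllm import LLM, SamplingParams

# Hypothetical configuration: the method name and draft-model repo are placeholders.
llm = LLM(
    model="Qwen/Qwen3-8B",
    speculative_config={
        "method": "dflash",                    # placeholder; see vLLM TPU docs
        "model": "<dflash-draft-model-repo>",  # one of the published Hugging Face draft models
        "num_speculative_tokens": 16,          # block width described in the article
    },
)
print(llm.generate(["Explain speculative decoding."], SamplingParams(max_tokens=64)))
```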
Future updates will focus on scaling to wider draft blocks using the TPU RL Stack (Tunix) and MaxText. Engineers running high-throughput pipelines should monitor the team’s ongoing work on Speculative Speculative Decoding (SSD), which will introduce speculation caches to reduce LLM API costs in production even further.