NVIDIA Nemotron-Labs-Diffusion Yields 6x TPF Over Qwen3-8B

On May 23, 2026, NVIDIA researchers detailed the Nemotron-Labs-Diffusion family on Hugging Face, introducing open-weight language models that transition text generation from a memory-bound to a compute-bound process. The release includes 3B, 8B, and 14B parameter models capable of switching between three distinct decoding modes using a single set of weights.

Tri-Mode Architecture

Unlike strictly autoregressive large language models, this architecture is trained with a joint AR-diffusion objective. By changing the attention pattern at inference time, developers can swap decoding strategies without loading a different set of weights into VRAM.

The Autoregressive (AR) Mode handles standard left-to-right decoding via causal attention, suited for high-concurrency cloud deployments. The Diffusion Mode shifts to parallel decoding, denoising multiple tokens simultaneously. This bypasses the one-token-per-pass memory bottleneck of autoregressive decoding and yields more tokens per forward pass (TPF).
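
To make the idea of switching modes by changing the attention pattern concrete, here is a minimal sketch in plain Python. The article does not specify the exact masks the models use, so the block-wise pattern below (full attention inside a block, causal attention across blocks) is an assumption borrowed from common block-diffusion designs, and the function names are hypothetical.

```python
def causal_mask(n):
    """AR mode: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_mask(n, block_size):
    """Assumed diffusion-style mask (not confirmed by the source):
    tokens see every token in their own block and in earlier blocks."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(block_mask(4, 2)))
```

The same weights are applied in both cases; only the boolean mask passed to attention differs, which is why no second checkpoint would need to sit in VRAM.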

The most efficient option is the Self-Speculation Mode. The model drafts a block of tokens using its internal diffusion capabilities and verifies them using its AR capabilities in the same forward pass. Because it utilizes a shared KV cache, this design eliminates the requirement for a separate, smaller drafter model. If you build custom AI inference pipelines, the joint objective allows you to toggle speculation based on live traffic loads.
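
The draft-then-verify step can be illustrated with the generic greedy acceptance rule used in speculative decoding. This is a sketch under stated assumptions, not NVIDIA's implementation: `accept_draft` is a hypothetical helper, the token lists stand in for model outputs, and the real model performs drafting and verification inside one forward pass rather than as two separate calls.

```python
def accept_draft(draft, verified):
    """Decide which tokens to commit after one draft-and-verify step.

    draft:    tokens proposed in parallel by the diffusion-style drafter.
    verified: tokens the AR verifier would pick at each position, given the
              prefix plus the preceding draft tokens. It has one more entry
              than `draft` (the verifier's prediction after the last draft).
    """
    committed = []
    for proposed, checked in zip(draft, verified):
        if proposed != checked:
            # First mismatch: keep the verifier's token and stop.
            committed.append(checked)
            return committed
        committed.append(proposed)
    # Every draft token matched, so the extra verifier token comes for free.
    if len(verified) > len(draft):
        committed.append(verified[len(draft)])
    return committed


if __name__ == "__main__":
    print(accept_draft([1, 2, 3], [1, 2, 9, 4]))  # mismatch at third token
    print(accept_draft([1, 2, 3], [1, 2, 3, 4]))  # full acceptance plus bonus
```

The number of tokens committed per step is what the benchmarks below report as acceptance length.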

Benchmark Results and Hardware Scaling

The shift to parallel decoding yields distinct throughput advantages over existing open-source architectures. The 8B parameter model achieves 5.9× to 6× higher TPF than Qwen3-8B while raising average accuracy by 1.2%.

In Self-Speculation mode, the model averages an acceptance length of 6.82 tokens per draft step, longer than the figures reported for existing speculative decoding drafters.

Drafter Architecture	Acceptance Length (Tokens)
Eagle3	2.75
MTP	4.24
Nemotron Self-Speculation	6.82
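
For readers comparing the rows above, the relative gain is simple arithmetic. The sketch below uses only the three figures from the table; the helper name `relative_gain` is hypothetical.

```python
# Acceptance lengths (tokens per draft step) as listed in the table above.
ACCEPTANCE_LENGTH = {
    "Eagle3": 2.75,
    "MTP": 4.24,
    "Nemotron Self-Speculation": 6.82,
}


def relative_gain(table, target):
    """Ratio of the target's acceptance length to every other entry."""
    base = table[target]
    return {name: round(base / value, 2)
            for name, value in table.items() if name != target}


if __name__ == "__main__":
    for name, ratio in relative_gain(
            ACCEPTANCE_LENGTH, "Nemotron Self-Speculation").items():
        print(f"vs {name}: {ratio}x")
```

On these numbers, Self-Speculation accepts roughly 2.5 times as many tokens per step as Eagle3 and roughly 1.6 times as many as MTP.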

Hardware-specific throughput metrics reflect the compute-bound design optimizations. On an NVIDIA GB200, the 8B model hits 850 tok/sec at a batch size of 1, resulting in a 3.3× speedup over AR-only modes. Custom CUDA kernels push this throughput to 1,015 tok/sec. On the DGX Spark platform, using w4a16 quantization, the 8B model reaches 112 tok/sec, which is 2.7× faster than standard AR decoding. This architectural shift complements other recent infrastructure optimizations, such as approaches to cut LLM memory use without quality loss.
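
The article gives throughput and speedup figures but not the AR baselines themselves. The sketch below back-computes them, assuming each speedup was measured against AR decoding on the same hardware and settings, and that the custom-kernel figure is comparable to the 850 tok/sec run. Both are assumptions, and the derived values are estimates rather than reported results.

```python
def implied_baseline(tok_per_sec, speedup):
    """AR-only throughput implied by a reported throughput and speedup."""
    return tok_per_sec / speedup


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    gb200_ar = implied_baseline(850, 3.3)   # GB200, batch size 1
    spark_ar = implied_baseline(112, 2.7)   # DGX Spark, w4a16
    kernel_gain = percent_gain(1015, 850)   # custom CUDA kernels vs 850 tok/sec

    print(f"Implied GB200 AR baseline: {gb200_ar:.1f} tok/sec")
    print(f"Implied DGX Spark AR baseline: {spark_ar:.1f} tok/sec")
    print(f"Custom kernel gain: {kernel_gain:.1f}%")
```

Under those assumptions the implied AR baselines are about 258 tok/sec on GB200 and about 41 tok/sec on DGX Spark, and the custom kernels add roughly 19% on top of the 850 tok/sec figure.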

Available Variants and Ecosystem Integrations

NVIDIA released dense variants in 3B, 8B, and 14B sizes across Base, Instruct, and Vision-Language (VLM) categories under the NVIDIA Nemotron Open Model License. The 14B model weights ship in BF16 tensor format.

The multimodal variant, Nemotron-Labs-Diffusion-VLM-8B, pairs the 8B tri-mode backbone with a 24-layer vision encoder featuring a 1024-hidden dimension for image-text-to-text tasks.

For deployment, the models integrate immediately with SGLang, vLLM, and Transformers (versions 5.0.0 and higher). Training code and recipes are distributed through the NVIDIA Megatron Bridge framework. If you manage a production cluster, verify your vLLM deployments are updated to correctly expose the parallel decoding parameters.

NVIDIA researchers project that optimized samplers could increase diffusion throughput by an additional 76.5% in future iterations. If you deploy high-throughput inference endpoints today, evaluating the Self-Speculation mode on the 8B variant offers a path to roughly 2.5× the acceptance length of Eagle3 (6.82 versus 2.75 tokens per draft step) without allocating VRAM to a secondary drafter model.
