NVIDIA Nemotron-Labs-Diffusion Yields 6x TPF Over Qwen3-8B
NVIDIA has released the Nemotron-Labs-Diffusion model family, introducing a joint autoregressive and diffusion training objective to accelerate text generation.
On May 23, 2026, NVIDIA researchers detailed the Nemotron-Labs-Diffusion family on Hugging Face, introducing open-weight language models that transition text generation from a memory-bound to a compute-bound process. The release includes 3B, 8B, and 14B parameter models capable of switching between three distinct decoding modes using a single set of weights.
Tri-Mode Architecture
Unlike strictly autoregressive large language models, this architecture is trained with a joint AR-diffusion objective. By changing the attention pattern at inference time, developers can switch decoding strategies without loading a different set of weights into VRAM.
The Autoregressive (AR) Mode handles standard left-to-right decoding via causal attention and is suited to high-concurrency cloud deployments. The Diffusion Mode shifts to parallel decoding, denoising multiple tokens simultaneously. This bypasses the autoregressive memory bottleneck and yields more tokens per forward pass (TPF).
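To make the idea of "changing the attention pattern" concrete, here is a minimal sketch that contrasts a causal mask with a block-wise mask. The release notes only say that the attention pattern differs between modes; the specific block-wise pattern below (full attention inside a block, causal attention across blocks) is an assumption borrowed from common block-diffusion designs, and the function names are hypothetical.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_diffusion_mask(n: int, block_size: int) -> list[list[bool]]:
    """ASSUMED pattern: bidirectional inside a block, causal across blocks."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for masked."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(block_diffusion_mask(4, block_size=2)))
```

Both masks apply to the same weights; only the boolean pattern passed to attention changes, which is why no second checkpoint needs to be resident in VRAM.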
The most efficient option is the Self-Speculation Mode. The model drafts a block of tokens using its diffusion capabilities and verifies them using its AR capabilities in the same forward pass. Because drafting and verification share one KV cache, this design eliminates the need for a separate, smaller drafter model. If you build custom AI inference pipelines, the joint objective allows you to toggle speculation based on live traffic loads.
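The draft-then-verify step can be sketched with the generic greedy acceptance rule used in speculative decoding. This is an illustration under assumptions, not NVIDIA's implementation: `accept_longest_prefix` is a hypothetical helper, tokens are plain integers, and `verified[i]` stands for the AR prediction at position `i` given the context plus `draft[:i]`.

```python
from typing import List


def accept_longest_prefix(draft: List[int], verified: List[int]) -> List[int]:
    """Greedy verification (assumed rule): emit tokens until the first mismatch.

    The verifier's token is always safe to emit. Once it disagrees with the
    draft, the remaining draft tokens were conditioned on a wrong token and
    are discarded.
    """
    accepted: List[int] = []
    for drafted, checked in zip(draft, verified):
        accepted.append(checked)
        if drafted != checked:
            break
    return accepted


if __name__ == "__main__":
    draft = [5, 9, 2, 7]      # block proposed in parallel by the diffusion pass
    verified = [5, 9, 4, 1]   # AR predictions for the same positions
    print(accept_longest_prefix(draft, verified))
```

In this toy run the first two draft tokens match, the third is replaced by the verifier's token, and the fourth is dropped, so one step emits three tokens instead of one.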
Benchmark Results and Hardware Scaling
The shift to parallel decoding yields distinct throughput advantages over existing open-source architectures. The 8B parameter model achieves 5.9× to 6× higher TPF than Qwen3-8B while raising average accuracy by 1.2%.
In Self-Speculation mode, the model averages 6.82 accepted tokens per draft step, a longer acceptance length than the Eagle3 and MTP drafters.
| Drafter Architecture | Acceptance Length (Tokens) |
|---|---|
| Eagle3 | 2.75 |
| MTP | 4.24 |
| Nemotron Self-Speculation | 6.82 |
Hardware-specific throughput metrics reflect the compute-bound design optimizations. On an NVIDIA GB200, the 8B model hits 850 tok/sec at a batch size of 1, resulting in a 3.3× speedup over AR-only modes. Custom CUDA kernels push this throughput to 1,015 tok/sec. On the DGX Spark platform, using w4a16 quantization, the 8B model reaches 112 tok/sec, which is 2.7× faster than standard AR decoding. This architectural shift complements other recent infrastructure optimizations, such as approaches to cut LLM memory use without quality loss.
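The AR-only baselines are not stated directly, but they can be backed out from the reported throughput and speedup figures. The short calculation below does that arithmetic; the resulting baselines are implied values derived here, not numbers reported by NVIDIA, and `implied_baseline` is a hypothetical helper.

```python
# Reported figures for the 8B model: (tok/sec, speedup over AR-only decoding).
reported = {
    "GB200, batch size 1": (850.0, 3.3),
    "DGX Spark, w4a16": (112.0, 2.7),
}


def implied_baseline(tok_per_sec: float, speedup: float) -> float:
    """AR-only throughput implied by a reported throughput and speedup."""
    return tok_per_sec / speedup


for name, (tps, speedup) in reported.items():
    print(f"{name}: ~{implied_baseline(tps, speedup):.0f} tok/sec implied AR baseline")

# Gain from the custom CUDA kernels relative to the 850 tok/sec figure.
kernel_gain = 1015.0 / 850.0
print(f"Custom CUDA kernels: {kernel_gain:.2f}x over 850 tok/sec")
```

Under these figures, the implied AR-only baselines come out to roughly 258 tok/sec on GB200 and 41 tok/sec on DGX Spark, and the custom kernels add about 19% on top of the 850 tok/sec result.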
Available Variants and Ecosystem Integrations
NVIDIA released dense variants in 3B, 8B, and 14B sizes across Base, Instruct, and Vision-Language (VLM) categories under the NVIDIA Nemotron Open Model License. The 14B model weights ship in BF16 tensor format.
The multimodal variant, Nemotron-Labs-Diffusion-VLM-8B, pairs the 8B tri-mode backbone with a 24-layer vision encoder featuring a 1024-hidden dimension for image-text-to-text tasks.
For deployment, the models are supported in SGLang, vLLM, and Transformers (versions 5.0.0 and higher). Training code and recipes are distributed through the NVIDIA Megatron Bridge framework. If you manage a production cluster, verify that your vLLM deployment is on a release that exposes the parallel decoding parameters.
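Since the stated Transformers requirement is 5.0.0 or higher, a small pre-flight check can catch an outdated environment before a model load fails. The sketch below uses only the standard library; `release_tuple`, `meets_minimum`, and `installed_transformers_ok` are hypothetical helper names, and pre-release builds such as `5.0.0rc1` are treated as their release number for simplicity.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Leading numeric release segment, e.g. '5.1.0.dev0' -> (5, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "5.0.0") -> bool:
    """Compare release numbers, padding with zeros so '5.0' equals '5.0.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_transformers_ok() -> bool:
    """True if the installed transformers package satisfies the minimum."""
    try:
        return meets_minimum(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("transformers >= 5.0.0:", installed_transformers_ok())
```

The same `meets_minimum` helper could be pointed at a vLLM or SGLang version once you know which release you need; the article does not state those minimums, so none are assumed here.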
NVIDIA researchers project that optimized samplers could increase diffusion throughput by an additional 76.5% in future iterations. If you deploy high-throughput inference endpoints today, the Self-Speculation mode on the 8B variant is worth evaluating: its reported acceptance length of 6.82 tokens is roughly 2.5× that of Eagle3 (2.75) and 1.6× that of MTP (4.24), and it requires no VRAM for a secondary drafter model.