NVIDIA Introduces SPEED-Bench for Speculative Decoding
NVIDIA rolled out SPEED-Bench, a benchmark suite and dataset for evaluating speculative decoding across realistic LLM workloads.
NVIDIA introduced SPEED-Bench on March 19 as a public benchmark suite for speculative decoding, with a dataset and measurement framework designed around real serving conditions rather than toy prompt sets. If you run LLM inference in production, the release matters because it changes what a credible speculative decoding evaluation looks like: long contexts, realistic concurrency, diverse tasks, and engine-level measurement across TensorRT-LLM, vLLM, and SGLang.
Benchmark Design
SPEED-Bench has two dataset splits plus a serving-oriented evaluation framework.
The Qualitative split contains 880 prompts, arranged as 11 categories with 80 samples each: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA. Prompt selection was optimized for semantic diversity using embeddings from openai/text-embedding-3-small, which keeps category coverage broader than the narrow prompt collections commonly used for speculative decoding tests.
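NVIDIA does not publish the selection procedure, but diversity-optimized prompt selection over embeddings is commonly done with a greedy max-min pass. The sketch below is an illustrative stand-in (random vectors in place of openai/text-embedding-3-small outputs, and a selection rule assumed rather than taken from SPEED-Bench):

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_diverse(embeddings, k):
    """Greedy max-min selection: repeatedly add the candidate whose
    closest already-selected neighbor is least similar to it."""
    selected = [0]  # seed with the first candidate
    while len(selected) < k:
        best, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in selected:
                continue
        # highest similarity to anything already selected
            score = max(cosine(embeddings[i], embeddings[j]) for j in selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

random.seed(0)
pool = [[random.gauss(0, 1) for _ in range(8)] for _ in range(100)]
picks = select_diverse(pool, 10)
print(len(picks), len(set(picks)))  # 10 distinct prompts
```

The same pass over real prompt embeddings is what drives down average pairwise similarity across categories.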
The Throughput split targets serving behavior under load. It uses 5 input sequence length buckets from 1k to 32k tokens, each holding 1,536 prompts: 512 samples in each of 3 difficulty bands. NVIDIA’s design target also supports batch sizes up to 512.
A key systems choice is that prompts are externalized and pre-tokenized before they reach the inference engine. This removes hidden variation from chat templates, BOS handling, and tokenizer differences, which is exactly the kind of evaluation drift that can quietly corrupt cross-engine comparisons if you do not control for tokenization and prompt formatting.
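The engine-agnostic idea can be sketched with a stub tokenizer (a toy stand-in, not the real model tokenizer; the BOS id and template strings below are illustrative). The point is that template and BOS decisions happen once, offline, so every engine receives byte-identical token ids:

```python
# Stub tokenizer standing in for the target model's tokenizer.
BOS_ID = 1
VOCAB = {}

def encode(text):
    # Toy whitespace tokenizer: assigns a stable id per unique word.
    return [VOCAB.setdefault(word, len(VOCAB) + 2) for word in text.split()]

def pretokenize(prompt, add_bos=True):
    """Apply the chat template and BOS handling exactly once, offline,
    so no engine re-applies its own template or BOS logic at serve time."""
    templated = f"<|user|> {prompt} <|assistant|>"  # illustrative template
    ids = encode(templated)
    return [BOS_ID] + ids if add_bos else ids

# The same prompt now yields identical inputs regardless of engine.
a = pretokenize("Explain speculative decoding.")
b = pretokenize("Explain speculative decoding.")
assert a == b and a[0] == BOS_ID
print(a)
```

Shipping the pre-tokenized ids to TensorRT-LLM, vLLM, and SGLang removes the per-engine formatting variation the benchmark is trying to control.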
What SPEED-Bench Measures
SPEED-Bench focuses on the metrics that decide whether speculative decoding actually helps in production.
It measures conditional acceptance behavior, acceptance length, throughput, output tokens per second, user TPS, and request timing from streaming responses. Those metrics are more useful than isolated decode speed because speculative decoding only pays off when acceptance is high enough to offset validation cost.
For teams already tracking LLM observability or tuning streaming responses, this is the practical takeaway: you need both model-side acceptance metrics and system-side serving metrics; otherwise a drafter can look efficient in isolation while hurting end-to-end latency under concurrency.
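As an illustration of pairing both kinds of metrics, the sketch below derives mean acceptance length from per-step accepted-token counts and a user-visible token rate from stream timestamps. The event format is hypothetical, not SPEED-Bench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DecodeStep:
    accepted: int   # drafted tokens the target model accepted this step
    t: float        # wall-clock time this step's tokens were streamed

def summarize(steps):
    """Model-side acceptance plus system-side serving rate for one stream."""
    # Each speculative step emits the accepted drafts plus 1 target token.
    tokens = [s.accepted + 1 for s in steps]
    mean_acceptance_length = sum(tokens) / len(tokens)
    duration = steps[-1].t - steps[0].t
    user_tps = sum(tokens[1:]) / duration  # rate after the first visible chunk
    return mean_acceptance_length, user_tps

steps = [DecodeStep(2, 0.00), DecodeStep(1, 0.05),
         DecodeStep(3, 0.10), DecodeStep(0, 0.15)]
mal, tps = summarize(steps)
print(round(mal, 2), round(tps, 1))  # → 2.5 46.7
```

A drafter that raises `mean_acceptance_length` but drags down `user_tps` under load is exactly the failure mode the paired view catches.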
Reported Results
The most important result is that benchmark composition changes the conclusion.
At batch size 32 and draft length 3, NVIDIA reports the following mean acceptance lengths and speedups on the qualitative benchmark:
| Target model | Method | Mean acceptance length | Mean speedup |
|---|---|---|---|
| Llama 3.3 70B | N-Gram | 1.41 | 0.88x |
| Llama 3.3 70B | Vanilla SD | 2.44 | 1.60x |
| Llama 3.3 70B | EAGLE3 | 2.44 | 1.90x |
| GPT-OSS 120B | N-Gram | 1.31 | 0.29x |
| GPT-OSS 120B | EAGLE3 | 2.25 | 1.34x |
| DeepSeek R1 | MTP | 2.55 | 1.45x |
| Qwen3 235B | Vanilla SD | 2.43 | 1.17x |
| Qwen3 235B | EAGLE3 | 2.22 | 1.33x |
| Qwen3-Next | MTP | 2.81 | 1.20x |
Two patterns stand out.
First, N-Gram speculation can become a net slowdown at realistic concurrency. NVIDIA explicitly observed this at batch size 32, where acceptance was too low to justify validation overhead.
Second, stronger approaches such as EAGLE3 and MTP still clear 1x speedup in several settings, but the gains are narrower and more conditional than many synthetic benchmarks imply.
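A back-of-envelope model makes both patterns legible. If one speculative step costs roughly c times a plain decode step (drafting plus verifying the draft tokens), the expected speedup is the mean acceptance length divided by c, and speculation only helps when that ratio clears 1. The cost ratio below is illustrative, not NVIDIA's measured cost model:

```python
def expected_speedup(mean_accept_len, step_cost_ratio):
    """Tokens produced per speculative step, divided by that step's cost
    relative to a plain decode step. Below 1.0, speculation is a net loss."""
    return mean_accept_len / step_cost_ratio

# Illustrative: at larger batch sizes, verifying draft tokens gets pricier,
# so a low-acceptance drafter like N-Gram can fall under 1x while a
# higher-acceptance drafter stays above it at the same cost ratio.
print(expected_speedup(1.41, 1.6))  # N-Gram-like acceptance: below 1x
print(expected_speedup(2.44, 1.6))  # EAGLE3-like acceptance: above 1x
```

The same arithmetic explains why gains narrow with concurrency: c grows with batch size while acceptance length does not.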
Domain Effects
SPEED-Bench shows that speculative decoding quality varies materially by task domain.
For Llama 3.3 70B with EAGLE3 at temperature 0, average acceptance length reached 3.00 in Coding and 2.45 in Math, but dropped to 2.04 in Roleplay and 1.71 in Multilingual. Lower-entropy tasks accept longer drafted spans. Higher-entropy tasks do not.
If you build coding assistants, retrieval-heavy systems, or structured generation pipelines, this is where the benchmark becomes operationally useful. Your expected speedup depends on workload mix, not just on the target model and drafter pair. The same lesson shows up in context engineering and RAG system design: workload shape dominates average-case claims.
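Using the Llama 3.3 70B EAGLE3 per-domain figures above, the workload-mix estimate is a plain weighted average. The two traffic mixes below are made-up examples, not SPEED-Bench data:

```python
# Per-domain mean acceptance lengths reported for Llama 3.3 70B + EAGLE3.
acceptance = {"coding": 3.00, "math": 2.45, "roleplay": 2.04, "multilingual": 1.71}

def expected_acceptance(mix):
    """Weighted mean acceptance length for a traffic mix (fractions sum to 1)."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(frac * acceptance[d] for d, frac in mix.items())

# Hypothetical traffic mixes for two different products.
coding_heavy = {"coding": 0.7, "math": 0.2, "roleplay": 0.05, "multilingual": 0.05}
chat_heavy = {"coding": 0.1, "math": 0.1, "roleplay": 0.5, "multilingual": 0.3}
print(round(expected_acceptance(coding_heavy), 2))  # → 2.78
print(round(expected_acceptance(chat_heavy), 2))    # → 2.08
```

The gap between the two mixes is the point: the same model and drafter pair delivers materially different acceptance depending on what your users actually send.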
Throughput Realism
NVIDIA also argues that random-token throughput tests overestimate real-world performance, especially for MoE models.
The GPT-OSS 120B analysis highlights why. Random inputs alter expert routing behavior and distort the number of unique activated experts, which means synthetic inputs can produce throughput numbers that do not resemble production traffic. For inference teams working on MoE serving, this is a direct warning against benchmark shortcuts.
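The routing distortion is easy to reproduce in miniature. The sketch below is a toy simulation, not GPT-OSS's router: skewed token ids (a crude Zipf-like stand-in for real text, which reuses a small head of frequent tokens) activate far fewer unique experts than uniform random ids do:

```python
import random

NUM_EXPERTS = 64

def route(token_id):
    # Toy deterministic router: Knuth multiplicative hash to one expert.
    return (token_id * 2654435761) % NUM_EXPERTS

def unique_experts(token_ids):
    return len({route(t) for t in token_ids})

random.seed(0)
vocab = 50_000
# Real text concentrates on frequent tokens; random inputs draw uniformly.
realish = [int(random.paretovariate(1.2)) % vocab for _ in range(512)]
uniform = [random.randrange(vocab) for _ in range(512)]
print(unique_experts(realish), unique_experts(uniform))
```

Since per-token compute in an MoE model depends on which experts fire, the uniform-input case exercises a wider expert set than production traffic would, which inflates or distorts synthetic throughput numbers.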
The release includes a concrete TensorRT-LLM example using meta-llama/Llama-3.3-70B-Instruct with yuhuili/EAGLE3-LLaMA3.3-Instruct-70B on 8×H100 at concurrency 32. Reported results were 2.4511 average acceptance length, 2518.1464 output TPS, 314.7683 output TPS per GPU, 0.1217 seconds mean TTFT, and 85.3162 mean request generation TPS.
Comparison With Prior Benchmark Practice
NVIDIA positions SPEED-Bench as broader than SpecBench on the dimensions that matter for deployment.
| Benchmark | Data sources | Avg pairwise similarity | Multiturn prompts | Max turns | Long context support |
|---|---|---|---|---|---|
| SPEED-Bench | 24 | 0.14 | 167 | 5 | Up to 32k |
| SpecBench | 5 | 0.22 | 80 | 2 | No 16k to 32k support |
This is the useful distinction. SPEED-Bench is built to expose where speculative decoding breaks under semantic diversity, long context, and concurrency, not just where it looks fast on curated happy-path prompts.
If you evaluate speculative decoding for production, stop using batch-size-1 or random-token tests as your decision point. Use a workload mix that resembles your traffic, pin tokenization and prompt formatting, and tune draft length by concurrency rather than assuming one global optimum.
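That tuning advice can be operationalized as a small sweep: model tokens per step and step cost as functions of draft length and batch size, then pick the draft length that maximizes modeled throughput at each concurrency. The cost model below is a placeholder with invented constants; in practice you would measure these terms on your own traffic:

```python
def tokens_per_step(p, k):
    """Expected tokens per speculative step if each drafted token is accepted
    with probability p: one target token plus a geometric run of drafts."""
    return sum(p ** i for i in range(k + 1))

def modeled_throughput(p, k, batch, verify_cost=0.05):
    # Placeholder cost model: verification overhead grows with both
    # draft length and batch size (illustrative constants, not measured).
    step_cost = 1.0 + verify_cost * k * (1 + batch / 64)
    return tokens_per_step(p, k) / step_cost

def best_draft_len(p, batch, candidates=range(1, 8)):
    return max(candidates, key=lambda k: modeled_throughput(p, k, batch))

# Higher concurrency pushes the optimum toward shorter drafts.
for batch in (1, 32, 256):
    print(batch, best_draft_len(0.7, batch))
```

Even with a crude cost model, the sweep makes the core operational lesson concrete: the optimal draft length is a function of concurrency, not a global constant.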