
NVIDIA Introduces SPEED-Bench for Speculative Decoding

NVIDIA rolled out SPEED-Bench, a benchmark suite and dataset for evaluating speculative decoding across realistic LLM workloads.

NVIDIA introduced SPEED-Bench on March 19 as a public benchmark suite for speculative decoding, with a dataset and measurement framework designed around real serving conditions rather than toy prompt sets. If you run LLM inference in production, the release matters because it changes what a credible speculative decoding evaluation looks like: long contexts, realistic concurrency, diverse tasks, and engine-level measurement across TensorRT-LLM, vLLM, and SGLang.

Benchmark Design

SPEED-Bench has two dataset splits plus a serving-oriented evaluation framework.

The Qualitative split contains 880 prompts, arranged as 11 categories with 80 samples each: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA. Prompt selection was optimized for semantic diversity using embeddings from openai/text-embedding-3-small, which keeps category coverage broader than the narrow prompt collections commonly used for speculative decoding tests.
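NVIDIA has not published the exact selection procedure, but embedding-based diversity selection is commonly done with greedy farthest-point sampling. A minimal sketch of that idea, with random vectors standing in for real text-embedding-3-small embeddings:

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point sampling: repeatedly pick the prompt whose
    embedding is farthest (in cosine distance) from the selected set."""
    n = embeddings.shape[0]
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]                        # seed with the first prompt
    min_dist = np.full(n, np.inf)
    for _ in range(k - 1):
        dist = 1.0 - emb @ emb[selected[-1]]   # distance to newest pick
        min_dist = np.minimum(min_dist, dist)
        min_dist[selected] = -np.inf           # never re-pick a prompt
        selected.append(int(np.argmax(min_dist)))
    return selected

# Stand-in for real prompt embeddings (e.g. text-embedding-3-small).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(500, 64))
picked = select_diverse(fake_embeddings, k=80)   # 80 samples per category
print(len(picked), len(set(picked)))
```

The greedy criterion maximizes the minimum pairwise distance within the selection, which is one plausible way to drive down the average pairwise similarity the benchmark reports.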

The Throughput split targets serving behavior under load. It uses 5 input sequence length buckets from 1k to 32k tokens, each holding 1,536 prompts: 512 samples in each of 3 difficulty bands. The design also targets batch sizes up to 512.

A key systems choice is that prompts are externalized and pre-tokenized before they reach the inference engine. This removes hidden variation from chat templates, BOS handling, and tokenizer differences, which is exactly the kind of evaluation drift that can quietly corrupt cross-engine comparisons if you do not control for tokenization and prompt formatting.
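The pre-tokenization step can be sketched as a small offline pass. The stub tokenizer below is a placeholder (a real run would use the target model's tokenizer with its chat template applied), but the point survives: formatting is frozen once, before any engine sees the prompt, so every engine consumes identical token IDs.

```python
import json

def pretokenize(prompts, tokenize, out_path):
    """Freeze prompt formatting offline: every engine later consumes the
    exact same token IDs, so chat-template and BOS differences cannot
    leak into a cross-engine comparison."""
    records = []
    for p in prompts:
        ids = tokenize(p)                 # one tokenizer, applied once
        records.append({"prompt": p, "input_ids": ids,
                        "num_tokens": len(ids)})
    with open(out_path, "w") as f:
        json.dump(records, f)
    return records

# Stand-in tokenizer; a real pipeline would load the target model's
# tokenizer (e.g. via transformers) instead of this byte hack.
toy = lambda text: [ord(c) % 256 for c in text]
recs = pretokenize(["What is 2+2?", "Summarize this article."],
                   toy, "prompts.json")
print(recs[0]["num_tokens"])
```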

What SPEED-Bench Measures

SPEED-Bench focuses on the metrics that decide whether speculative decoding actually helps in production.

It measures conditional acceptance behavior, acceptance length, throughput, output tokens per second, user TPS, and request timing from streaming responses. Those metrics are more useful than isolated decode speed because speculative decoding only pays off when acceptance is high enough to offset validation cost.
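That trade-off can be made concrete with a simplified cost model (an illustration, not NVIDIA's exact accounting): mean acceptance length averages the drafted tokens accepted per verification step, and speedup only clears 1x when the tokens emitted per step outweigh the combined draft-plus-verify cost.

```python
def mean_acceptance_length(accepted_per_step):
    """Average drafted tokens accepted per target-model verify step."""
    return sum(accepted_per_step) / len(accepted_per_step)

def estimated_speedup(accept_len, draft_len, draft_cost_ratio=0.1):
    """Simplified model: each step emits accept_len + 1 tokens (accepted
    drafts plus the target model's own token) and costs one verify pass
    plus draft_len cheap drafter passes. draft_cost_ratio is a made-up
    relative drafter cost, not a measured value."""
    tokens_per_step = accept_len + 1
    cost_per_step = 1.0 + draft_len * draft_cost_ratio
    return tokens_per_step / cost_per_step

acc = mean_acceptance_length([3, 2, 1, 3, 2])   # per-step acceptance trace
print(round(acc, 2), round(estimated_speedup(acc, draft_len=3), 2))
```

With low acceptance the numerator shrinks while the draft cost stays fixed, which is exactly how a drafter becomes a net slowdown.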

For teams already tracking LLM observability or tuning streaming responses, this is the practical shift. You need both model-side acceptance metrics and system-side serving metrics, otherwise a drafter can look efficient in isolation while hurting end-to-end latency under concurrency.

Reported Results

The most important result is that benchmark composition changes the conclusion.

At batch size 32 and draft length 3, NVIDIA reports the following mean acceptance lengths and speedups on the qualitative benchmark:

| Target model | Method | Mean acceptance length | Mean speedup |
| --- | --- | --- | --- |
| Llama 3.3 70B | N-Gram | 1.41 | 0.88x |
| Llama 3.3 70B | Vanilla SD | 2.44 | 1.60x |
| Llama 3.3 70B | EAGLE3 | 2.44 | 1.90x |
| GPT-OSS 120B | N-Gram | 1.31 | 0.29x |
| GPT-OSS 120B | EAGLE3 | 2.25 | 1.34x |
| DeepSeek R1 | MTP | 2.55 | 1.45x |
| Qwen3 235B | Vanilla SD | 2.43 | 1.17x |
| Qwen3 235B | EAGLE3 | 2.22 | 1.33x |
| Qwen3-Next | MTP | 2.81 | 1.20x |

Two patterns stand out.

First, N-Gram speculation can become a net slowdown at realistic concurrency. NVIDIA explicitly observed this at batch size 32, where acceptance was too low to justify validation overhead.

Second, stronger approaches such as EAGLE3 and MTP still clear 1x speedup in several settings, but the gains are narrower and more conditional than many synthetic benchmarks imply.

Domain Effects

SPEED-Bench shows that speculative decoding quality varies materially by task domain.

For Llama 3.3 70B with EAGLE3 at temperature 0, average acceptance length reached 3.00 in Coding and 2.45 in Math, but dropped to 2.04 in Roleplay and 1.71 in Multilingual. Lower-entropy tasks accept longer drafted spans. Higher-entropy tasks do not.

If you build coding assistants, retrieval-heavy systems, or structured generation pipelines, this is where the benchmark becomes operationally useful. Your expected speedup depends on workload mix, not just on the target model and drafter pair. The same lesson shows up in context engineering and RAG system design: workload shape dominates average-case claims.
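The per-domain acceptance lengths above make this concrete: weighting them by your traffic mix (the mix below is invented for illustration) gives the workload-level expectation that actually predicts your speedup.

```python
# Per-domain acceptance lengths for Llama 3.3 70B + EAGLE3 at
# temperature 0, as reported in the article.
acceptance = {"Coding": 3.00, "Math": 2.45,
              "Roleplay": 2.04, "Multilingual": 1.71}

# Hypothetical traffic mix for a coding-heavy product (an assumption).
traffic_mix = {"Coding": 0.5, "Math": 0.2,
               "Roleplay": 0.2, "Multilingual": 0.1}

expected = sum(traffic_mix[d] * acceptance[d] for d in traffic_mix)
print(round(expected, 3))
```

Shift the same weights toward Multilingual and the expectation drops well below the Coding figure, which is why a single headline acceptance number is not decision-grade.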

Throughput Realism

NVIDIA also argues that random-token throughput tests overestimate real-world performance, especially for MoE models.

The GPT-OSS 120B analysis highlights why. Random inputs alter expert routing behavior and distort the number of unique activated experts, which means synthetic inputs can produce throughput numbers that do not resemble production traffic. For inference teams working on MoE serving, this is a direct warning against benchmark shortcuts.
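A toy simulation illustrates the mechanism (the routing model here is an assumption for illustration, not GPT-OSS's actual router): random tokens approximate uniform router logits and touch almost every expert, while natural traffic concentrates top-k routing on a few hot experts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, batch_tokens = 128, 2, 256

def unique_experts(logits):
    """Count distinct experts chosen by top-k routing across a batch."""
    top = np.argsort(logits, axis=1)[:, -top_k:]
    return len(np.unique(top))

# Random tokens ~ uniform routing: every expert is equally likely.
uniform_logits = rng.normal(size=(batch_tokens, n_experts))

# Natural traffic ~ skewed routing: a handful of experts dominate.
bias = np.zeros(n_experts)
bias[:8] = 5.0
skewed_logits = rng.normal(size=(batch_tokens, n_experts)) + bias

print(unique_experts(uniform_logits), unique_experts(skewed_logits))
```

More unique activated experts means more weight traffic per step, so the random-input run measures a different hardware regime than production serving.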

The release includes a concrete TensorRT-LLM example using meta-llama/Llama-3.3-70B-Instruct with yuhuili/EAGLE3-LLaMA3.3-Instruct-70B on 8×H100 at concurrency 32. Reported results were 2.4511 average acceptance length, 2518.1464 output TPS, 314.7683 output TPS per GPU, 0.1217 seconds mean TTFT, and 85.3162 mean request generation TPS.
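The serving-side numbers in that example can be derived from streaming timestamps. A minimal sketch with an invented token-timestamp trace: TTFT is the first-token arrival minus request start, and per-request generation TPS divides the tokens generated after the first by the generation window.

```python
def request_metrics(t_start, token_timestamps):
    """Derive streaming metrics: TTFT and per-request generation TPS."""
    ttft = token_timestamps[0] - t_start
    gen_window = token_timestamps[-1] - token_timestamps[0]
    gen_tps = ((len(token_timestamps) - 1) / gen_window
               if gen_window > 0 else 0.0)
    return ttft, gen_tps

# Hypothetical trace: request starts at t=0.0s, five tokens stream in.
ts = [0.12, 0.14, 0.16, 0.18, 0.20]
ttft, tps = request_metrics(0.0, ts)
print(round(ttft, 2), round(tps, 1))
```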

Comparison With Prior Benchmark Practice

NVIDIA positions SPEED-Bench as broader than SpecBench on the dimensions that matter for deployment.

| Benchmark | Data sources | Avg pairwise similarity | Multiturn prompts | Max turns | Long context support |
| --- | --- | --- | --- | --- | --- |
| SPEED-Bench | 24 | 0.14 | 16 | 75 | Up to 32k |
| SpecBench | 5 | 0.22 | 80 | 2 | No 16k to 32k support |

This is the useful distinction. SPEED-Bench is built to expose where speculative decoding breaks under semantic diversity, long context, and concurrency, not just where it looks fast on curated happy-path prompts.

If you evaluate speculative decoding for production, stop using batch-size-1 or random-token tests as your decision point. Use a workload mix that resembles your traffic, pin tokenization and prompt formatting, and tune draft length by concurrency rather than assuming one global optimum.
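Tuning draft length by concurrency can be sketched with a toy cost model (the cost assumptions below are mine, not NVIDIA's): at low batch the verify pass is memory-bound and extra drafted tokens are nearly free, while at high batch it becomes compute-bound and each drafted token adds real verification cost, so the optimum shrinks.

```python
def throughput_proxy(accept_rate, draft_len, batch,
                     draft_cost=0.05, max_batch=256):
    """Toy cost model (an assumption): expected accepted drafts follow a
    geometric acceptance pattern, and per-draft-token verification cost
    scales from ~free (small batch) to full cost (compute-bound batch)."""
    expected_accept = sum(accept_rate ** i for i in range(1, draft_len + 1))
    verify_weight = batch / max_batch          # 0 (free) .. 1 (full cost)
    step_cost = 1.0 + draft_len * (verify_weight + draft_cost)
    return (expected_accept + 1) / step_cost   # tokens per unit cost

def best_draft_len(accept_rate, batch, max_len=8):
    return max(range(1, max_len + 1),
               key=lambda d: throughput_proxy(accept_rate, d, batch))

for batch in (1, 32, 256):
    print(batch, best_draft_len(accept_rate=0.7, batch=batch))
```

Even this crude model reproduces the qualitative pattern in the results: the draft length that wins at batch size 1 is a net loss at batch size 256.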
