How to Optimize MoE Inference with Warp Decode
Learn how Cursor's warp decode technique uses GPU kernel optimizations and warp-level primitives to achieve 300+ tokens per second on Blackwell hardware.
Cursor’s new warp decode kernel optimization lets you run Mixture-of-Experts models at over 300 tokens per second on Blackwell hardware. It eliminates the memory bandwidth bottlenecks that typically slow down token routing in frontier coding models. This guide covers how the custom GPU kernels bypass standard memory hierarchies, the integration of MXFP8 compression, and the pricing structures for deploying these optimizations in production.
The Mixture-of-Experts Latency Problem
Running inference on Mixture-of-Experts models introduces severe memory bandwidth constraints. Each token passing through the model must be routed to specific feed-forward network layers called experts. This routing requires constant data movement between the global memory of the GPU and its registers.
Standard implementations process this movement using generalized CUDA libraries like cuBLAS. The reliance on these libraries creates latency overhead during the routing phase. Token generation slows down because the compute units sit idle waiting for data to traverse the memory hierarchy.
These architectural bottlenecks limit total tokens per second regardless of raw compute power. Anyone deploying MoE architectures must address this memory stall time to achieve frontier-level speeds.
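To make the routing step concrete, here is a generic top-k gating router in NumPy. This is an illustrative sketch of standard MoE routing, not Cursor's code; the expert count, `top_k` value, and function names are assumptions. The point is that every expert selected here implies streaming that expert's feed-forward weights out of global memory for its assigned tokens, which is exactly where the bandwidth stall occurs.

```python
import numpy as np

def route_tokens(hidden: np.ndarray, gate_w: np.ndarray, top_k: int = 2):
    """Score each token against every expert and pick the top-k.

    hidden: (num_tokens, d_model) token activations
    gate_w: (d_model, num_experts) router weights
    Returns (expert_ids, weights): chosen experts per token and their
    normalized gating weights.
    """
    logits = hidden @ gate_w                                  # (tokens, experts)
    expert_ids = np.argsort(-logits, axis=-1)[:, :top_k]      # top-k expert indices
    top_logits = np.take_along_axis(logits, expert_ids, axis=-1)
    # Softmax over only the selected experts' logits.
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return expert_ids, weights

rng = np.random.default_rng(0)
ids, w = route_tokens(rng.normal(size=(4, 8)), rng.normal(size=(8, 16)))
```

Each of the 4 tokens ends up assigned to 2 of the 16 hypothetical experts, and during decode each assignment forces a fresh pull of that expert's weights through the memory hierarchy.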
Rewriting the MoE Kernel
Warp decode solves the routing bottleneck through a complete rewrite of the MoE layer at the GPU kernel level. The implementation removes all dependencies on standard CUDA libraries.
The custom kernels rely exclusively on warp-level primitives. A warp is a group of 32 threads executing in lockstep on an NVIDIA GPU. Warp decode lets these threads exchange register data directly with one another.
Direct warp communication bypasses both shared memory and global memory structures entirely. Tokens move through the expert routing phase without triggering memory stall time. The elimination of this data travel translates directly into sustained throughput during the forward pass.
Integrating MXFP8 Compression
MoE models often suffer from a performance degradation sometimes described as "death by a thousand quantizations." This occurs when the inference engine constantly dequantizes data during the forward pass to perform calculations. The repeated format conversion consumes valuable processing cycles.
Warp decode prevents this by integrating Microscaled FP8 (MXFP8) kernels directly into the decoding loop. The model remains in its compressed format for much longer periods. Calculations happen closer to the native compressed state.
Bypassing continuous dequantization cycles preserves the speed gains generated by the warp-level communication. You can read more about standard quantization strategies to understand how data formats dictate inference speeds.
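The core idea of the MX format, blocks of 32 elements sharing one power-of-two scale, can be sketched as follows. This is a simplified model for illustration only: the shared scale mirrors MXFP8's E8M0 block exponent, but the per-element step only clamps to the FP8 E4M3 range and keeps float32 mantissas, so it understates real quantization error. The function names and block layout are assumptions, not Cursor's kernels.

```python
import numpy as np

BLOCK = 32          # MX block size: 32 elements share one scale
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude

def mx_quantize(x: np.ndarray):
    """Quantize a 1-D array in 32-element blocks with shared 2^e scales."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared scale: smallest power of two mapping amax into the E4M3 range.
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / E4M3_MAX))
    scale = 2.0 ** exp
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)   # element values (FP8-range)
    return q, scale

def mx_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

x = np.linspace(-10.0, 10.0, 64)
q, s = mx_quantize(x)
x_hat = mx_dequantize(q, s)
```

Because the scale is a power of two, rescaling is an exponent adjustment rather than a full multiply, which is part of why compute can stay close to the compressed representation instead of round-tripping through full precision.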
Blackwell Hardware Optimization
Traditional dequantization methods lose their efficiency on the newest generation of NVIDIA hardware. Hopper architectures handled standard memory movement well, but Blackwell introduces a different memory paradigm.
Blackwell GPUs feature a Tensor Memory (TMEM) architecture. The warp decode implementation is explicitly designed to leverage TMEM for MoE execution. Standard MegaBlocks implementations fail to utilize this specialized memory structure effectively.
On a B200 GPU, warp decode executes the MoE layer forward pass 3.5x faster than a standard MegaBlocks setup. This kernel-level optimization is a necessary shift for teams scaling AI inference on Blackwell hardware.
Benchmarks and Performance
The combination of TMEM utilization, MXFP8 integration, and warp-level primitives results in a 1.5x to 2x end-to-end speedup in production environments.
The optimizations power the Composer 2 fast variant, pushing its generation speed past 300 tokens per second. This speed facilitates near-instantaneous code generation for complex programming tasks.
These inference improvements correlate with high benchmark scores. The Composer 2 implementation secured the number one rank on Terminal-Bench 2.0, an agent evaluation framework maintained by the Laude Institute. It also secured a top-three placement on SWE-bench Verified.
Real-Time RL Training
The underlying models receive continuous updates through a process called real-time RL. The training infrastructure collects billions of tokens from user interactions.
This interaction data feeds directly back into the training loop. The system ships an improved model checkpoint as frequently as every five hours. The warp decode kernels ensure that these rapid checkpoints load and execute efficiently in the production environment.
Model Variants and Pricing
Cursor exposes these optimizations through two specific API tiers for Composer 2. The standard model uses traditional routing, while the fast variant runs exclusively on the warp decode architecture.
| Model Variant | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| Composer 2 | $0.50 | $2.50 |
| Composer 2 fast | $1.50 | $7.50 |
The fast variant costs three times as much as the standard model. The price premium covers the specific B200 hardware requirements and the advanced kernel routing required to sustain 300 TPS.
Evaluate your application’s latency requirements before adopting the fast variant. Route highly interactive, user-facing coding tasks to Composer 2 fast, and reserve the standard Composer 2 tier for asynchronous code generation workloads.
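A quick way to compare the tiers is to compute per-request cost from the table above. The prices are taken directly from the table; the model-key strings and a 20k-input / 2k-output request shape are illustrative assumptions.

```python
# Back-of-the-envelope cost comparison for the two Composer 2 tiers,
# using the per-million-token prices from the table above.
PRICING = {
    "composer-2":      {"input": 0.50, "output": 2.50},
    "composer-2-fast": {"input": 1.50, "output": 7.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1M-token rates."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a coding request with 20k input tokens and 2k generated tokens.
standard = request_cost("composer-2", 20_000, 2_000)       # $0.015
fast = request_cost("composer-2-fast", 20_000, 2_000)      # $0.045
```

Because both the input and output rates scale by exactly 3x, the fast tier costs three times as much for any token mix, so the decision reduces purely to how much the latency is worth for a given workload.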