How to Optimize MoE Inference with Warp Decode
Learn how Cursor's warp decode technique uses GPU kernel optimizations and warp-level primitives to achieve 300+ tokens per second on Blackwell hardware.
Cursor’s new warp decode kernel optimization lets you run Mixture-of-Experts models at over 300 tokens per second on Blackwell hardware. It eliminates the memory bandwidth bottlenecks that typically slow down token routing in frontier coding models. This guide covers how the custom GPU kernels bypass standard memory hierarchies, the integration of MXFP8 compression, and the pricing structures for deploying these optimizations in production.
The Mixture-of-Experts Latency Problem
Running inference on Mixture-of-Experts models introduces severe memory bandwidth constraints. Each token passing through the model must be routed to specific feed-forward network layers called experts. This routing requires constant data movement between the global memory of the GPU and its registers.
Standard implementations process this movement using generalized CUDA libraries like cuBLAS. The reliance on these libraries creates latency overhead during the routing phase. Token generation slows down because the compute units sit idle waiting for data to traverse the memory hierarchy.
These architectural bottlenecks limit total tokens per second regardless of raw compute power. Anyone deploying MoE architectures must address this memory stall time to achieve frontier-level speeds.
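To make the routing step concrete, here is a generic top-k gating router in NumPy. This is an illustrative sketch of standard MoE routing, not Cursor's code; the expert count, `top_k` value, and function names are assumptions. The point is that every expert selected here implies streaming that expert's feed-forward weights out of global memory for its assigned tokens, which is exactly where the bandwidth stall occurs.

```python
import numpy as np

def route_tokens(hidden: np.ndarray, gate_w: np.ndarray, top_k: int = 2):
    """Score each token against every expert and pick the top-k.

    hidden: (num_tokens, d_model) token activations
    gate_w: (d_model, num_experts) router weights
    Returns (expert_ids, weights): chosen experts per token and their
    normalized gating weights.
    """
    logits = hidden @ gate_w                                  # (tokens, experts)
    expert_ids = np.argsort(-logits, axis=-1)[:, :top_k]      # top-k expert indices
    top_logits = np.take_along_axis(logits, expert_ids, axis=-1)
    # Softmax over only the selected experts' logits.
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return expert_ids, weights

rng = np.random.default_rng(0)
ids, w = route_tokens(rng.normal(size=(4, 8)), rng.normal(size=(8, 16)))
```

Each of the 4 tokens ends up assigned to 2 of the 16 hypothetical experts, and during decode each assignment forces a fresh pull of that expert's weights through the memory hierarchy.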
Rewriting the MoE Kernel
Warp decode solves the routing bottleneck through a complete rewrite of the MoE layer at the GPU kernel level. The implementation removes all dependencies on standard CUDA libraries.
The custom kernels rely exclusively on warp-level primitives. A warp is a group of 32 threads executing in lockstep on an NVIDIA GPU. Warp decode lets these threads exchange register data directly with one another.
Direct warp communication bypasses both shared memory and global memory structures entirely. Tokens move through the expert routing phase without triggering memory stall time. The elimination of this data travel translates directly into sustained throughput during the forward pass.
Integrating MXFP8 Compression
MoE models often suffer from a performance degradation sometimes described as "death by a thousand quantizations." This occurs when the inference engine constantly dequantizes data during the forward pass to perform calculations. The repeated format conversion consumes valuable processing cycles.
Warp decode prevents this by integrating Microscaled FP8 (MXFP8) kernels directly into the decoding loop. The model remains in its compressed format for much longer periods. Calculations happen closer to the native compressed state.
Bypassing continuous dequantization cycles preserves the speed gains generated by the warp-level communication. You can read more about standard quantization strategies to understand how data formats dictate inference speeds.
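The core idea of the MX format, blocks of 32 elements sharing one power-of-two scale, can be sketched as follows. This is a simplified model for illustration only: the shared scale mirrors MXFP8's E8M0 block exponent, but the per-element step only clamps to the FP8 E4M3 range and keeps float32 mantissas, so it understates real quantization error. The function names and block layout are assumptions, not Cursor's kernels.

```python
import numpy as np

BLOCK = 32          # MX block size: 32 elements share one scale
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude

def mx_quantize(x: np.ndarray):
    """Quantize a 1-D array in 32-element blocks with shared 2^e scales."""
    x = x.reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared scale: smallest power of two mapping amax into the E4M3 range.
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / E4M3_MAX))
    scale = 2.0 ** exp
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)   # element values (FP8-range)
    return q, scale

def mx_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

x = np.linspace(-10.0, 10.0, 64)
q, s = mx_quantize(x)
x_hat = mx_dequantize(q, s)
```

Because the scale is a power of two, rescaling is an exponent adjustment rather than a full multiply, which is part of why compute can stay close to the compressed representation instead of round-tripping through full precision.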
Blackwell Hardware Optimization
Traditional dequantization methods lose their efficiency on the newest generation of NVIDIA hardware. Hopper architectures handled standard memory movement well, but Blackwell introduces a different memory paradigm.
Blackwell GPUs feature a Tensor Memory (TMEM) architecture. The warp decode implementation is explicitly designed to leverage TMEM for MoE execution. Standard MegaBlocks implementations fail to utilize this specialized memory structure effectively.
On a B200 GPU, warp decode executes the MoE layer forward pass 3.5x faster than a standard MegaBlocks setup. This kernel-level optimization is a necessary shift for teams scaling AI inference on Blackwell hardware.
Benchmarks and Performance
The combination of TMEM utilization, MXFP8 integration, and warp-level primitives results in a 1.5x to 2x end-to-end speedup in production environments.
The optimizations power the Composer 2 fast variant, pushing its generation speed past 300 tokens per second. This speed facilitates near-instantaneous code generation for complex programming tasks.
These inference improvements correlate with high benchmark scores. The Composer 2 implementation secured the number one rank on Terminal-Bench 2.0, an agent evaluation framework maintained by the Laude Institute. It also secured a top-three placement on SWE-bench Verified.
Real-Time RL Training
The underlying models receive continuous updates through a process called real-time RL. The training infrastructure collects billions of tokens from user interactions.
This interaction data feeds directly back into the training loop. The system ships an improved model checkpoint as frequently as every five hours. The warp decode kernels ensure that these rapid checkpoints load and execute efficiently in the production environment.
Model Variants and Pricing
Cursor exposes these optimizations through two specific API tiers for Composer 2. The standard model uses traditional routing, while the fast variant runs exclusively on the warp decode architecture.
| Model Variant | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|
| Composer 2 | $0.50 | $2.50 |
| Composer 2 fast | $1.50 | $7.50 |
The fast variant costs three times as much as the standard model. The price premium covers the specific B200 hardware requirements and the advanced kernel routing required to sustain 300 TPS.
Evaluate your application’s latency requirements before adopting the fast variant. Route highly interactive, user-facing coding tasks to Composer 2 fast, and reserve the standard Composer 2 tier for asynchronous code generation workloads.
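A quick way to compare the tiers is to compute per-request cost from the table above. The prices are taken directly from the table; the model-key strings and a 20k-input / 2k-output request shape are illustrative assumptions.

```python
# Back-of-the-envelope cost comparison for the two Composer 2 tiers,
# using the per-million-token prices from the table above.
PRICING = {
    "composer-2":      {"input": 0.50, "output": 2.50},
    "composer-2-fast": {"input": 1.50, "output": 7.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-1M-token rates."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a coding request with 20k input tokens and 2k generated tokens.
standard = request_cost("composer-2", 20_000, 2_000)       # $0.015
fast = request_cost("composer-2-fast", 20_000, 2_000)      # $0.045
```

Because both the input and output rates scale by exactly 3x, the fast tier costs three times as much for any token mix, so the decision reduces purely to how much the latency is worth for a given workload.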