Cursor Agents Boost CUDA Kernel Speed by 38% on NVIDIA Blackwell
A new multi-agent system from Cursor achieves massive performance gains on NVIDIA Blackwell GPUs by autonomously optimizing complex CUDA kernels.
Cursor and NVIDIA applied a specialized multi-agent system to write and optimize CUDA kernels for NVIDIA Blackwell 200 GPUs, achieving a 38% geometric mean speedup over existing baselines. Detailed in a joint research post published on April 14, 2026, the system operated autonomously for three weeks on 235 problems generated via SOL-ExecBench. If you write low-level GPU code for AI inference or training, this development shifts the economics of kernel optimization from months of manual tuning to weeks of automated profiling.
Performance Benchmarks
The evaluation targeted the long tail of unoptimized kernels that bottleneck standard AI workloads. The multi-agent harness outperformed the baselines on the majority of the 235-problem test suite.
| Metric | Result |
|---|---|
| Overall Geomean Speedup | 38% |
| Baselines Outperformed | 149 of 235 problems (63%) |
| High-Impact Gains (>2x speedup) | 45 problems (19%) |
| Custom GEMM vs cuBLAS | 86% of manual cuBLAS performance |
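The headline 38% figure is a geometric mean over per-problem speedup ratios, an aggregation that keeps a handful of large wins from dominating the average. A minimal sketch of how such a number is computed (the sample ratios below are illustrative, not from the study):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-problem ratios: mostly modest gains, one large win,
# one slight regression -- the geomean balances them.
ratios = [1.10, 1.00, 2.50, 1.30, 0.95]
print(f"{(geomean_speedup(ratios) - 1) * 100:.1f}% geomean speedup")
```

Note that a 2.5x outlier moves the geometric mean far less than it would move an arithmetic mean, which is why the metric is standard for benchmark suites.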
The most notable single-task result involved generating a custom CUDA C++ General Matrix Multiply (GEMM) kernel entirely from scratch. The agent-generated code reached 86% of the performance of NVIDIA’s manually tuned cuBLAS library.
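The agent's kernel itself is CUDA C++ tuned to Blackwell's memory hierarchy and is not reproduced in the source. The core idea behind any fast GEMM, however, is tiling: computing the output one block at a time so each block's operands stay in fast memory (shared memory on a GPU, cache on a CPU). A NumPy sketch of that blocking structure, with illustrative tile size and shapes:

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Tiled matrix multiply: accumulate C one tile at a time so the
    working set of each step fits in fast memory."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # One tile of C accumulates a tile of A times a tile of B.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 160)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

A production CUDA GEMM layers many more tricks on top of this skeleton (shared-memory staging, tensor-core instructions, software pipelining), which is exactly the search space the agents explored.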
Autonomous Profiling on Blackwell
The architecture uses long-running multi-agent workflows in which multiple agents coordinate, verify one another's output, and iteratively profile code against direct hardware signals. This feedback loop lets the system operate near the assembler level, discovering memory access patterns and hardware-specific instruction scheduling tailored to the Blackwell 200 architecture.
The collaboration successfully optimized highly specialized operations for MoE inference. Key targets included NVFP4 MoE Linear with Gating, BF16 Grouped Query Attention with Paged Prefill, and standard BF16 Matrix Multiplication. The system relies on repeated execution and validation rather than pure zero-shot generation.
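The post does not publish the harness code, but the pattern it describes, generate candidates, execute them, validate against a reference, profile, and keep the winner, can be sketched in miniature. All names and the toy workload here are hypothetical:

```python
import time

def select_kernel(candidates, make_inputs, reference, trials=5):
    """Hypothetical execute-and-validate loop: discard any candidate whose
    output diverges from the reference, then keep the fastest valid one."""
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        args = make_inputs()
        if fn(*args) != reference(*args):  # validate before timing
            continue
        start = time.perf_counter()
        for _ in range(trials):
            fn(*args)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn, best_time

# Toy example: two candidate "kernels" for summing squares.
correct = lambda xs: sum(x * x for x in xs)
broken = lambda xs: sum(xs)  # wrong answer: rejected, never timed
best, _ = select_kernel([broken, correct], lambda: ([1, 2, 3, 4] * 500,), correct)
assert best is correct
```

The real system replaces wall-clock timing with hardware profiling signals and the reference check with kernel-level numerical validation, but the loop structure, run, check, measure, iterate, is the same.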
The Multi-Agent Harness
Cursor originally built this multi-agent harness for complex software engineering tasks, previously using it to write a web browser comprising over one million lines of code and to migrate complex React and Solid codebases. The pivot to low-level CUDA optimization demonstrates the versatility of the underlying infrastructure.
The kernel optimization project follows the recent launch of Cursor 3, which introduced a unified workspace built around the new Composer 2 model variant. The joint research team included Wilson Lin from Cursor alongside Sahil Modi, Yuan Zhang, and Edward Lin from NVIDIA. They successfully compressed what typically requires months of expert manual tuning into a three-week autonomous run.
Teams building custom models or deploying bespoke inference pipelines frequently face a performance bottleneck when default kernels fail to scale on new hardware architectures. You should evaluate whether your latency constraints require dedicated CUDA C++ engineering or if a multi-agent profiling harness can automate the required tiling and scheduling optimizations for your specific target hardware.