
Cursor Agents Boost CUDA Kernel Speed by 38% on NVIDIA Blackwell

A new multi-agent system from Cursor achieves a 38% geometric-mean speedup on NVIDIA Blackwell GPUs by autonomously writing and optimizing complex CUDA kernels.

Cursor and NVIDIA applied a specialized multi-agent system to write and optimize CUDA kernels for NVIDIA Blackwell B200 GPUs, achieving a 38% geometric-mean speedup over existing baselines. Detailed in a joint research post published on April 14, 2026, the system ran autonomously for three weeks on 235 problems generated via SOL-ExecBench. If you write low-level GPU code for AI inference or training, this development shifts the economics of kernel optimization from months of manual tuning to weeks of automated profiling.

Performance Benchmarks

The evaluation targeted the long tail of unoptimized kernels that heavily bottleneck standard AI workloads. The multi-agent harness outperformed standard baselines on the majority of the test suite.

| Metric | Result |
| --- | --- |
| Overall geomean speedup | 38% |
| Baselines outperformed | 149 of 235 problems (63%) |
| High-impact gains (>2x speedup) | 45 problems (19%) |
| Custom GEMM vs. cuBLAS | 86% of manual cuBLAS performance |
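For context on the headline number: a geometric mean aggregates per-problem speedup ratios, so a few large wins can't mask regressions the way an arithmetic mean can. A minimal sketch of the calculation (the per-problem ratios below are illustrative placeholders, not the actual benchmark results):

```python
from math import exp, log

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedup ratios
    (baseline time divided by candidate time)."""
    return exp(sum(log(s) for s in speedups) / len(speedups))

# Illustrative ratios only -- values below 1.0 are regressions.
ratios = [0.9, 1.1, 1.3, 1.6, 2.4]
print(f"{(geomean_speedup(ratios) - 1) * 100:.0f}% geomean speedup")
```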

The most notable single-task result involved generating a custom CUDA C++ General Matrix Multiply (GEMM) kernel entirely from scratch. The agent-generated code reached 86% of the performance of NVIDIA’s manually tuned cuBLAS library.
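The performance of a GEMM kernel hinges largely on tiling: partitioning the matrices into blocks small enough to stay resident in fast memory (shared memory on a GPU, cache on a CPU). The research post does not publish the generated kernel, but the structural idea can be sketched in plain Python, where the `tile` parameter plays the role of the block size a kernel author, or an agent, would tune:

```python
def matmul_blocked(A, B, tile=32):
    """Blocked (tiled) matrix multiply: C = A @ B.

    The tile size controls how much of A and B is reused while "hot";
    on a GPU the analogous knob governs shared-memory tile shapes.
    CPU-side illustration only, not the agent-generated CUDA kernel.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C
```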

Autonomous Profiling on Blackwell

The architecture uses long-running multi-agent workflows in which multiple agents coordinate, verify one another's output, and iteratively profile code against direct hardware signals. This feedback loop lets the system operate close to the assembly level: the agents discover memory access patterns and instruction scheduling specific to the Blackwell B200 architecture.
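The post does not describe the harness internals, but the core of any such profile-edit-validate loop is simple: time each candidate against the incumbent and keep only what measurably wins. A hedged sketch with hypothetical names (`profile`, `optimize` are illustrative, not Cursor's API), using wall-clock medians as a stand-in for real hardware counters:

```python
import time

def profile(kernel, args, reps=25):
    """Median wall-clock time over several runs; a crude stand-in
    for the hardware-counter signals a real harness would read."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(*args)
        times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]

def optimize(baseline, candidates, args):
    """Keep the fastest candidate that beats the baseline --
    the skeleton of an iterative profiling loop."""
    best_fn, best_t = baseline, profile(baseline, args)
    for fn in candidates:
        t = profile(fn, args)
        if t < best_t:
            best_fn, best_t = fn, t
    return best_fn
```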

The collaboration successfully optimized highly specialized operations for MoE inference. Key targets included NVFP4 MoE Linear with Gating, BF16 Grouped Query Attention with Paged Prefill, and standard BF16 Matrix Multiplication. The system relies on repeated execution and validation rather than pure zero-shot generation.
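Execution-based validation matters because a faster kernel is worthless if it is wrong, and reduced-precision formats like BF16 or NVFP4 rule out exact comparison against a reference. A minimal sketch of the kind of tolerance check such a harness would need (function name and tolerance are illustrative assumptions):

```python
def validate(candidate_out, reference_out, rel_tol=1e-2):
    """Accept a candidate kernel's output only if every element matches
    the reference within a relative tolerance -- loose enough to absorb
    BF16-style rounding, tight enough to catch real bugs."""
    if len(candidate_out) != len(reference_out):
        return False
    for c, r in zip(candidate_out, reference_out):
        denom = max(abs(r), 1e-12)  # guard against division by zero
        if abs(c - r) / denom > rel_tol:
            return False
    return True
```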

The Multi-Agent Harness

Cursor originally built this multi-agent harness for complex software engineering tasks, previously using it to write a web browser comprising over one million lines of code and to migrate complex React and Solid codebases. The pivot to low-level CUDA optimization demonstrates the versatility of the underlying infrastructure.

The kernel optimization project follows the recent launch of Cursor 3, which introduced a unified workspace built around the new Composer 2 model variant. The joint research team included Wilson Lin from Cursor alongside Sahil Modi, Yuan Zhang, and Edward Lin from NVIDIA. They successfully compressed what typically requires months of expert manual tuning into a three-week autonomous run.

Teams building custom models or deploying bespoke inference pipelines frequently face a performance bottleneck when default kernels fail to scale on new hardware architectures. You should evaluate whether your latency constraints require dedicated CUDA C++ engineering or if a multi-agent profiling harness can automate the required tiling and scheduling optimizations for your specific target hardware.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
