Cursor Composer 2.5 Hits 79.8% on SWE-bench Multilingual
Cursor released Composer 2.5, an agentic coding model that uses targeted reinforcement learning to match Claude Opus 4.7 on sustained tasks.
Cursor has upgraded its proprietary AI coding model with the release of Composer 2.5, focusing on sustained, multi-step work across large codebases. The model uses targeted reinforcement learning to match the performance of frontier models like Claude Opus 4.7 and GPT-5.5 on long-horizon agentic tasks. Cursor directed 85% of its compute budget for this release toward extended training and reinforcement learning environments.
Architecture and Reinforcement Learning
Composer 2.5 continues to build on the open-source Moonshot Kimi K2.5 checkpoint. The training stack relies on Sharded Muon, a distributed implementation of the Muon optimizer, and Dual Mesh HSDP (hybrid sharded data parallelism), both aimed at making large-scale distributed training more efficient.
Cursor scaled up training to 25 times as many synthetic tasks as the previous generation, most of them grounded in real-world codebases. One technique is “feature deletion”: the agent receives a codebase and its test suite, with one feature deliberately removed, and must reimplement that feature so the existing tests pass. Because the tests either pass or fail, this gives the reinforcement learning pipeline a verifiable reward signal.
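Cursor has not published its reward code, so the following is only a minimal Python sketch of how a feature-deletion task can yield a verifiable reward. It assumes a toy harness in which the “codebase” is a single module string and each pre-existing test is a Python callable; the names `FeatureDeletionTask`, `run_tests`, and `reward`, and the all-or-nothing scoring, are illustrative assumptions rather than Cursor’s pipeline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical harness: a test receives the namespace produced by executing
# the agent's source and returns True if the restored feature behaves correctly.
Test = Callable[[dict], bool]


@dataclass
class FeatureDeletionTask:
    stubbed_source: str  # codebase with the feature deliberately removed
    tests: list[Test]    # the existing tests that exercised that feature


def run_tests(source: str, tests: list[Test]) -> list[bool]:
    """Execute the candidate source, then run every test against it."""
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        # Code that fails to load fails every test.
        return [False] * len(tests)
    results = []
    for test in tests:
        try:
            results.append(bool(test(namespace)))
        except Exception:
            results.append(False)  # a crash counts as a failure
    return results


def reward(task: FeatureDeletionTask, agent_source: str) -> float:
    """Binary reward (an assumption): 1.0 only if every existing test passes."""
    results = run_tests(agent_source, task.tests)
    return 1.0 if results and all(results) else 0.0


# Toy task: the `slugify` feature was deleted, but its tests remain.
task = FeatureDeletionTask(
    stubbed_source="def slugify(text):\n    raise NotImplementedError\n",
    tests=[
        lambda ns: ns["slugify"]("Hello World") == "hello-world",
        lambda ns: ns["slugify"]("  Many   Spaces ") == "many-spaces",
    ],
)
agent_attempt = "def slugify(text):\n    return '-'.join(text.lower().split())\n"

print(reward(task, task.stubbed_source))  # 0.0: the stub still raises
print(reward(task, agent_attempt))        # 1.0: both tests pass
```

A real pipeline would run the repository’s own test runner in a sandbox rather than calling `exec`, and might award partial credit (the fraction of tests passing) instead of a binary score; the source does not say which Cursor uses.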
Cursor also introduced learning methods that use textual feedback to tune the model’s behavior. This targeted RL improves the agent’s communication style and effort calibration, usability traits that standard benchmarks often fail to capture.
To handle long contexts, Composer 2.5 is trained with a “compaction-in-the-loop” process. The model learns to summarize its own context effectively, which lets it work through trajectories that run far beyond its native context window.
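The compaction idea can be illustrated with a small Python sketch. It assumes a whitespace word count as a stand-in for a real tokenizer and a `summarize` callback as a stand-in for the model writing its own summary; in the training setup described above, the model learns to write that summary itself. The function names, the token budget, and the keep-the-most-recent-entries policy are all hypothetical.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words stand in for tokenizer tokens.
    return len(text.split())


def compact(history: list[str], budget: int, keep_recent: int,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Fold older entries into one summary once the history exceeds budget.

    The most recent `keep_recent` entries are kept verbatim so the agent
    retains exact details of its latest steps.
    """
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(entry) for entry in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


def toy_summarize(entries: list[str]) -> str:
    # Placeholder for the model-written summary.
    return f"[summary of {len(entries)} earlier entries]"


history: list[str] = []
for step in range(10):
    history.append(f"step {step}: edited file_{step}.py and ran tests")
    history = compact(history, budget=30, keep_recent=2, summarize=toy_summarize)

for entry in history:
    print(entry)
```

Each compaction folds the previous summary into the new one, so the history never holds more than one summary entry. If the retained recent entries alone exceed the budget, this sketch stays over budget; a production loop would need a fallback for that case.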
Cursor concurrently disclosed a strategic collaboration with SpaceXAI to train a future, significantly larger model from scratch using Colossus 2’s million H100-equivalents and 10 times more total compute.
Benchmark Performance
Composer 2.5 reaches parity with top-tier models on complex software engineering evaluations, demonstrating its capacity for multi-file code modifications.
| Benchmark | Composer 2.5 Score |
|---|---|
| SWE-bench Multilingual | 79.8% |
| CursorBench v3.1 | 63.2% |
Pricing and Platform Updates
Composer 2.5 is available immediately within the Cursor IDE. Usage is billed at one of two pricing tiers, depending on latency requirements:
- Standard Tier: $0.50 per 1 million input tokens and $2.50 per 1 million output tokens.
- Fast Tier (Default): $3.00 per 1 million input tokens and $15.00 per 1 million output tokens.
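To see what the gap between tiers means per task, the short Python sketch below prices a hypothetical agent session of 2 million input tokens and 200,000 output tokens at the list prices above. The token counts are invented for illustration, and the sketch ignores any discounts or caching that may apply in practice.

```python
from decimal import Decimal

# Composer 2.5 list prices in USD per 1 million tokens.
PRICES = {
    "standard": {"input": Decimal("0.50"), "output": Decimal("2.50")},
    "fast": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}
MILLION = Decimal(1_000_000)


def session_cost(tier: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of one session at the given tier's per-token rates."""
    rates = PRICES[tier]
    return (rates["input"] * input_tokens
            + rates["output"] * output_tokens) / MILLION


# Hypothetical session size, chosen only for illustration.
for tier in ("standard", "fast"):
    print(tier, session_cost(tier, 2_000_000, 200_000))
```

The session costs $1.50 on the Standard tier and $9.00 on the Fast tier. Because Fast is exactly six times Standard on both input and output, that 6x ratio holds for any mix of tokens.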
The release ships alongside new platform capabilities designed for multi-step agent operations. A new Build in Parallel feature uses asynchronous subagents to work on independent segments of a coding plan concurrently, speeding up complex refactors.
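Cursor has not documented how Build in Parallel schedules its subagents, so the following Python sketch only illustrates the general fan-out idea with `asyncio`. It assumes a plan expressed as segments with explicit dependencies and runs every segment whose dependencies are complete as one concurrent wave; `run_subagent` and `build_in_parallel` are hypothetical names, and the segment names are invented.

```python
import asyncio


async def run_subagent(segment: str) -> str:
    # Hypothetical stand-in for a subagent working on one plan segment.
    await asyncio.sleep(0.01)
    return f"{segment}: done"


async def build_in_parallel(plan: dict[str, set[str]]) -> list[list[str]]:
    """Run plan segments in waves; segments within a wave run concurrently.

    `plan` maps each segment to the set of segments it depends on.
    """
    done: set[str] = set()
    remaining = dict(plan)
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(seg for seg, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("plan has a cycle or an unknown dependency")
        # Independent segments are dispatched together and awaited as a group.
        await asyncio.gather(*(run_subagent(seg) for seg in ready))
        for seg in ready:
            done.add(seg)
            del remaining[seg]
        waves.append(ready)
    return waves


plan = {
    "rename-api": set(),
    "update-docs": set(),
    "migrate-callers": {"rename-api"},
    "run-tests": {"migrate-callers", "update-docs"},
}
print(asyncio.run(build_in_parallel(plan)))
# [['rename-api', 'update-docs'], ['migrate-callers'], ['run-tests']]
```

A wave barrier is the simplest correct schedule, but it can leave subagents idle; a scheduler that launches each segment as soon as its own dependencies finish would complete sooner when segments take uneven amounts of time.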
For development teams requiring controlled execution, Cursor added Development Environments for Cloud Agents. Users can configure Dockerfile-based environments with build secrets and layer caching, allowing agents to run end-to-end tasks in an isolated cloud setup. A new Microsoft Teams integration also enables developers to delegate tasks directly to cloud agents by mentioning @Cursor in chat threads.
For teams managing complex applications, these updates shift the IDE agent’s role from localized code generation toward persistent, repository-wide execution.