How Cursor Built Composer 2 on Top of Kimi K2.5
Cursor’s Composer 2 is built on Kimi K2.5. What continued pretraining, reinforcement learning, and self-summarization mean, and how they work.
Cursor’s latest coding model, Composer 2, was built on top of Kimi K2.5, an open-source model from Beijing-based Moonshot AI. The connection was discovered by a developer who found the internal model identifier kimi-k2p5-rl-0317-s515-fast, and confirmed by Cursor shortly after.
The interesting part is not the controversy around attribution. It is the training pipeline. Going from a general-purpose open-source model to a specialized coding agent involves a specific sequence of techniques, and Cursor has published enough detail across their blog posts to piece together what that sequence looks like.
This post breaks down each stage of that pipeline: what the base model is, what Cursor did to it, and what all of the relevant terms actually mean.
The Starting Point: Kimi K2.5
Every trained model starts from something. In Cursor’s case, that something is Kimi K2.5, released by Moonshot AI on January 27, 2026.
Kimi K2.5 has 1 trillion total parameters, but it does not use all of them at once. It is built on a Mixture-of-Experts (MoE) architecture, which is worth understanding because it explains why this model can be both massive and affordable to run.
Mixture-of-Experts, explained
In a standard (dense) language model, every parameter participates in processing every token. If you have a 70-billion-parameter model, all 70 billion parameters activate for every word.
An MoE model works differently. Instead of one large feed-forward network, it has many smaller ones called experts. Kimi K2.5 has 384 of them. For each token the model processes, a router (a small neural network) looks at the input and picks which experts should handle it. Only 8 experts activate per token, plus one shared expert that always runs.
The result: 1 trillion parameters’ worth of stored knowledge, but only about 32 billion doing work at any given moment. You get the capacity of a huge model at the compute cost of a much smaller one.
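The routing step can be sketched in a few lines. This is a toy illustration, not Kimi’s actual router: the expert count and top-k come from the numbers above, while the hidden size and the random router weights are made up.

```python
import math
import random

random.seed(0)

# Toy top-k expert routing, using the article's numbers: 384 experts,
# 8 routed per token (plus one always-on shared expert). The hidden
# size and random weights are illustrative, not Kimi K2.5's.
NUM_EXPERTS = 384
TOP_K = 8
HIDDEN = 16

router = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
          for _ in range(HIDDEN)]

def route(token_hidden):
    # One score per expert: a small linear layer over the token's hidden state.
    scores = [sum(h * router[i][e] for i, h in enumerate(token_hidden))
              for e in range(NUM_EXPERTS)]
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__)[-TOP_K:]
    # Softmax over the chosen experts gives their mixing weights.
    m = max(scores[e] for e in top)
    w = [math.exp(scores[e] - m) for e in top]
    total = sum(w)
    return top, [x / total for x in w]

token = [random.gauss(0, 1) for _ in range(HIDDEN)]
experts, weights = route(token)
# Only these 8 experts (plus the shared one) run for this token;
# the other 376 sit idle.
```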
This is not unique to Kimi. DeepSeek V3, Mixtral, and several other recent models use MoE. It has become the dominant architecture for building large, cost-efficient models.
The license
Kimi K2.5 is released under a Modified MIT License. Standard MIT lets you do almost anything. The modification adds one requirement: if your product exceeds 100 million monthly active users or $20 million in monthly revenue, you must display “Kimi K2.5” in the interface. Cursor’s parent company Anysphere has an ARR exceeding $2 billion, which puts it well above that threshold.
Stage 1: Continued Pretraining
The first thing Cursor did was continued pretraining. This is the least intuitive of the training stages because it sounds like it should be the same as the original training, but it is a distinct step with a different purpose.
What pretraining is
When a model like Kimi K2.5 is originally trained, it reads trillions of tokens of text from across the internet: books, websites, code, conversations, scientific papers. The training objective is simple: predict the next token. Given “The cat sat on the”, predict “mat”. Do this across trillions of examples and the model develops a general understanding of language, facts, reasoning patterns, and code.
This original pretraining is enormously expensive. Kimi K2.5 was pretrained on approximately 15 trillion tokens.
What continued pretraining adds
Continued pretraining takes the already-trained model and trains it further, usually on a narrower, domain-specific dataset. The learning objective is the same (predict the next token), but the data is different.
For Cursor, this means feeding the model large volumes of source code, software engineering documentation, commit histories, bug reports, and other programming-related text. The model is not learning a new skill. It is deepening its existing understanding of one specific domain.
Think of original pretraining as a broad university education. Continued pretraining is a specialized residency. The model already knows how language and code work in general. Now it learns the specific patterns, idioms, and structures of real-world software engineering in much greater depth.
This is different from fine-tuning. Fine-tuning uses smaller, labeled datasets (question-answer pairs, instruction-response examples) to teach a model how to behave. Continued pretraining uses large, unlabeled datasets to teach a model what to know. Cursor’s blog describes this step as providing “a far stronger base to scale our reinforcement learning.”
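The shared objective is just next-token cross-entropy; only the data stream changes between original and continued pretraining. A minimal sketch, with a made-up four-token vocabulary and invented scores:

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy of the correct next token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

# Original pretraining applies this loss to broad web text; continued
# pretraining applies the exact same loss to code-heavy sequences
# (source files, commit histories, bug reports, documentation).
logits = [2.0, 0.1, -1.0, 0.5]   # model's scores over a tiny toy vocab
loss = next_token_loss(logits, target_id=0)
print(round(loss, 3))            # lower loss = better prediction
```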
Stage 2: Reinforcement Learning
After continued pretraining, Cursor trains the model using reinforcement learning (RL). This is where the model goes from knowing about code to being good at writing and editing code as an agent.
How RL differs from other training
In supervised training, you show the model examples: “Given this input, produce this output.” The model learns by imitation.
In reinforcement learning, the model learns by trial and error. You give it a task, let it attempt a solution, and then give it a reward signal: a score that tells it how well it did. Over thousands of attempts, the model learns which strategies lead to higher rewards and adjusts its behavior accordingly.
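A minimal picture of that loop, with two hypothetical “strategies” and invented success rates. The real machinery (policy-gradient updates over a transformer) is far more involved, but the learning signal has this shape: try, score, shift toward what worked.

```python
import random

random.seed(0)

# Two hypothetical strategies with different (hidden) true success rates.
true_rate = {"guess_blindly": 0.1, "read_tests_first": 0.9}
value = {s: 0.0 for s in true_rate}   # learned reward estimates
count = {s: 0 for s in true_rate}

for step in range(1000):
    # Epsilon-greedy: mostly exploit the best-looking strategy, sometimes explore.
    if random.random() < 0.1:
        s = random.choice(list(true_rate))
    else:
        s = max(value, key=value.get)
    reward = 1.0 if random.random() < true_rate[s] else 0.0
    count[s] += 1
    value[s] += (reward - value[s]) / count[s]  # running average of rewards

print(max(value, key=value.get))  # report which strategy the loop now favors
```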
How Cursor applies RL
Cursor sets up sandboxed coding environments, hundreds of thousands of them running concurrently. Each environment contains a real codebase and a real software engineering problem. The model gets access to the same tools it will have in production:
- File editing (reading and writing code)
- Semantic search (finding relevant code across a codebase)
- Grep (searching for strings)
- Terminal commands (running tests, installing packages, checking output)
The model attempts to solve the problem using these tools. If the solution works (tests pass, the code compiles, the edit is correct), it gets a high reward. If it fails, low reward. The model’s weights update to favor the strategies that led to success.
The infrastructure for these sandboxed environments is the same system that powers Cursor’s Background Agents. Training happens in the exact same kind of environment where the model will eventually run. This matters because the model learns tool-use patterns that directly transfer to production.
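Schematically, one rollout looks like the following. Everything here is a stand-in: the real environments hold full codebases and the reward comes from running actual test suites, not from a stubbed check, and the filenames and the “policy” are invented for illustration.

```python
# One RL rollout in a toy sandboxed coding environment: the policy uses
# tools (read/edit) to change files, then the reward is whether tests pass.
class SandboxEnv:
    def __init__(self):
        # A tiny "codebase" seeded with a bug (subtraction instead of addition).
        self.files = {"math_utils.py": "def add(a, b):\n    return a - b\n"}

    def read_file(self, path):
        return self.files[path]

    def edit_file(self, path, new_text):
        self.files[path] = new_text

    def run_tests(self):
        # Stand-in for running the repo's test suite: check add(2, 3) == 5.
        ns = {}
        exec(self.files["math_utils.py"], ns)
        return ns["add"](2, 3) == 5

def rollout(env, policy):
    """One attempt: the policy edits the codebase; reward = tests pass."""
    policy(env)
    return 1.0 if env.run_tests() else 0.0

# A hypothetical policy that finds and fixes the seeded bug.
def fix_bug(env):
    src = env.read_file("math_utils.py")
    env.edit_file("math_utils.py", src.replace("a - b", "a + b"))

print(rollout(SandboxEnv(), fix_bug))  # high reward when the fix lands
```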
Emergent behaviors
One of the more interesting findings from Cursor’s RL training: the model learns useful behaviors that were never explicitly taught. During training, Composer started performing complex multi-step searches, fixing linter errors after edits, and writing unit tests to verify its own changes. These behaviors emerged because they consistently led to higher rewards, not because anyone programmed them in.
This is a key property of RL. You define what success looks like (the reward), and the model discovers how to get there on its own.
Stage 3: Self-Summarization
Real coding tasks are long. A debugging session might involve reading dozens of files, running tests, forming hypotheses, and iterating. The total context can easily exceed a model’s context window (the maximum amount of text it can consider at once).
Most agent systems handle this by compacting the context when it gets too long: either by running a separate summarization model or by sliding a window that drops older context. Both approaches lose information.
Cursor’s approach is different. They train the model to summarize itself.
How self-summarization works
As Composer works through a task and approaches a fixed token-length trigger, it pauses. The system inserts a prompt asking the model to summarize its current context. The model writes a compressed version of everything that has happened so far: what it has tried, what worked, what failed, what the current plan is. Then it continues from that summary.
The difference from standard summarization: this behavior is part of the RL training loop. The model’s self-summaries directly affect its reward. If a summary preserves the right information and the model goes on to solve the problem, both the solution and the summary get reinforced. If the summary drops something critical and the model fails, both get penalized.
Over training, the model learns exactly what information matters enough to keep. Cursor reports that trained self-summaries average about 1,000 tokens, compared to 5,000+ tokens from a traditional prompted summarization approach, while reducing compaction errors by 50%.
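The trigger mechanics can be sketched as follows. `model_summarize` is a placeholder for a call back into the model itself, and the budget and whitespace token counting are crude stand-ins, not Cursor’s actual values.

```python
TOKEN_BUDGET = 100_000  # illustrative threshold, not Cursor's real one

def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def model_summarize(context):
    # Placeholder: the real system prompts the model to compress its own
    # context (what it tried, what failed, the current plan) into ~1k tokens.
    return "SUMMARY: attempted X; tests failed on Y; current plan is Z."

def append_turn(context, turn):
    """Add a turn; if over budget, continue from a self-written summary."""
    context = context + "\n" + turn
    if count_tokens(context) > TOKEN_BUDGET:
        context = model_summarize(context)
    return context
```

In the trained setup the summary itself is part of what RL scores: a summary that keeps the right state earns reward when the task later succeeds.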
Self-Summarization in Practice
In a Terminal-Bench 2.0 challenge called “make-doom-for-mips” (cross-compile Doom for a MIPS virtual machine), an early Composer checkpoint solved the problem over 170 turns. Along the way, it self-summarized more than 100,000 tokens of context down to roughly 1,000 tokens that captured the essential state needed to keep working.
That is the value of training summarization into the model rather than bolting it on as an external step. The model itself decides what to remember.
The Compute Split
According to Cursor, approximately one quarter of the total compute spent on Composer 2 came from the Kimi K2.5 base model. The remaining three quarters came from their own continued pretraining and RL training.
This ratio matters for understanding what Cursor actually built. The base model provides the foundational language understanding, code knowledge, and reasoning ability. Cursor’s training on top of that provides the specialization: the tool-use patterns, the coding agent behavior, the self-summarization capability, and the speed optimizations that make it fast enough for interactive use.
The training infrastructure runs on thousands of NVIDIA GPUs using custom PyTorch and Ray pipelines with native MXFP8 precision (a low-precision number format that speeds up both training and inference). Fireworks AI provides the inference hosting and also offers full-parameter RL tuning for Kimi K2.5 on its platform.
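The idea behind microscaling formats like MXFP8 is that a small block of values shares one power-of-two scale while each element is stored in very few bits. A toy sketch of that shared-scale idea (quantizing to a narrow integer grid rather than real FP8 elements):

```python
import math

BLOCK = 32  # MX formats share one scale per 32-element block

def quantize_block(values):
    """Share one power-of-two scale across the block; store narrow elements."""
    amax = max(abs(v) for v in values) or 1.0
    scale = 2.0 ** math.floor(math.log2(amax))
    # Real MXFP8 stores FP8 elements; we use a clamped integer grid instead.
    elems = [max(-127, min(127, round(v / scale * 64))) for v in values]
    return scale, elems

def dequantize_block(scale, elems):
    return [e * scale / 64 for e in elems]

vals = [0.5, -1.25, 3.0, 0.01] + [0.0] * 28
scale, q = quantize_block(vals)
approx = dequantize_block(scale, q)
```

The payoff is bandwidth and compute: one full-precision scale per block instead of per value, with each element cheap to move and multiply.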
What Changed in the Benchmarks
The training pipeline produces measurable results:
| Model | CursorBench | Terminal-Bench 2.0 | SWE-bench Multilingual |
|---|---|---|---|
| Composer 2 | 61.3 | 61.7 | 73.7 |
| Composer 1.5 | 44.2 | 47.9 | 65.9 |
| Composer 1 | 38.0 | 40.0 | 56.9 |
CursorBench uses real agent requests from engineers at Cursor with hand-curated optimal solutions. Terminal-Bench 2.0 is an external benchmark maintained by the Laude Institute. The jump from Composer 1 to Composer 2 is substantial across all three, and Cursor attributes it primarily to the continued pretraining step providing a stronger base for RL.
The Bigger Pattern
Cursor’s approach is not unique. It is becoming the standard recipe for building specialized AI products:
- Start with a strong open-source base model (Kimi K2.5, Llama, Qwen, DeepSeek)
- Run continued pretraining on domain-specific data
- Apply reinforcement learning in realistic environments
- Add product-specific techniques (self-summarization, tool integration, speed optimization)
This is the same pattern behind most AI coding assistants and increasingly behind agent frameworks as well. The rise of open-source models from Chinese labs has made strong base models freely available, shifting the competitive advantage from “who has the best base model” to “who does the best domain-specific training and product integration.”
The base model is the foundation. The continued pretraining is the specialization. The RL is the behavior. And the product decisions around tooling, speed, and workflow integration are what the user actually experiences.
Understanding this pipeline makes it easier to evaluate any AI product’s claims about its model. When a company says they built a “proprietary” model, the question to ask is: proprietary from what starting point, and what was done on top of it?