What Is Continued Pretraining in AI?
Continued pretraining adapts a general LLM to a specific domain using large unlabeled data. How it works, how it differs from fine-tuning, and real examples.
When Meta wanted a coding model, they did not train one from scratch. They took Llama 2 and trained it further on 500 billion tokens of code. The result was Code Llama. When Cursor built Composer 2, they took Moonshot AI’s Kimi K2.5 and trained it further on software engineering data before applying reinforcement learning. Both used the same technique: continued pretraining.
Continued pretraining is the process of taking an already-trained language model and training it further on new data, usually from a specific domain. It is one of the most common steps in building specialized AI products, and it sits at a distinct point in the training pipeline that is often confused with fine-tuning.
The Three Stages of Model Training
To understand continued pretraining, it helps to see where it fits in the full pipeline.
Stage 1: Pretraining (from scratch)
This is where a model first learns language. The model starts with random weights and reads trillions of tokens from a broad dataset: books, websites, code, scientific papers, conversations. The training objective is simple: given a sequence of tokens, predict the next one.
After trillions of predictions and weight updates, the model develops a general understanding of language, facts, logic, code, and reasoning patterns. This is the most expensive stage. Training a frontier model from scratch costs tens or hundreds of millions of dollars in compute.
The output is a base model: a general-purpose text predictor that has absorbed broad knowledge but has not been trained for any specific task or behavior.
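The next-token objective can be made concrete with a toy sketch (plain Python, hypothetical three-token vocabulary): the model emits a score for every token in the vocabulary, and training minimizes the negative log-probability of the token that actually came next.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: raw model scores, one float per vocabulary token
    target_id: index of the token that actually came next
    """
    # Softmax over the vocabulary, written out explicitly.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    prob_of_target = exps[target_id] / sum(exps)
    # Training minimizes -log p(correct next token).
    return -math.log(prob_of_target)

# Toy vocabulary: 0="the", 1="cat", 2="sat".
# After "the cat", a well-trained model scores "sat" highest.
loss_confident = next_token_loss([0.1, 0.2, 3.0], target_id=2)
loss_uncertain = next_token_loss([1.0, 1.0, 1.0], target_id=2)
assert loss_confident < loss_uncertain  # better prediction, lower loss
```

Every stage described below reuses this same loss; only the data feeding it changes.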
Stage 2: Continued pretraining (domain adaptation)
Continued pretraining takes the base model and trains it further on a large, domain-specific dataset. The training objective stays the same: predict the next token. But the data changes.
Instead of a broad internet-scale dataset, the model now reads large volumes of text from a specific domain: medical literature, legal documents, financial filings, source code, or whatever the target specialty is.
The model is not learning a new skill. It is deepening its knowledge in one area. The general language understanding from stage 1 is preserved, but the model develops much stronger representations of domain-specific terminology, patterns, and reasoning.
Stage 3: Fine-tuning (behavior alignment)
Fine-tuning takes the model (after pretraining, or after continued pretraining) and trains it on a smaller, labeled dataset to teach it how to behave. The data is structured as input-output pairs: questions and answers, instructions and completions, conversations with appropriate responses.
This is where the model learns to follow instructions, answer questions helpfully, refuse harmful requests, and generally act like an assistant rather than a raw text predictor.
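The shape of fine-tuning data can be sketched as follows. The template below is purely illustrative (real chat templates vary by model), but it shows the structural difference from pretraining data: each record is a labeled input-output pair rather than raw text.

```python
# Hypothetical instruction-tuning template; real formats differ per model.
def format_example(instruction, response):
    """Render one labeled pair into the text the model trains on.

    In practice the loss is typically applied only to the response
    tokens, so the model learns to produce answers, not questions.
    """
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

pair = {
    "instruction": "What does continued pretraining change?",
    "response": "The model's knowledge and internal representations.",
}
text = format_example(pair["instruction"], pair["response"])
assert text.startswith("### Instruction:")
```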
Continued Pretraining vs. Fine-Tuning
These two steps are frequently confused because both involve “training an existing model on new data.” The difference is fundamental:
|  | Continued Pretraining | Fine-Tuning |
|---|---|---|
| Purpose | Teach the model what to know | Teach the model how to behave |
| Data type | Large, raw, unlabeled text | Small, structured, labeled pairs |
| Data scale | Billions of tokens | Thousands to millions of examples |
| Training objective | Next-token prediction (same as original pretraining) | Supervised learning on input-output pairs |
| What changes | The model’s knowledge and representations | The model’s output patterns and style |
A useful analogy: pretraining is a general education. Continued pretraining is a specialized degree. Fine-tuning is job training. You need the degree (domain knowledge) before the job training (task behavior) has much to work with.
If you fine-tune a general model directly on medical question-answer pairs, it will learn to format medical answers, but it may lack the deep domain knowledge needed to get the content right. If you first do continued pretraining on medical literature and then fine-tune on medical QA pairs, the model has both the knowledge and the behavioral patterns.
Research bears this out. A 2024 study across health, chemistry, and coding domains found that the optimal allocation of training compute heavily favors continued pretraining: approximately 99.99% of the token budget should go to continued pretraining, with only a thin slice for fine-tuning. The continued pretraining provides the knowledge foundation; fine-tuning is a light shaping step on top.
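To see what that allocation means in practice, here is a small arithmetic sketch (the 10-billion-token budget is hypothetical; the split echoes the ratio the study reports):

```python
def split_token_budget(total_tokens, cpt_basis_points=9999):
    """Split a token budget between continued pretraining and fine-tuning.

    9999 basis points = the roughly 99.99% continued-pretraining
    allocation reported as compute-optimal. Integer arithmetic keeps
    the split exact.
    """
    cpt = total_tokens * cpt_basis_points // 10_000
    return cpt, total_tokens - cpt

# A hypothetical 10-billion-token domain-adaptation budget:
cpt_tokens, ft_tokens = split_token_budget(10_000_000_000)
assert cpt_tokens == 9_999_000_000  # knowledge foundation
assert ft_tokens == 1_000_000       # thin fine-tuning slice on top
```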
Real-World Examples
Code Llama (Meta, 2023)
Meta took Llama 2 (a general-purpose LLM) and ran continued pretraining on 500 billion tokens of code and code-related data. For the 70B variant, they used 1 trillion tokens. The dataset included publicly available source code plus natural language discussions about code (Stack Overflow posts, GitHub issues, documentation).
The result: a model that understands code far better than the general Llama 2, while still retaining general language ability. Code Llama then went through additional stages (Python specialization, instruction tuning) to produce the final variants.
Cursor Composer 2 (Anysphere, 2026)
Cursor took Kimi K2.5, a general-purpose MoE model with 1 trillion parameters, and ran continued pretraining on software engineering data. Cursor describes this step as providing “a far stronger base to scale our reinforcement learning.” The continued pretraining gave the model deeper understanding of real-world codebases, programming patterns, and development workflows before RL training taught it to act as a coding agent.
Meditron (EPFL + Yale, 2024-2025)
Researchers took Llama 3.1 and ran continued pretraining on curated medical data: textbooks, filtered PubMed Central articles, and clinical practice guidelines. The resulting Llama-3-Meditron outperforms GPT-4 and Med-PaLM 2 on medical benchmarks (MedMCQA, MedQA, PubMedQA), despite being a much smaller model. The domain-specific pretraining gave it medical knowledge that a general model simply does not have.
Llama3-SEC (2024)
A 70B Llama 3 model adapted to the financial domain through continued pretraining on SEC regulatory filings and financial documents. This model was combined with model merging techniques to preserve general capabilities while adding deep financial knowledge.
The Catastrophic Forgetting Problem
The central challenge of continued pretraining is catastrophic forgetting: when the model learns new domain knowledge, it can lose previously learned general knowledge. Train too aggressively on medical text and the model might become worse at general reasoning, coding, or casual conversation.
This is not a theoretical risk. It is the main failure mode of continued pretraining done poorly.
Three strategies address it:
Learning rate management
The learning rate controls how aggressively the model updates its weights. During continued pretraining, you typically start with a learning rate warmup (gradually increasing the rate from near-zero), then follow a decay schedule (gradually decreasing it). This prevents the initial training steps from making large, destabilizing changes to the model’s existing knowledge.
The warmup period is usually short (1-3% of total training steps). The peak learning rate is generally lower than the original pretraining rate, since the goal is adaptation rather than learning from scratch.
Recent research from EleutherAI found that the combination of learning rate re-warming and re-decaying can match the performance of training from scratch, at a fraction of the compute.
Data replay
During continued pretraining on domain-specific data, you mix in a percentage of the original pretraining data (or a representative subset). This replay gives the model periodic reminders of its general knowledge, preventing it from drifting too far toward the new domain.
The ratio matters. Too much replay wastes budget on data the model already knows. Too little allows forgetting. Common ratios range from 5% to 20% replay data. A 2025 study found that replay can actually improve performance on the target domain, not just prevent forgetting, boosting data efficiency by up to 1.87x for fine-tuning and 2.06x for mid-training.
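Replay mixing can be sketched as a batch sampler that reserves a fixed fraction of each batch for general data (the document names and 10% ratio below are hypothetical; the ratio sits inside the common 5-20% range):

```python
import random

def mixed_batch(domain_docs, general_docs, batch_size, replay_ratio=0.1, seed=0):
    """Sample one training batch with a fixed fraction of replay data.

    replay_ratio=0.1 keeps 10% of the batch as original-pretraining-style
    text, giving the model periodic reminders of its general knowledge.
    """
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_ratio)
    batch = [rng.choice(general_docs) for _ in range(n_replay)]
    batch += [rng.choice(domain_docs) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)  # interleave replay and domain examples
    return batch

domain = [f"medical_doc_{i}" for i in range(100)]   # hypothetical corpus
general = [f"web_doc_{i}" for i in range(100)]      # hypothetical replay set
batch = mixed_batch(domain, general, batch_size=32, replay_ratio=0.1)
assert len(batch) == 32
assert sum(doc.startswith("web_") for doc in batch) == 3  # round(32 * 0.1)
```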
Data mixing during original pretraining
A more recent approach (Tirumala et al.) shows that mixing a small amount of domain-relevant data into the original pretraining builds resistance to forgetting during later continued pretraining. Even a few percent of domain data in the pretraining mix makes the model significantly more robust to adaptation. This is complementary to replay and learning rate strategies.
When To Use Continued Pretraining
Continued pretraining makes sense when:
- The target domain has specialized vocabulary, patterns, or knowledge that a general model lacks. Medicine, law, finance, and specific programming frameworks are common targets.
- You have large volumes of domain text (hundreds of millions to billions of tokens). If you only have a small labeled dataset, fine-tuning alone may be sufficient.
- You need the model to understand the domain, not just mimic its format. Fine-tuning can teach a model to format answers as a doctor would. Continued pretraining teaches it to reason about medical knowledge at depth.
Continued pretraining is typically not needed when:
- The general model already handles the domain well. If GPT-4 already answers your domain questions accurately, the base knowledge is probably sufficient and fine-tuning alone may work.
- You only have a small amount of domain data. Continued pretraining requires scale. With only a few thousand examples, fine-tuning or RAG is the better approach.
- Your use case is primarily about output format, not domain knowledge. Teaching a model to produce JSON output or respond in a specific style is a fine-tuning problem, not a continued pretraining problem.
The Practical Economics
Continued pretraining is significantly cheaper than pretraining from scratch, but it is not cheap. Code Llama’s 500 billion tokens of training, while a fraction of Llama 2’s original pretraining, still required substantial GPU time. Cursor’s continued pretraining run on Kimi K2.5 used thousands of NVIDIA GPUs.
For most teams, the economics work like this: continued pretraining costs 5-20% of what full pretraining would cost, but delivers most of the domain-specific gains. The key efficiency comes from not having to re-learn general language understanding, since the base model already has it.
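As a rough back-of-the-envelope check on that 5-20% range (the $50M from-scratch figure below is a hypothetical round number, not a quoted cost):

```python
def cpt_cost_range(full_pretraining_cost, low=0.05, high=0.20):
    """Continued pretraining typically lands at 5-20% of a from-scratch run.

    Returns the (low, high) ends of that range for a given full cost.
    """
    return full_pretraining_cost * low, full_pretraining_cost * high

# Hypothetical: a from-scratch pretraining run costing $50M in compute.
lo, hi = cpt_cost_range(50_000_000)
assert abs(lo - 2_500_000) < 1    # ~$2.5M at the cheap end
assert abs(hi - 10_000_000) < 1   # ~$10M at the expensive end
```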
If you are building on an existing API (not training your own model), continued pretraining is handled by the model provider. The decision for you is which model to pick: one that has already been adapted to your domain through continued pretraining, or a general-purpose model that you enhance with RAG or fine-tuning at the application layer.
Key Takeaways
- Continued pretraining teaches a model what to know. Fine-tuning teaches it how to behave. They solve different problems and are usually done in sequence.
- The training objective is the same as original pretraining: next-token prediction. The difference is the data, which is narrower and domain-specific.
- Catastrophic forgetting is the main risk. Learning rate management, data replay, and early data mixing are the standard mitigations.
- Scale matters. Continued pretraining needs hundreds of millions to billions of tokens of domain data. For smaller datasets, fine-tuning or RAG is the better tool.
- Most specialized models you use were built this way. Code Llama, Cursor’s Composer 2, medical models like Meditron: all started from a general base and went through continued pretraining before anything else.