
What Is Continued Pretraining in AI?

Continued pretraining adapts a general LLM to a specific domain using large volumes of unlabeled data. Here is how it works, how it differs from fine-tuning, and some real examples.

When Meta wanted a coding model, they did not train one from scratch. They took Llama 2 and trained it further on 500 billion tokens of code. The result was Code Llama. When Cursor built Composer 2, they took Moonshot AI’s Kimi K2.5 and trained it further on software engineering data before applying reinforcement learning. Both used the same technique: continued pretraining.

Continued pretraining is the process of taking an already-trained language model and training it further on new data, usually from a specific domain. It is one of the most common steps in building specialized AI products, and it sits at a distinct point in the training pipeline that is often confused with fine-tuning.

The Three Stages of Model Training

To understand continued pretraining, it helps to see where it fits in the full pipeline.

Stage 1: Pretraining (from scratch)

This is where a model first learns language. The model starts with random weights and reads trillions of tokens from a broad dataset: books, websites, code, scientific papers, conversations. The training objective is simple: given a sequence of tokens, predict the next one.
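The next-token objective can be made concrete with a small sketch. This is not any lab's actual training code, just a NumPy illustration of the loss being minimized: at each position, the model's scores over the vocabulary are compared against the token that actually comes next.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of next-token prediction.

    logits:    (seq_len, vocab_size) scores the model assigns at each position
    token_ids: (seq_len,) the actual token sequence
    The prediction at position t is scored against the token at position t+1.
    """
    preds, targets = logits[:-1], token_ids[1:]  # shift by one position
    # Numerically stable softmax over the vocabulary
    shifted = preds - preds.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Probability the model assigned to each correct next token
    target_probs = probs[np.arange(len(targets)), targets]
    return -np.log(target_probs).mean()
```

A model that always puts its probability mass on the correct next token drives this loss toward zero; a model that guesses uniformly over a vocabulary of size V gets a loss of ln(V). Every stage described below minimizes exactly this quantity, only the data changes.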

After trillions of predictions and weight updates, the model develops a general understanding of language, facts, logic, code, and reasoning patterns. This is the most expensive stage. Training a frontier model from scratch costs tens or hundreds of millions of dollars in compute.

The output is a base model: a general-purpose text predictor that has absorbed broad knowledge but has not been trained for any specific task or behavior.

Stage 2: Continued pretraining (domain adaptation)

Continued pretraining takes the base model and trains it further on a large, domain-specific dataset. The training objective stays the same: predict the next token. But the data changes.

Instead of a broad internet-scale dataset, the model now reads large volumes of text from a specific domain: medical literature, legal documents, financial filings, source code, or whatever the target specialty is.

The model is not learning a new skill. It is deepening its knowledge in one area. The general language understanding from stage 1 is preserved, but the model develops much stronger representations of domain-specific terminology, patterns, and reasoning.
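The key point, that the training function stays identical and only the corpus changes, can be shown with a toy stand-in. Here a bigram counter plays the role of the neural LM (an enormous simplification, purely for illustration): stage 1 and stage 2 call the same `train_bigrams` function, just on different data, and the domain pass adds medical statistics without erasing the general ones.

```python
from collections import Counter

def train_bigrams(model, corpus):
    """Toy stand-in for next-token training: accumulate next-token counts.

    Pretraining and continued pretraining use this same function;
    only the corpus fed in differs.
    """
    for doc in corpus:
        tokens = doc.split()
        for current, nxt in zip(tokens, tokens[1:]):
            model[(current, nxt)] += 1
    return model

# Stage 1: pretraining on broad general text (toy corpus)
model = train_bigrams(Counter(), ["the cat sat", "the dog ran"])

# Stage 2: continued pretraining, same objective, domain corpus
model = train_bigrams(model, ["the patient presented with chest pain"])
```

After stage 2, the model still knows that "cat" follows "the" (general knowledge preserved) while also knowing that "patient" can follow "the" (domain knowledge added). In a real LLM the updates are gradient steps on billions of parameters rather than counts, which is why forgetting becomes a genuine risk, as discussed later.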

Stage 3: Fine-tuning (behavior alignment)

Fine-tuning takes the model (after pretraining, or after continued pretraining) and trains it on a smaller, labeled dataset to teach it how to behave. The data is structured as input-output pairs: questions and answers, instructions and completions, conversations with appropriate responses.

This is where the model learns to follow instructions, answer questions helpfully, refuse harmful requests, and generally act like an assistant rather than a raw text predictor.
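The structured, labeled nature of fine-tuning data is worth seeing concretely. A minimal sketch below renders one instruction-response pair into a training string, using an Alpaca-style template and an `</s>` end-of-sequence marker as illustrative assumptions (real templates vary by model family). The important detail is the loss boundary: the model is typically graded only on producing the response, not on regenerating the prompt.

```python
def build_sft_example(instruction, response, eos="</s>"):
    """Render one labeled pair as a training string and mark where
    the supervised loss should begin (everything before it is context)."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    full = prompt + response + eos
    return full, len(prompt)  # loss applies only to full[len(prompt):]
```

Contrast this with continued pretraining, where there is no prompt/response split at all: every token of raw domain text contributes to the loss.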

Continued Pretraining vs. Fine-Tuning

These two steps are frequently confused because both involve “training an existing model on new data.” The difference is fundamental:

| | Continued Pretraining | Fine-Tuning |
|---|---|---|
| Purpose | Teach the model what to know | Teach the model how to behave |
| Data type | Large, raw, unlabeled text | Small, structured, labeled pairs |
| Data scale | Billions of tokens | Thousands to millions of examples |
| Training objective | Next-token prediction (same as original pretraining) | Supervised learning on input-output pairs |
| What changes | The model's knowledge and representations | The model's output patterns and style |

A useful analogy: pretraining is a general education. Continued pretraining is a specialized degree. Fine-tuning is job training. You need the degree (domain knowledge) before the job training (task behavior) has much to work with.

If you fine-tune a general model directly on medical question-answer pairs, it will learn to format medical answers, but it may lack the deep domain knowledge needed to get the content right. If you first do continued pretraining on medical literature and then fine-tune on medical QA pairs, the model has both the knowledge and the behavioral patterns.

Research bears this out. A 2024 study across health, chemistry, and coding domains found that the optimal allocation of training compute heavily favors continued pretraining: approximately 99.99% of the token budget should go to continued pretraining, with only a thin slice for fine-tuning. The continued pretraining provides the knowledge foundation; fine-tuning is a light shaping step on top.

Real-World Examples

Code Llama (Meta, 2023)

Meta took Llama 2 (a general-purpose LLM) and ran continued pretraining on 500 billion tokens of code and code-related data. For the 70B variant, they used 1 trillion tokens. The dataset included publicly available source code plus natural language discussions about code (Stack Overflow posts, GitHub issues, documentation).

The result: a model that understands code far better than the general Llama 2, while still retaining general language ability. Code Llama then went through additional stages (Python specialization, instruction tuning) to produce the final variants.

Cursor Composer 2 (Anysphere, 2026)

Cursor took Kimi K2.5, a general-purpose MoE model with 1 trillion parameters, and ran continued pretraining on software engineering data. Cursor describes this step as providing “a far stronger base to scale our reinforcement learning.” The continued pretraining gave the model deeper understanding of real-world codebases, programming patterns, and development workflows before RL training taught it to act as a coding agent.

Meditron (EPFL + Yale, 2024-2025)

Researchers took Llama 3.1 and ran continued pretraining on curated medical data: textbooks, filtered PubMed Central articles, and clinical practice guidelines. The resulting Llama-3-Meditron outperforms GPT-4 and MedPaLM-2 on medical benchmarks (MedMCQA, MedQA, PubMedQA), despite being a much smaller model. The domain-specific pretraining gave it medical knowledge that a general model simply does not have.

Llama3-SEC (2024)

A 70B Llama 3 model adapted to the financial domain through continued pretraining on SEC regulatory filings and financial documents. This model was combined with model merging techniques to preserve general capabilities while adding deep financial knowledge.

The Catastrophic Forgetting Problem

The central challenge of continued pretraining is catastrophic forgetting: when the model learns new domain knowledge, it can lose previously learned general knowledge. Train too aggressively on medical text and the model might become worse at general reasoning, coding, or casual conversation.

This is not a theoretical risk. It is the main failure mode of continued pretraining done poorly.

Three strategies address it:

Learning rate management

The learning rate controls how aggressively the model updates its weights. During continued pretraining, you typically start with a learning rate warmup (gradually increasing the rate from near-zero), then follow a decay schedule (gradually decreasing it). This prevents the initial training steps from making large, destabilizing changes to the model’s existing knowledge.

The warmup period is usually short (1-3% of total training steps). The peak learning rate is generally lower than the original pretraining rate, since the goal is adaptation rather than learning from scratch.
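A typical warmup-then-decay schedule is easy to sketch. The shape below (linear warmup, cosine decay) is one common choice among several; the peak and floor values are illustrative placeholders, not recommendations for any particular model.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.02, min_lr=2e-6):
    """Linear warmup to peak_lr over the first warmup_frac of training,
    then cosine decay down to min_lr by the final step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from near zero so early steps cannot wreck existing weights
        return peak_lr * (step + 1) / warmup_steps
    # Cosine curve from peak_lr at end of warmup to min_lr at the last step
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Plotting `lr_at_step` over a run shows the characteristic shape: a short steep ramp, a peak at the end of warmup, then a slow cosine glide to the floor.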

Recent research from EleutherAI found that the combination of learning rate re-warming and re-decaying can match the performance of training from scratch, at a fraction of the compute.

Data replay

During continued pretraining on domain-specific data, you mix in a percentage of the original pretraining data (or a representative subset). This replay gives the model periodic reminders of its general knowledge, preventing it from drifting too far toward the new domain.

The ratio matters. Too much replay wastes budget on data the model already knows. Too little allows forgetting. Common ratios range from 5% to 20% replay data. A 2025 study found that replay can actually improve performance on the target domain (not just prevent forgetting), improving data efficiency by up to 1.87x for fine-tuning and 2.06x for mid-training.
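Mechanically, replay is just a biased draw when assembling the training stream. A minimal sketch, with a 10% replay ratio picked from the common range above:

```python
import random

def replay_mix(domain_docs, replay_docs, n_docs, replay_ratio=0.1, seed=0):
    """Build a training stream that draws each document from the original
    pretraining corpus with probability replay_ratio, and from the new
    domain corpus otherwise."""
    rng = random.Random(seed)  # seeded for a reproducible mix
    stream = []
    for _ in range(n_docs):
        pool = replay_docs if rng.random() < replay_ratio else domain_docs
        stream.append(rng.choice(pool))
    return stream
```

Production pipelines do this with weighted dataset sampling rather than an in-memory list, but the knob is the same single number: the probability of serving an old-distribution document on any given draw.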

Data mixing during original pretraining

A more recent approach (Tirumala et al.) shows that mixing a small amount of domain-relevant data into the original pretraining builds resistance to forgetting during later continued pretraining. Even a few percent of domain data in the pretraining mix makes the model significantly more robust to adaptation. This is complementary to replay and learning rate strategies.

When To Use Continued Pretraining

Continued pretraining makes sense when:

  • The target domain has specialized vocabulary, patterns, or knowledge that a general model lacks. Medicine, law, finance, and specific programming frameworks are common targets.
  • You have large volumes of domain text (hundreds of millions to billions of tokens). If you only have a small labeled dataset, fine-tuning alone may be sufficient.
  • You need the model to understand the domain, not just mimic its format. Fine-tuning can teach a model to format answers as a doctor would. Continued pretraining teaches it to reason about medical knowledge at depth.

Continued pretraining is typically not needed when:

  • The general model already handles the domain well. If GPT-4 already answers your domain questions accurately, the base knowledge is probably sufficient and fine-tuning alone may work.
  • You only have a small amount of domain data. Continued pretraining requires scale. With only a few thousand examples, fine-tuning or RAG is the better approach.
  • Your use case is primarily about output format, not domain knowledge. Teaching a model to produce JSON output or respond in a specific style is a fine-tuning problem, not a continued pretraining problem.

The Practical Economics

Continued pretraining is significantly cheaper than pretraining from scratch, but it is not cheap. Code Llama’s 500 billion tokens of training, while a fraction of Llama 2’s original pretraining, still required substantial GPU time. Cursor’s continued pretraining run on Kimi K2.5 used thousands of NVIDIA GPUs.

For most teams, the economics work like this: continued pretraining costs 5-20% of what full pretraining would cost, but delivers most of the domain-specific gains. The key efficiency comes from not having to re-learn general language understanding, since the base model already has it.

If you are building on an existing API (not training your own model), continued pretraining is handled by the model provider. The decision for you is which model to pick: one that has already been adapted to your domain through continued pretraining, or a general-purpose model that you enhance with RAG or fine-tuning at the application layer.

Key Takeaways

  • Continued pretraining teaches a model what to know. Fine-tuning teaches it how to behave. They solve different problems and are usually done in sequence.
  • The training objective is the same as original pretraining: next-token prediction. The difference is the data, which is narrower and domain-specific.
  • Catastrophic forgetting is the main risk. Learning rate management, data replay, and early data mixing are the standard mitigations.
  • Scale matters. Continued pretraining needs hundreds of millions to billions of tokens of domain data. For smaller datasets, fine-tuning or RAG is the better tool.
  • Most specialized models you use were built this way. Code Llama, Cursor’s Composer 2, medical models like Meditron: all started from a general base and went through continued pretraining before anything else.