Continued Pretraining vs RAG: Two Ways to Add Knowledge
Continued pretraining bakes knowledge into model weights. RAG injects it at query time. When to use each, where each breaks down, and why you often need both.
Fine-tuning and RAG solve different problems: fine-tuning changes how a model behaves, RAG changes what it has access to. But there is a more specific comparison worth understanding: continued pretraining vs. RAG. Both are ways to give a model domain knowledge it does not have. They just store that knowledge in fundamentally different places.
Continued pretraining bakes knowledge into the model’s parameters. RAG retrieves knowledge from an external source and injects it into the prompt at query time. One is permanent and deep. The other is flexible and updateable. The tradeoffs between them shape how production AI systems are built.
Parametric vs. Non-Parametric Knowledge
The core distinction is where the knowledge lives.
Parametric knowledge is stored in the model’s weights. During continued pretraining, the model reads billions of tokens of domain-specific text and adjusts its parameters through next-token prediction. The patterns, terminology, and reasoning structures of the domain become part of the model itself. You do not need to provide this information at query time. The model “just knows” it, the same way it knows that Paris is the capital of France from its original training.
Non-parametric knowledge lives outside the model. In a RAG system, you maintain a knowledge base (documents, databases, APIs), and a retrieval pipeline fetches the relevant pieces before each query. The model receives this context alongside the user’s question and generates from it. The model’s weights are unchanged. Its “knowledge” for any given query depends entirely on what the retrieval system finds.
This is not a small distinction. It affects latency, cost, accuracy, freshness, and how the model reasons.
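The query-time flow of a non-parametric system can be sketched in a few lines. Everything below is illustrative: the documents are toy data, and the bag-of-words cosine similarity stands in for a real embedding model and vector database.

```python
from collections import Counter
from math import sqrt

# Toy non-parametric knowledge store: the documents live outside the model.
DOCS = [
    "The refund window for Acme Pro is 30 days from purchase.",
    "Acme Pro supports SSO via SAML and OIDC.",
    "Support hours are 9am to 5pm UTC on weekdays.",
]

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query; real systems use
    # learned embeddings and approximate nearest-neighbor search.
    qv = _vec(query)
    ranked = sorted(DOCS, key=lambda d: _cosine(qv, _vec(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    # Retrieved text is injected at query time; the weights never change.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the refund window?"))
```

The key property is visible in the structure: the model's "knowledge" for this query is whatever `retrieve` returns, nothing more.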
Where Continued Pretraining Wins
Deep domain reasoning
When a model has been pretrained on medical literature, it does not just know medical facts. It has internalized the patterns of how medical reasoning works: differential diagnosis, the relationship between symptoms and conditions, the structure of clinical decision-making. This is knowledge that is difficult to provide through retrieved chunks.
RAG can supply a relevant passage from a medical textbook. Continued pretraining makes the model think in the domain’s language. The difference shows up most clearly in complex queries that require synthesis across multiple concepts, where retrieved chunks provide fragments but the model needs deep understanding to connect them.
Research from 2025 confirms this: for multi-hop questions requiring reasoning across multiple facts, parametric knowledge from continued pretraining enables more reliable synthesis than retrieved context alone.
Latency-sensitive applications
RAG adds a retrieval step to every query. You embed the query, search the vector database, fetch the matching documents, and insert them into the prompt before the model generates. Depending on the retrieval stack, this adds roughly 100-800ms of latency per query.
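As a back-of-envelope sketch, the added latency is just the sum of the retrieval stages before generation begins. The component numbers below are assumptions for illustration, not benchmarks:

```python
# Rough per-query latency budget for a RAG pipeline.
# These figures are illustrative assumptions, not measurements.
LATENCY_MS = {
    "embed_query": 20,      # one call to the embedding model
    "vector_search": 30,    # ANN search over the index
    "fetch_documents": 50,  # load chunk text from storage
}

def retrieval_overhead_ms(components: dict[str, int]) -> int:
    # Total delay added before the LLM can start generating.
    return sum(components.values())

overhead = retrieval_overhead_ms(LATENCY_MS)
print(f"retrieval adds ~{overhead}ms before generation even starts")
# A continued-pretrained model skips all three stages: its overhead is 0.
```

Even this optimistic budget lands at the bottom of the 100-800ms range; slower embedding models or remote document stores push it toward the top.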
A model with domain knowledge baked in through continued pretraining requires no retrieval step. The query goes directly to the model and the model generates from its parameters. For applications where latency matters (real-time coding assistance, interactive agents, live conversations), this difference is significant.
This is why Cursor built Composer 2 through continued pretraining on software engineering data rather than using RAG. A coding agent that adds 500ms of retrieval to every tool call would break the interactive experience.
High-volume cost efficiency
RAG has a per-query cost: embedding the query, running the vector search, and processing longer prompts (since retrieved context inflates the input). At low volume, this is cheaper than the upfront cost of continued pretraining. At high volume, the math inverts.
If you serve millions of queries per day, the accumulated retrieval and embedding costs add up. A model that has already absorbed the domain knowledge through continued pretraining processes shorter prompts (no retrieved context) and skips the retrieval infrastructure entirely. For Cursor, whose models handle millions of daily code completions and agent actions, parametric knowledge makes the per-query economics work.
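The break-even arithmetic is simple enough to sketch. The dollar figures below are illustrative assumptions, not real prices:

```python
def breakeven_queries(pretrain_cost_usd: float,
                      rag_extra_cost_per_query_usd: float) -> float:
    """Query volume at which a one-time continued pretraining run costs
    less than RAG's accumulated per-query overhead (embedding, search,
    and the longer prompts from injected context)."""
    return pretrain_cost_usd / rag_extra_cost_per_query_usd

# Illustrative numbers: a $50,000 pretraining run vs. an extra
# $0.002 of retrieval plus context tokens on every RAG query.
n = breakeven_queries(50_000, 0.002)
print(f"break-even at {n:,.0f} queries")
print(f"~{n / 1_000_000:,.1f} days at 1M queries/day")
```

With these assumptions the lines cross at 25 million queries, under a month at the volumes described above; at a few thousand queries a day, RAG stays cheaper for decades.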
Where RAG Wins
Frequently changing information
Continued pretraining produces a snapshot. The model learns what was true at training time. If your domain knowledge changes (new product documentation, updated regulations, recent research), the model’s parametric knowledge becomes stale. Updating it means running continued pretraining again, which costs significant compute.
RAG handles this naturally. Update the documents in your knowledge base, re-index, and the next query reflects the new information. No retraining needed. For any domain where the facts change weekly, monthly, or even quarterly, RAG is the only practical choice.
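The update path can be sketched as a re-index operation. The in-memory dict below stands in for a real vector store, and the filename and prices are made up:

```python
# With RAG, "updating the model's knowledge" is just re-indexing.
# A real pipeline would re-embed each document and upsert it into a
# vector database; this dict is an illustrative stand-in.
index: dict[str, str] = {}

def reindex(docs: dict[str, str]) -> None:
    index.clear()
    index.update(docs)  # re-embedding would happen here

reindex({"pricing.md": "Pro plan costs $20/month."})

# A fact changes: edit the document and re-index. No retraining.
reindex({"pricing.md": "Pro plan costs $25/month."})
print(index["pricing.md"])  # the very next query sees the new price
```

The equivalent update to parametric knowledge would mean another full continued-pretraining run.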
Traceability and citations
When a model generates from its parametric knowledge, you cannot trace a specific claim back to a specific source. The knowledge is distributed across millions of parameters, blended with everything else the model learned.
RAG provides a clear chain: this answer was generated using these specific retrieved documents. You can show users the source passages. You can verify the model did not hallucinate. For legal, medical, financial, or compliance-heavy applications where you need to prove where an answer came from, retrieval-based systems are the only option.
Reducing hallucination
Models hallucinate less when they have relevant context in the prompt. RAG systems with good retrieval consistently produce lower hallucination rates than models relying purely on parametric knowledge, even parametric knowledge from continued pretraining. Research shows RAG can reduce hallucination rates from ~12% to ~2% compared to non-retrieval baselines.
The caveat: this depends on retrieval quality. Bad retrieval (wrong documents, incomplete chunks, stale data) can actually make hallucination worse. The improvement is real when the retrieval pipeline works well.
Limited domain data
Continued pretraining requires hundreds of millions to billions of tokens of domain text. If you have a small corpus (a few hundred documents, a product manual, a niche knowledge base), there is not enough data to meaningfully shift the model’s parametric knowledge.
RAG works at any scale. Even a single document can be indexed and retrieved against. You do not need massive data volume for the approach to add value. For smaller knowledge bases, RAG is the practical option simply because there is not enough data to justify continued pretraining.
The Failure Modes
When continued pretraining fails
Outdated knowledge. The model confidently asserts facts that were true at training time but have since changed. There is no mechanism to update individual facts without retraining.
Knowledge boundaries. The model does not know what it does not know. If a query falls outside the domain it was pretrained on, it may still generate a confident-sounding answer from its general knowledge, without signaling that it is outside its specialty.
Cost of iteration. Running continued pretraining is expensive (thousands to tens of thousands of dollars for a meaningful run). If the domain is evolving rapidly, the cost of repeated pretraining becomes prohibitive.
When RAG fails
Multi-hop reasoning. The query requires connecting information from multiple documents. Retrieval returns individual chunks, but the model needs to synthesize across them. If the connection between facts is not explicit in any single chunk, the model may fail to make the connection. Research shows RAG struggles with 15-40% of multi-hop queries.
Retrieval quality ceiling. RAG is only as good as retrieval. If the knowledge base has gaps, or the chunking strategy splits relevant information across chunks, or the embedding model does not capture semantic similarity well, the model gets wrong or partial context and produces wrong answers. No amount of model quality can fix bad retrieval.
Context window limits. Retrieval works by stuffing relevant documents into the prompt. For queries that require broad context (understanding an entire codebase, reasoning about a complex regulatory framework), you may need more context than fits in the model’s window. Continued pretraining encodes this knowledge in parameters, which have no per-query size limit.
The Decision Framework
| Factor | Continued Pretraining | RAG |
|---|---|---|
| Knowledge changes | Rarely or never | Frequently |
| Data volume | Billions of tokens available | Any size |
| Latency requirement | Low latency critical | 100-800ms acceptable |
| Query volume | Millions/day | Moderate |
| Traceability needed | No | Yes |
| Reasoning depth | Complex multi-concept synthesis | Fact lookup and simple reasoning |
| Upfront cost | $10,000-$100,000+ | $1,000-$5,000 |
| Per-query cost | Lower (no retrieval) | Higher (embedding + retrieval + longer prompt) |
| Update cycle | Weeks to months | Minutes to hours |
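The table can be read as a rough decision function. This is a heuristic sketch with illustrative thresholds, not a hard rule:

```python
def choose_approach(knowledge_changes_often: bool,
                    tokens_available: int,
                    needs_citations: bool,
                    latency_critical: bool,
                    queries_per_day: int) -> str:
    """Heuristic encoding of the decision table above.
    The thresholds are illustrative assumptions."""
    if knowledge_changes_often or needs_citations:
        # Parametric knowledge can't be cheaply updated or traced.
        return "rag"
    if tokens_available < 100_000_000:
        # Too little data to meaningfully shift the weights.
        return "rag"
    if latency_critical or queries_per_day >= 1_000_000:
        return "continued_pretraining"
    # Stable domain, enough data, moderate constraints: combine them.
    return "hybrid"

print(choose_approach(knowledge_changes_often=False,
                      tokens_available=2_000_000_000,
                      needs_citations=False,
                      latency_critical=True,
                      queries_per_day=5_000_000))
```

Note that the first two checks are gates, not preferences: frequent change, required citations, or a small corpus each rule out pure continued pretraining regardless of the other factors.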
Using Both
The strongest production systems combine continued pretraining and RAG. This is the pattern behind Cursor’s Composer 2 and most serious domain-specific AI applications.
Continued pretraining gives the model deep domain understanding: the vocabulary, the reasoning patterns, the relationships between concepts. This is the foundation. It means the model can reason about the domain even when no context is retrieved.
RAG gives the model access to specific, current, verifiable facts on top of that foundation. A domain-pretrained model is better at using retrieved context because it already understands the domain. It is more likely to extract the right information from retrieved chunks, less likely to misinterpret domain terminology, and better at synthesizing retrieved information with its existing knowledge.
The hybrid is not just additive. A continued-pretrained model with RAG outperforms either approach alone because the parametric knowledge makes the retrieval more effective.
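The hybrid control flow is straightforward to sketch. Here `retrieve_fn` and `generate_fn` are hypothetical stand-ins for a retrieval pipeline and a domain-pretrained model:

```python
from typing import Callable

def answer(query: str,
           retrieve_fn: Callable[[str], list[str]],
           generate_fn: Callable[[str], str]) -> str:
    """Hybrid sketch: a domain-pretrained model that also consumes
    retrieved context when the knowledge base has something relevant."""
    chunks = retrieve_fn(query)
    if chunks:
        # Ground the answer in specific, current, citable documents.
        prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    else:
        # Fall back to parametric knowledge: the domain-pretrained
        # model can still reason about the domain with no context.
        prompt = query
    return generate_fn(prompt)

# Toy stand-ins to show the control flow (generate_fn echoes its prompt):
out = answer("What changed in v2.1?",
             retrieve_fn=lambda q: ["v2.1 adds SSO support."],
             generate_fn=lambda p: p)
print(out.startswith("Context:"))
```

The division of labor mirrors the section above: retrieval supplies the facts, and the pretrained weights supply the domain understanding needed to use them well.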
Practical Guidance
Start with RAG if you have a knowledge base and need answers from it. RAG is cheaper, faster to set up, and easier to iterate on. Most domain applications should start here.
Add continued pretraining if the model’s domain understanding is the bottleneck, not its access to facts. If you find that the model misinterprets domain terminology, fails to reason about domain-specific concepts, or produces shallow answers despite having good retrieved context, the model lacks domain depth. Continued pretraining addresses that.
Use continued pretraining alone when latency and per-query cost are dominant constraints and the domain knowledge is stable. Coding agents, real-time recommendation systems, and high-throughput classification are common examples.
Use RAG alone when facts change frequently, traceability is required, or the domain data is too small for continued pretraining. Internal documentation systems, customer support, and compliance tools are typical RAG-first applications.
The right answer depends on your constraints. But understanding the difference between knowledge in the weights and knowledge in the context is the foundation for making the right architectural decision.