What Is RAG? Retrieval-Augmented Generation Explained
RAG lets AI models pull in real data before generating a response. Here's how retrieval-augmented generation works, why it matters, and where it breaks down.
Large language models know a lot, but they don’t know your data. They can’t access your company’s internal docs, your latest product specs, or the database you updated five minutes ago. They generate from patterns learned during training. If the answer isn’t in those patterns, the model either guesses or makes something up.
Retrieval-Augmented Generation (RAG) fixes this by adding a retrieval step before generation. Instead of asking the model to answer from memory, you first search a knowledge base for relevant information, then pass that information to the model along with your question. The model generates its response grounded in real, current data.
The Three Stages of RAG
1. Indexing
You take your documents (PDFs, web pages, database records, internal wikis) and split them into chunks. Each chunk gets converted into an embedding: a numerical vector that captures its semantic meaning. These embeddings are stored in a vector database.
This is where most RAG pipelines succeed or fail, and most people don’t realize it. The chunking strategy matters enormously. Split a document every 500 tokens and you might cut a paragraph in half, destroying the context the model needs. Split on section boundaries and you might get chunks that are too large, diluting the signal when the model tries to find the relevant part.
The best chunking strategies are document-aware. A markdown file should chunk at headings. A codebase should chunk at function or class boundaries. A legal document should chunk at clause boundaries. Generic splitting produces generic results.
Overlap between chunks helps. If chunk 1 ends mid-thought and chunk 2 picks it up, the overlap ensures both chunks contain the complete thought. 10-20% overlap is a reasonable starting point.
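The fixed-size-with-overlap approach can be sketched in a few lines. This is a minimal illustration, not a production chunker: it treats whitespace-separated words as "tokens," whereas a real pipeline would use the embedding model's own tokenizer.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=75):
    """Split text into chunks of `chunk_size` tokens with `overlap` tokens
    shared between consecutive chunks. Here a "token" is just a word."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With a 500-token chunk size, a 75-token overlap sits in the 10-20% range suggested above: each chunk repeats the last 75 tokens of the previous one, so a thought cut off at a chunk boundary survives intact in the next chunk.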
2. Retrieval
When a user asks a question, that question also gets embedded. The system searches the vector database for chunks whose embeddings are semantically closest to the question.
This is fundamentally different from keyword search. A question about “employee vacation policy” matches a chunk about “PTO guidelines” even if those exact words never appear. The embedding model learned during training that these concepts are related.
The standard similarity metric is cosine similarity: the cosine of the angle between two vectors in high-dimensional space. Vectors pointing in the same direction (similar meaning) have a cosine similarity near 1. Orthogonal vectors (unrelated meaning) score near 0, and vectors pointing in opposite directions score near −1, though strongly negative scores are rare with text embeddings.
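The metric itself is a one-liner's worth of math: the dot product of the two vectors divided by the product of their magnitudes. A plain-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice you would use NumPy or let the vector database compute this, but the definition is the same. Note that many embedding models return unit-length vectors, in which case the denominator is 1 and cosine similarity reduces to a plain dot product.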
Most RAG systems retrieve 3-5 chunks. Too few and you risk missing relevant information. Too many and you flood the model with noise, wasting context window tokens on irrelevant content.
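Putting the pieces together, retrieval is "score every chunk against the query, keep the top k." The brute-force sketch below assumes embeddings are already L2-normalized so a dot product equals cosine similarity; real vector databases replace the full scan with an approximate nearest-neighbor index.

```python
def retrieve(query_vec, index, k=4):
    """Return the texts of the k chunks most similar to the query.
    `index` is a list of {"text": ..., "embedding": ...} dicts with
    unit-length embeddings, so dot product == cosine similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(index, key=lambda item: dot(query_vec, item["embedding"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]
```

The `k=4` default reflects the 3-5 chunk range above; the structure of `index` is illustrative, not any particular database's schema.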
3. Generation
The retrieved chunks get injected into the prompt as context, and the model generates a response. A typical prompt structure looks like:
System: Answer the user's question using ONLY the provided context.
If the context doesn't contain the answer, say so.
Context:
[chunk 1]
[chunk 2]
[chunk 3]
User: What is our refund policy for enterprise customers?
The system instruction is critical. Without explicit grounding instructions, the model will blend retrieved context with its own training data, which defeats the purpose. You want the model to be a reader, not a guesser.
Why RAG Instead of Alternatives
Without RAG, you have two options for giving models access to your data:
Fine-tuning bakes knowledge into the model’s weights. It’s expensive (hundreds to thousands of dollars per training run), slow (hours to days), and requires retraining whenever your data changes. It’s also better suited for teaching the model a style or behavior than for stuffing it with facts. Fine-tuning a model to know your company’s policies is like memorizing an encyclopedia: expensive, fragile, and outdated the moment something changes.
Context stuffing means pasting your entire knowledge base into the prompt. This hits context window limits quickly. A 50-page document might use 15,000+ tokens, and that’s before your question and the response. Even with models that support 128K or 200K tokens, you pay per token, and models perform worse when relevant information is buried in the middle of a very long context.
RAG is the practical middle ground. You retrieve only the relevant chunks for each specific question, keeping costs low and relevance high. Your knowledge base can be gigabytes in size, but each query only sends a few hundred tokens of context to the model.
Where RAG Actually Shines
Customer support and internal knowledge bases. An employee asks “What’s the process for requesting a hardware upgrade?” The system retrieves the relevant policy document and generates a clear answer citing the specific section. No one has to search through a 200-page employee handbook.
Up-to-date information. Models have a training cutoff. RAG lets them reference data from yesterday, or from five minutes ago if your indexing pipeline is fast enough. A financial analyst can ask about yesterday’s earnings call because the transcript was indexed overnight.
Verifiable responses. Because you know which chunks were retrieved, you can show the user the source material alongside the answer. This makes fact-checking possible and builds trust. “Here’s what the model said, and here’s the exact passage it drew from.”
Domain-specific accuracy. Medical, legal, financial, and technical domains where generic model knowledge isn’t reliable enough. A legal chatbot that retrieves actual contract language is far more trustworthy than one generating legal-sounding text from training data.
Where RAG Breaks Down
Bad Retrieval
This is the number one failure mode. If the retrieval step returns irrelevant chunks, the model generates answers based on irrelevant context. The model doesn’t know the chunks are wrong. It trusts what you give it.
Common causes: poor embeddings (the model doesn’t capture your domain’s semantics well), bad chunking (relevant information split across chunks that individually don’t make sense), or a mismatch between how users ask questions and how documents are written.
The fix is almost always better retrieval, not a better generation model. Switching from GPT-4 to GPT-4o won’t help if you’re feeding it the wrong context.
Hybrid Search
Pure vector search has a blind spot: exact matches. If a user asks about “error code E-4012,” semantic search might return chunks about error handling in general, because the embedding model treats “E-4012” as a minor detail in a broader concept. Keyword search would find it instantly.
Production RAG systems combine both. Vector search handles the “what do you mean?” part. Keyword search handles the “find this exact thing” part. Most vector databases now support hybrid queries that blend both approaches, weighted by a tunable parameter.
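The blending step is usually a weighted sum of the two score sets. A minimal sketch, assuming both score dictionaries are already normalized to the 0-1 range (the function name and `alpha` parameter are illustrative, not a specific database's API):

```python
def hybrid_scores(vector_scores, keyword_scores, alpha=0.7):
    """Blend vector and keyword relevance scores per document ID.
    alpha is the tunable weight: 1.0 = pure vector, 0.0 = pure keyword."""
    doc_ids = set(vector_scores) | set(keyword_scores)
    return {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
                + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
```

A document like the "error code E-4012" chunk may score near zero on vector similarity but 1.0 on keyword match, and the blend still surfaces it. (Some systems use reciprocal rank fusion instead of a weighted sum; the idea is the same.)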
Re-Ranking
Initial retrieval casts a wide net. The top 20 results from vector search are roughly relevant, but the ranking isn’t precise. A cross-encoder re-ranker takes each candidate and the original query as a pair, scoring how well the candidate actually answers the query. This is more expensive per comparison than vector search (which is why you don’t use it for the initial search), but much more accurate for the final selection.
The typical pattern: retrieve 20 candidates with vector search, re-rank them with a cross-encoder, keep the top 3-5.
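That pattern reduces to a few lines once the two stages are abstracted. In this sketch, `search` and `score_pair` are placeholder callables standing in for a vector search and a cross-encoder, not a specific library's API:

```python
def retrieve_and_rerank(query, search, score_pair, wide_k=20, final_k=5):
    """Two-stage retrieval: cheap wide search, then precise re-ranking.
    search(query, k) -> list of candidate docs (approximate, fast);
    score_pair(query, doc) -> relevance score (cross-encoder, slow)."""
    candidates = search(query, k=wide_k)       # stage 1: wide net
    reranked = sorted(candidates,
                      key=lambda doc: score_pair(query, doc),
                      reverse=True)            # stage 2: precise ordering
    return reranked[:final_k]
```

The cost asymmetry is the point: the cross-encoder runs only `wide_k` times per query instead of once per document in the corpus.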
The Model Ignores the Context
Sometimes the model’s training data contains strong opinions about a topic, and it generates from memory instead of the provided context. Ask “What is our company’s stance on remote work?” and the model might generate a generic answer about remote work trends instead of referencing the policy document you retrieved.
The fix is prompt engineering. Explicitly instruct: “Only use the provided context. If the answer is not in the context, say ‘I don’t have that information.’” Some teams add a verification step where a second model call checks whether the response actually references the retrieved chunks.
Stale Indexes
If your data changes but your embeddings don’t get updated, the model answers based on outdated information. This is operationally mundane but critically important. RAG pipelines need a data freshness strategy: re-index on a schedule, trigger re-indexing when documents change, or version your embeddings so you can roll back if something goes wrong.
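One common change-triggered approach is to store a content hash per document and re-embed only what changed. A sketch under that assumption (the data shapes here are illustrative):

```python
import hashlib

def docs_needing_reindex(documents, stored_hashes):
    """Compare each document's current content hash against the hash
    recorded at last indexing time; return the IDs that need re-embedding.
    documents: {doc_id: text}; stored_hashes: {doc_id: sha256 hexdigest}."""
    stale = []
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            stale.append(doc_id)
    return stale
```

New documents (no stored hash) and edited documents (hash mismatch) both come back as stale, so the same check covers additions and updates.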
Production Architecture
A minimal RAG system has four components: a document processor (chunks and embeds), a vector store (stores and searches embeddings), a retrieval layer (converts queries to embeddings and fetches results), and an LLM (generates the final response).
Production systems add layers:
Metadata filtering. Tag chunks with metadata (department, date, access level, document type) and filter before similarity search. A user in engineering shouldn’t get HR-only results.
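As a sketch, the pre-filter is just a predicate applied before any similarity scoring. The field names (`department`, `access_level`) are illustrative; production systems push this filter down into the vector database query rather than filtering in application code:

```python
def filter_chunks(chunks, department=None, max_access_level=None):
    """Keep only chunks whose metadata matches the caller's department
    and clearance. chunks: list of {"text": ..., "metadata": {...}}."""
    def allowed(meta):
        if department is not None and meta.get("department") != department:
            return False
        if max_access_level is not None and meta.get("access_level", 0) > max_access_level:
            return False
        return True
    return [c for c in chunks if allowed(c["metadata"])]
```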
Query transformation. Rewrite the user’s question to improve retrieval. “It’s not working” becomes “troubleshooting common error messages in the API.” This can be done with a lightweight model call before retrieval.
Evaluation pipelines. Measure retrieval quality systematically. Build a test set of 50-100 questions with known correct source documents. Run it after every change to chunking, embedding models, or search parameters. If recall drops, you catch it before users do.
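The core metric for such a test set is recall@k: for what fraction of questions does the known correct document appear in the top k retrieved results? A minimal harness, where `retrieve_fn` stands in for whatever retrieval function is under test:

```python
def recall_at_k(test_set, retrieve_fn, k=5):
    """Fraction of test questions whose known source document appears
    in the top-k retrieved IDs. test_set: [(question, expected_doc_id)];
    retrieve_fn(question, k) -> list of retrieved doc IDs."""
    hits = 0
    for question, expected_doc_id in test_set:
        if expected_doc_id in retrieve_fn(question, k=k):
            hits += 1
    return hits / len(test_set)
```

Run this after every change to chunking, embedding models, or search parameters, and a drop in the number tells you a "harmless" tweak broke retrieval before any user notices.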
Guardrails. Check the model’s response for consistency with the retrieved context. Flag answers that don’t reference the provided chunks. Detect when the model is generating from its training data instead of the context.
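A crude version of that check can be done without a second model call: measure how much of the response's vocabulary actually appears in the retrieved chunks. This word-overlap heuristic is only illustrative; real guardrails typically use an LLM judge or an entailment model, which catch paraphrase that this sketch misses.

```python
def grounding_score(response, chunks):
    """Fraction of the response's words that appear somewhere in the
    retrieved context. Low scores suggest the model answered from its
    training data rather than the provided chunks."""
    response_words = set(response.lower().split())
    context_words = set(" ".join(chunks).lower().split())
    if not response_words:
        return 0.0
    return len(response_words & context_words) / len(response_words)
```

A pipeline might flag any response scoring below a tuned threshold for human review or regeneration.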
The Bigger Picture
RAG is one of the most important patterns in applied AI. It bridges the gap between what models know from training and what they need to know for your specific use case. The people who build effective RAG systems aren’t the ones with the best models. They’re the ones with the best retrieval.
Chapter 2 of Get Insanely Good at AI covers the mechanics of embeddings and vector search in depth, and Chapter 5 walks through building production RAG systems with evaluation, re-ranking, and hybrid search.