How to Build a RAG Application (Step by Step)
A practical walkthrough of building a RAG pipeline from scratch: chunking documents, generating embeddings, storing vectors, retrieving context, and generating grounded answers.
RAG (Retrieval-Augmented Generation) gives LLMs access to your data without fine-tuning. You retrieve relevant chunks from a knowledge base, inject them into the prompt, and the model answers from that context. The architecture is straightforward. The devil is in the details: chunk size, embedding choice, retrieval count, and prompt structure all determine whether your RAG system works or fails.
This walkthrough covers the conceptual steps; you can implement them in any language or framework. For a deeper dive into the patterns and pitfalls, Get Insanely Good at AI covers RAG in production.
Prerequisites: What You Need Before Starting
Before writing a single line of code, you need four things.
Documents. Your knowledge base. PDFs, markdown files, database records, web pages, internal wikis. Whatever format your information lives in. RAG works best when the source material is well-structured and reasonably clean. Garbage in, garbage out still applies.
An embedding model. Something that converts text into vectors. Embeddings capture semantic meaning so you can search by similarity rather than keywords. You’ll use this for both indexing (document chunks) and querying (user questions). The same model must handle both, or the vectors won’t be comparable.
A vector store. A database that stores embeddings and supports similarity search. Options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector. For most starting projects, pgvector is the simplest: it’s a PostgreSQL extension, so you get vector search in a database you probably already use.
An LLM. The model that generates answers from the retrieved context. Any capable model works: GPT-4o, Claude, Llama, Mistral. The model doesn’t need to know your domain. It just needs to follow instructions and answer from the context you provide.
Step 1: Load and Chunk Your Documents
Chunking is where most RAG pipelines succeed or fail. Split too large and you dilute the signal. Split too small and you lose context. Split at arbitrary boundaries and you cut sentences or paragraphs in half.
Chunk size. 200 to 500 tokens is a typical starting point. That’s roughly 150 to 400 words. Smaller chunks give more precise retrieval but risk losing context that spans multiple chunks. Larger chunks preserve context but may include irrelevant material that dilutes similarity scores. Start with 300 tokens and adjust based on your content. Technical docs with dense sections might need 500. Conversational content might work better at 200.
Overlap. 50 to 100 tokens of overlap between adjacent chunks helps. If chunk 1 ends mid-thought and chunk 2 picks it up, the overlap ensures both chunks contain the complete idea. Without overlap, a question about a concept that spans a chunk boundary might only retrieve one half, and the model gets incomplete context. 10 to 20 percent overlap is a reasonable default.
Chunk boundaries matter. Generic splitting (every N characters or tokens) often breaks meaning. A markdown file should chunk at headings. A codebase should chunk at function or class boundaries. A legal document should chunk at clause boundaries. Document-aware chunking preserves semantic units. If your documents have structure (headings, sections, paragraphs), use it. If not, at least split on sentence boundaries rather than mid-word.
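A minimal sketch of sentence-aware chunking with overlap, in Python. Word counts stand in for tokens here (roughly 300 tokens is about 230 words); in production you'd count with a real tokenizer such as tiktoken, and the regex sentence splitter is deliberately naive:

```python
import re

def chunk_text(text, max_words=230, overlap_words=40):
    """Split text into chunks on sentence boundaries, with overlap.

    Word counts approximate tokens; swap in a real tokenizer for
    production use.
    """
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []

    def word_count(parts):
        return sum(len(s.split()) for s in parts)

    for sentence in sentences:
        if current and word_count(current) + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap,
            # so an idea that spans the boundary survives in both chunks.
            tail, carried = [], 0
            for s in reversed(current):
                carried += len(s.split())
                tail.insert(0, s)
                if carried >= overlap_words:
                    break
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For structured documents, you'd split at headings or sections first and only fall back to this sentence-level splitter within a section.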
Step 2: Generate Embeddings
Each chunk becomes a vector. The embedding model takes text as input and outputs a dense numerical representation. Similar meaning produces similar vectors. Your query gets embedded the same way, and you search for chunks whose vectors are closest to the query vector.
Which models to use. OpenAI’s text-embedding-3-small is the default choice for most projects. Good quality, 1,536 dimensions, cheap. Cohere’s embed-v3 is competitive and often better for multilingual content. For open-source, BGE (BAAI General Embedding) models like bge-large-en-v1.5 perform well and run locally. Voyage AI offers strong domain-tuned models if you need specialized embeddings.
Cost considerations. Embedding APIs charge per token. OpenAI’s text-embedding-3-small costs $0.02 per 1M tokens. A 100-page document might be 50,000 tokens. Indexing 1,000 such documents costs about $1. One-time. Query embeddings are tiny (typically 10 to 50 tokens per question). The expensive part of RAG is usually the LLM call, not the embeddings. For high-volume indexing, open-source models eliminate embedding costs entirely.
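The arithmetic above is worth making concrete. A small helper, using the article's figures (the default price matches OpenAI's published text-embedding-3-small rate at the time of writing, but check current pricing before depending on it):

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens=0.02):
    """Estimate one-time indexing cost for an embedding API."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1,000 hundred-page documents at ~50,000 tokens each:
print(embedding_cost_usd(1_000, 50_000))  # 1.0 (one dollar, one time)
```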
Step 3: Store in a Vector Database
Your embeddings need to live somewhere that supports fast similarity search. When a user asks a question, you embed it and find the K nearest chunk vectors in milliseconds.
Options. Pinecone is managed and simple. Weaviate and Qdrant are open-source, self-hosted, or managed. Chroma is lightweight and good for prototyping. pgvector is a PostgreSQL extension: add a vector column, create an index, and you have vector search in your existing database. No new infrastructure. No new billing. For most starting projects, pgvector is the simplest path. You already know PostgreSQL. The extension is well-documented. You can scale to dedicated vector databases later if needed.
Indexing. Create an index on the vector column. Most stores use HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) for approximate nearest neighbor search. Exact search doesn’t scale. Approximate search returns the right results 99 percent of the time in a fraction of the time. Store metadata with each chunk (source document, section, page number) so you can filter and cite later.
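To make the contract concrete, here's a toy in-memory store with exact brute-force cosine search. Real stores replace the linear scan with an HNSW or IVF index, but the interface is the same idea: add vectors with metadata, query by similarity, get the top K back. This is illustrative only, not something to ship:

```python
import math

class TinyVectorStore:
    """In-memory exact nearest-neighbor store (illustrative only)."""

    def __init__(self):
        self.rows = []  # (vector, text, metadata) triples

    def add(self, vector, text, metadata):
        # Metadata (source doc, section, page) travels with the chunk
        # so results can be filtered and cited later.
        self.rows.append((vector, text, metadata))

    @staticmethod
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector, k=4):
        scored = [(self.cosine(query_vector, v), text, meta)
                  for v, text, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

With pgvector the same contract is a `vector` column, an HNSW index, and an `ORDER BY embedding <=> query LIMIT k` query.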
Step 4: Retrieve Relevant Chunks
When a user asks a question, embed it with the same model you used for indexing. Search the vector store for the K nearest chunks. The standard similarity metric is cosine similarity: the cosine of the angle between vectors. Most vector databases handle this automatically. You specify the metric at index creation; queries return results ranked by similarity.
How many chunks to retrieve. 3 to 5 is a common starting point. Too few and you risk missing the answer. Too many and you flood the context with noise, wasting tokens and confusing the model. Start with 4. If answers are incomplete, try 5 or 6. If answers are inconsistent or include irrelevant details, try 3. The right number depends on your chunk size and how focused your documents are.
Re-ranking for quality. Vector similarity isn’t perfect. Sometimes the top result by cosine similarity isn’t the most relevant. A re-ranker model (like Cohere’s rerank or cross-encoder models) takes the query and each candidate chunk and scores relevance more accurately. Retrieve 10 or 20 chunks, re-rank them, keep the top 3 to 5. This adds latency and cost but often improves answer quality significantly. Use it when simple retrieval isn’t good enough.
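The two-stage shape can be sketched as below. Both callables are stand-ins: `search` is your vector store query, and `score` would be a call to a cross-encoder or a rerank API (for example Cohere's) rather than the cheap function shown in the test:

```python
def retrieve_then_rerank(query, search, score, fetch_k=20, keep_k=4):
    """Two-stage retrieval: cast a wide net with vector search, then
    re-score the candidates with a more accurate, more expensive model.

    search(query, k) -> list of candidate chunks
    score(query, chunk) -> relevance score (stand-in for a cross-encoder
    or rerank API call)
    """
    candidates = search(query, fetch_k)
    reranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return reranked[:keep_k]
```

The design point is that the expensive scorer only ever sees `fetch_k` candidates, not the whole corpus, which is what keeps re-ranking affordable.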
Step 5: Augment the Prompt and Generate
The retrieved chunks become context. Your job is to structure the prompt so the model answers from that context and only that context.
Prompt structure. A typical pattern:
System: Answer the user's question using ONLY the provided context below.
If the context does not contain enough information to answer, say "I don't know" or "The provided documents don't contain this information."
Do not use external knowledge. Do not make up facts.
Context:
[chunk 1]
[chunk 2]
[chunk 3]
[chunk 4]
User: [the actual question]
The system instruction is critical. Without explicit grounding, the model blends retrieved context with its training data. It might answer correctly, or it might hallucinate. You want the model to behave as a reader: if the answer isn’t in the context, it should say so. If it is, it should cite it.
Handling “I don’t know.” When retrieval fails (wrong chunks, no relevant documents, question out of scope), the model needs to refuse gracefully. Instruct it explicitly: “If the context doesn’t contain the answer, say so.” Test with questions that have no answer in your knowledge base. A model that confidently hallucinates when it shouldn’t is worse than one that admits ignorance.
Structured output. If you need citations, extracted facts, or a specific format, use structured output (JSON mode, function calling) to constrain the response. The model can return an object with answer and sources fields, which your application parses and renders.
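The prompt pattern above can be assembled mechanically. A sketch assuming OpenAI-style chat messages and the (text, metadata) chunk pairs from earlier steps; the source tags are one way to enable citations, not a fixed convention:

```python
SYSTEM_PROMPT = (
    "Answer the user's question using ONLY the provided context below.\n"
    'If the context does not contain enough information to answer, say "I don\'t know" '
    'or "The provided documents don\'t contain this information."\n'
    "Do not use external knowledge. Do not make up facts."
)

def build_messages(question, chunks):
    """Assemble a grounded chat prompt from retrieved chunks.

    Each chunk is a (text, metadata) pair; tagging each chunk with its
    source lets the model cite where an answer came from.
    """
    context = "\n\n".join(
        f"[source: {meta.get('doc', 'unknown')}]\n{text}" for text, meta in chunks
    )
    return [
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\nContext:\n{context}"},
        {"role": "user", "content": question},
    ]
```

Pass the result to whichever chat completion API you use; the grounding instruction rides along in the system message on every call.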
Evaluation: How to Measure RAG Quality
You can’t improve what you don’t measure. RAG quality breaks down into three dimensions.
Faithfulness. Does the answer stay within the retrieved context? Or does the model invent facts, blend in training knowledge, or contradict the sources? Faithfulness is binary per claim: each factual claim in the answer should be supported by the context. Automated checkers use an LLM to verify this.
Relevance. Did retrieval return the right chunks? A question about refund policy should retrieve chunks about refunds, not shipping or returns in general. Relevance is measured by whether the retrieved chunks actually contain the answer. If they don’t, the model has no chance.
Answer correctness. Given the context, did the model produce the right answer? This is the end-to-end metric. You need a test set of questions with known correct answers. Compare model output to ground truth. Exact match is strict; semantic similarity (embedding-based) is more forgiving.
RAGAS framework. RAGAS (Retrieval Augmented Generation Assessment) automates these metrics. It uses LLMs to score faithfulness and relevance, and can compute answer correctness against a reference. Run it on a curated test set before and after changes. If you change chunk size, embedding model, or retrieval count, re-evaluate. What you don’t measure, you can’t fix.
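To show the shape of such a harness without any framework, here is a deliberately crude sketch: token-overlap F1 as a deterministic stand-in for the semantic-similarity or LLM-judged scoring that RAGAS performs. It only illustrates the loop (test set in, pass rate out), not a serious metric:

```python
def token_f1(predicted, reference):
    """Token-overlap F1: a crude stand-in for semantic similarity."""
    pred, ref = predicted.lower().split(), reference.lower().split()
    common = set(pred) & set(ref)
    if not pred or not ref or not common:
        return 0.0
    precision = len(common) / len(set(pred))
    recall = len(common) / len(set(ref))
    return 2 * precision * recall / (precision + recall)

def evaluate(test_set, answer_fn, threshold=0.5):
    """Run a question set through the pipeline and report the pass rate.

    test_set: list of (question, reference_answer) pairs
    answer_fn: your end-to-end RAG pipeline, question -> answer string
    """
    scores = [token_f1(answer_fn(q), ref) for q, ref in test_set]
    return sum(s >= threshold for s in scores) / len(scores)
```

Run the same harness before and after every change to chunk size, embedding model, or retrieval count; a single pass-rate number makes regressions visible.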
Common Pitfalls
Wrong chunk size. Too large and retrieval returns broad, unfocused chunks. The model has to find the relevant sentence in a 500-token block. Too small and you split ideas across chunks, so no single chunk contains the full answer. Tune based on your content. There’s no universal right answer.
Missing metadata. Store source document, section, and position with each chunk. When the model cites something, you need to show the user where it came from. Without metadata, you can’t build “View source” or “See original document” features. You also can’t filter retrieval by document type, date, or category.
Not handling “I don’t know” cases. Models default to answering. If the context is empty or irrelevant, they’ll guess. Explicit instructions to refuse when the answer isn’t in the context reduce hallucinations. Test with out-of-scope questions. If your model confidently answers “What is our policy on Mars colonies?” when you have no such policy, fix the prompt.
Stuffing too many chunks into context. More context isn’t always better. Beyond 5 to 7 chunks, you add noise. The model’s attention dilutes. Irrelevant chunks can steer the answer wrong. Retrieve fewer, higher-quality chunks. Use re-ranking if simple retrieval isn’t selective enough.
RAG is a pipeline. Each step affects the next. Bad chunking hurts retrieval. Bad retrieval hurts generation. Get the fundamentals right before optimizing. Start simple: 300-token chunks, 50-token overlap, 4 chunks per query, explicit grounding instructions. Measure with a small test set. Iterate from there. For more on building production RAG systems and when to choose RAG over fine-tuning, see Get Insanely Good at AI.