AI Engineering · 6 min read

Context Windows Explained: Why Your AI Forgets

Context windows determine how much an AI model can “see” at once. Here’s what they are technically, how attention scales, and practical strategies for working within their limits.

You’re in the middle of a long conversation with an AI model. You reference something you said 20 messages ago. The model has no idea what you’re talking about. It’s not broken. It’s out of context.

Every AI model has a context window: a fixed limit on how much text it can process at once. Everything inside that window (your prompt, the conversation history, the system instructions, and the model’s own response) has to fit. Anything beyond it is invisible to the model. It doesn’t “remember” it. It never sees it.

What a Context Window Actually Is

Technically, the context window is the maximum sequence length the model’s transformer architecture can process. During training, the model learns positional encodings for token positions up to this limit. Tokens beyond the window literally have no positional representation, so the model can’t process them.

Context is measured in tokens, not words or characters. Tokenization is model-specific, but a rough estimate is 1 token per 3-4 English characters. “Hello world” is 2-3 tokens. “Retrieval-Augmented” becomes 3 tokens (the hyphenated compound gets split). A page of prose is roughly 400-500 tokens. A page of dense code might be 300-600, depending on how many common programming keywords appear (common tokens like function, return, and const each get a single token, while unusual variable names get split).
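The rule of thumb above is easy to turn into a back-of-the-envelope estimator. This is a rough heuristic, not a real tokenizer (the function name and the 4-characters-per-token ratio are illustrative assumptions); for exact counts, use your model provider’s own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 English characters per token.

    Real tokenizers are model-specific; this is only a ballpark figure.
    """
    return max(1, len(text) // 4)

print(estimate_tokens("Hello world"))  # a short phrase is a few tokens
print(estimate_tokens("x" * 2000))     # a ~2,000-character page lands near 500 tokens
```

This is good enough for budgeting decisions (“will this fit?”) but not for billing or hard limits.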

Current context window sizes vary dramatically:

  • GPT-4o: 128K tokens (~250 pages)
  • Claude 3.5 Sonnet: 200K tokens (~400 pages)
  • Gemini 1.5 Pro: up to 2M tokens (~4,000 pages)
  • Llama 3: 8K-128K tokens depending on the variant

Bigger isn’t always better, for reasons we’ll get into.

Why Context Windows Have Limits: The Attention Cost

The transformer’s self-attention mechanism is what gives models their power, and what creates the context limit. Self-attention compares every token to every other token to understand relationships. For a sequence of n tokens, that’s n² comparisons: the cost scales quadratically with sequence length.

Double the context length and you quadruple the compute. A 4K context requires 16 million attention operations. A 128K context requires 16 billion. A 2M context (Gemini’s max) requires 4 trillion.
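The numbers above fall straight out of the quadratic scaling. A minimal sketch (the helper name is my own):

```python
def attention_comparisons(context_tokens: int) -> int:
    """Self-attention compares every token to every other: n * n operations."""
    return context_tokens * context_tokens

for n in (4_000, 128_000, 2_000_000):
    print(f"{n:>9,} tokens -> {attention_comparisons(n):,} comparisons")
```

Doubling n always quadruples the count, which is why naive attention hits a wall long before 2M tokens.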

Models that support very long contexts use architectural optimizations to make this tractable: FlashAttention reduces memory overhead by computing attention in blocks, sliding window attention limits how far each token looks back, and sparse attention patterns skip certain comparisons entirely. These optimizations make long contexts possible, but they don’t make them free. Longer contexts are slower, more expensive, and (crucially) lower quality.

The “Lost in the Middle” Problem

Research by Liu et al. (2023) demonstrated something counterintuitive: models perform best on information at the beginning and end of the context. Information buried in the middle gets significantly less attention.

In their experiments, when a relevant passage was placed at position 1 out of 20 documents, models found it reliably. At position 10 (the middle), accuracy dropped substantially. At position 20 (the end), it recovered.

This has practical implications. If you paste a 50-page document into a prompt and the answer is on page 25, the model may not find it, even though it’s technically within the context window. The model “sees” the tokens, but attention is distributed unevenly.

You can test this yourself. Hide a specific fact (like a code phrase) at different positions in a long prompt surrounded by filler text. The model finds it easily at the top and bottom, and struggles in the middle.
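Here is a sketch of how you might construct such a test prompt. The filler text and code phrase are placeholders, and the model call itself is omitted because the client API depends on your provider:

```python
def build_needle_prompt(needle: str, position: int, total_paragraphs: int) -> str:
    """Bury `needle` at a given paragraph position inside filler text."""
    filler = "This paragraph is unrelated filler text for the experiment."
    paragraphs = [filler] * total_paragraphs
    paragraphs[position] = needle
    question = "What is the secret code phrase mentioned above?"
    return "\n\n".join(paragraphs + [question])

# Place the needle at the start, middle, and end, then send each prompt
# to your model and check whether the reply contains the phrase.
needle = "The secret code phrase is 'blue harvest'."
prompts = [build_needle_prompt(needle, pos, 20) for pos in (0, 10, 19)]
```

With enough filler to fill a meaningful fraction of the window, the middle position typically shows the worst recall.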

Why Conversations Degrade

In a chat interface, every message (yours and the model’s) accumulates in the context. A typical chat turn might be 200-500 tokens. After 50 turns, that’s 10K-25K tokens just for history. Add a system prompt and you’re eating significant context budget before the user even asks their next question.

When the context fills up, something has to go. Most chat systems silently drop the oldest messages. That’s why the model “forgets” your earlier instructions. The system prompt you carefully crafted? It might still be there. The constraints you established in message #3? Gone.
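A minimal sketch of that truncation logic, assuming messages are (role, text) tuples and reusing the rough 4-characters-per-token estimate (both are illustrative assumptions, not any specific framework’s API):

```python
def truncate_history(system_prompt, messages, budget_tokens):
    """Keep the system prompt, then drop the OLDEST messages until we fit.

    `messages` is a list of (role, text) tuples, oldest first.
    Token counts use a rough 4-chars-per-token estimate.
    """
    est = lambda text: max(1, len(text) // 4)
    used = est(system_prompt)  # the system prompt is always retained
    kept = []
    # Walk newest-to-oldest, keeping messages while the budget allows.
    for role, text in reversed(messages):
        cost = est(text)
        if used + cost > budget_tokens:
            break
        kept.append((role, text))
        used += cost
    kept.reverse()
    return kept
```

Note what falls out first: the constraints you established in message #3. Real systems often pin important early messages instead of dropping them blindly.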

This is also why you sometimes notice a conversation getting “dumber” over time. The model isn’t degrading. It’s losing the context that made its early responses good.

Strategies That Actually Work

Be Ruthless About What You Include

Don’t paste entire files when a function will do. Don’t include the full conversation history when the last 3 messages contain everything the model needs. Every token you include is a token that competes for the model’s attention. More isn’t better. More relevant is better.

Front-Load Critical Information

Put your most important instructions and data at the beginning of the prompt. Don’t bury the key requirement on page 5 of a 10-page paste. If you have a system prompt and a long user message, the system prompt gets the privileged first position. Use it for your most important constraints.

Summarize Long Conversations

If you’re in a long chat, periodically ask the model to summarize the conversation. Then start a new conversation with that summary as the opening context. You lose nuance but retain key decisions and constraints. Some applications do this automatically every N turns.
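The trigger logic for this can be sketched without committing to any particular API; `summarize` below is a placeholder for a real model call, and the thresholds are arbitrary:

```python
def maybe_compress(messages, summarize, every_n_turns=20, keep_recent=5):
    """Once the history reaches N turns, collapse older messages into a summary.

    `summarize` is a placeholder for a real model call that turns a
    list of (role, text) messages into a short summary string.
    """
    if len(messages) < every_n_turns:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [("system", f"Summary of earlier conversation: {summary}")] + recent
```

Keeping the last few raw messages alongside the summary preserves local coherence while the summary preserves earlier decisions.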

Use Retrieval Instead of Stuffing

This is where RAG comes in. Instead of pasting 50 pages into the prompt, use an embedding-based retrieval system to find the 3 most relevant paragraphs and include only those. You get better results with fewer tokens, because the model’s attention is focused on high-relevance content instead of spread across a massive document.
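A minimal retrieval sketch, scoring precomputed embeddings by cosine similarity. How the vectors are produced is out of scope here; assume an embedding model has already turned each paragraph into a vector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, docs, k=3):
    """Return the k documents whose embeddings are closest to the query."""
    scored = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]
```

Only the top-k passages go into the prompt, so the model’s attention is concentrated on a few thousand high-relevance tokens instead of a 50-page dump.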

Map-Reduce for Large Documents

When you need to process something bigger than the context window, break it into chunks, process each chunk independently, then combine the results. Summarize each chapter of a book separately, then synthesize the chapter summaries into an overall summary. This works well for tasks like summarization, extraction, and analysis.
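A skeleton of that map-reduce pattern; `summarize` is again a placeholder for a model call, and the character-based chunk size is an assumption (real implementations usually chunk on chapter or section boundaries):

```python
def chunk_text(text, chunk_chars=8000):
    """Split text into fixed-size character chunks (roughly 2K tokens each)."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def map_reduce_summary(document, summarize):
    # Map: summarize each chunk independently, so each call fits in the window.
    partials = [summarize(chunk) for chunk in chunk_text(document)]
    # Reduce: synthesize the partial summaries into one final summary.
    return summarize("\n".join(partials))
```

If the partial summaries themselves exceed the window, apply the reduce step recursively.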

The Hidden Cost of Large Context Windows

Longer contexts cost more money and time. API pricing is per-token, so a 100K-token prompt costs roughly 25x as much as a 4K-token prompt. Latency increases too, because the model processes all input tokens through its attention layers before generating the first output token. A 100K input might take 10-15 seconds before you see any response.
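The 25x ratio is just flat per-token pricing applied to input length. The rate below is illustrative, not any provider’s actual price:

```python
def prompt_cost(input_tokens, usd_per_million=3.0):
    """Input cost at a flat per-token price (illustrative rate, not a real quote)."""
    return input_tokens / 1_000_000 * usd_per_million

small, large = prompt_cost(4_000), prompt_cost(100_000)
print(f"4K prompt:   ${small:.4f}")
print(f"100K prompt: ${large:.4f}  ({large / small:.0f}x)")
```

The ratio is independent of the rate: any per-token price makes a 100K prompt 25x the cost of a 4K one.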

A well-crafted 2,000-token prompt often outperforms a 50,000-token dump. The skill is in choosing what to include, not in including everything. Senior engineers working with AI know this: the best results come from precise, focused context, not from throwing everything at the model and hoping it finds what it needs.

What This Means Going Forward

Context windows are getting larger every model generation. But the fundamental constraints remain: attention quality degrades with length, cost scales linearly (at best), and the “lost in the middle” problem persists. Understanding these mechanics shapes how you design prompts, build applications, and architect AI systems.

Chapter 2 of Get Insanely Good at AI explains the transformer attention mechanics behind context windows, why position matters, and practical strategies for production systems that handle large documents reliably.