
How to Add Memory to AI Agents

AI agents without memory forget everything between turns. Here's how to implement conversation buffers, sliding windows, summary memory, and vector-backed long-term recall.

An LLM has no memory. Every API call starts from zero. The model doesn’t remember what you said five seconds ago unless you explicitly include it in the next request. This is fine for single-turn tasks, but agents that take actions over multiple steps, maintain context across conversations, or learn from past interactions need memory systems built around them.

Memory is what turns a stateless text generator into something that behaves like an assistant. Here’s how the different approaches work and when to use each one.

Conversation Buffer Memory

The simplest form of memory: append every message to a list and send the full list with each request.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex."},
    {"role": "user", "content": "What's my name?"},
]

The model sees the entire history and can reference anything said earlier. This works perfectly for short conversations. The problem is context windows. Every model has a token limit. A 128K context window sounds large, but a long conversation with tool calls, retrieved documents, and system prompts fills it faster than you’d expect.
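Before relying on a buffer, it helps to know roughly when it will overflow. A minimal sketch, assuming a ~4-characters-per-token heuristic (real counts come from the provider's tokenizer, e.g. tiktoken for OpenAI models); the 80% headroom threshold is an arbitrary choice:

```python
# Rough buffer-size check. The 4-characters-per-token ratio is a
# common heuristic, not an exact count; use your provider's
# tokenizer for real budgeting.
def estimate_tokens(messages):
    total_chars = sum(len(m["content"]) for m in messages)
    return total_chars // 4

def buffer_near_limit(messages, context_window=128_000, headroom=0.8):
    # Flag the buffer once it uses ~80% of the window, leaving room
    # for the response and any retrieved documents.
    return estimate_tokens(messages) > context_window * headroom
```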

Use when: conversations are short (under 20 turns), context window is large enough, and you need perfect recall of everything said.

Sliding Window Memory

Keep only the last N turns of conversation. Older messages get dropped.

WINDOW_SIZE = 10

def get_messages(history, system_prompt):
    return [
        {"role": "system", "content": system_prompt},
        *history[-WINDOW_SIZE:]
    ]

This bounds your context usage to a predictable size. The tradeoff is obvious: the agent forgets anything beyond the window. If the user mentioned their name 15 turns ago, it’s gone.
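A variant worth knowing: window by estimated tokens rather than turn count, so a few very long messages can't blow the budget. A sketch, with the budget and the chars-to-tokens ratio as assumptions:

```python
# Sketch: trim oldest turns until the history fits an estimated
# token budget, instead of keeping a fixed number of turns.
def window_by_tokens(history, budget_tokens=4000):
    kept, used = [], 0
    for msg in reversed(history):        # walk from newest to oldest
        cost = len(msg["content"]) // 4  # rough chars-to-tokens heuristic
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order
```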

Use when: you need predictable context usage and the conversation is primarily about recent context. Good for customer support bots where older turns are usually irrelevant.

Summary Memory

Instead of dropping old messages, summarize them. Keep a running summary of the conversation so far, plus the most recent turns in full.

The approach: periodically ask the model to compress the conversation history into a summary, then replace the old messages with that summary. New messages accumulate until the next compression cycle.

async def compress_history(messages):
    transcript = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)
    summary_prompt = (
        "Summarize this conversation in 2-3 sentences, "
        "preserving key facts and decisions:\n\n" + transcript
    )
    summary = await llm.generate(summary_prompt)
    return {"role": "system", "content": f"Conversation summary: {summary}"}

This preserves the gist of the conversation within a bounded token budget. The tradeoff is lossy compression. Details get dropped. Exact numbers, specific requests, and nuanced preferences can disappear in summarization.
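Wiring the compression step into the conversation loop might look like the sketch below; `summarize` is a synchronous stand-in for the LLM call, and both thresholds are arbitrary:

```python
MAX_TURNS_BEFORE_COMPRESS = 20  # arbitrary threshold for this sketch
KEEP_RECENT = 6                 # newest turns always kept verbatim

def summarize(messages):
    # Stand-in for the LLM call; a real implementation would send
    # the transcript to the model, as compress_history does.
    return f"Summary of {len(messages)} earlier messages."

def maybe_compress(history):
    # Below the threshold, leave the history untouched.
    if len(history) <= MAX_TURNS_BEFORE_COMPRESS:
        return history
    # Replace everything but the most recent turns with one summary.
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary_msg = {
        "role": "system",
        "content": f"Conversation summary: {summarize(old)}",
    }
    return [summary_msg] + recent
```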

Use when: conversations are long, you need some memory of earlier context, and approximate recall is acceptable.

Vector-Backed Long-Term Memory

For agents that need to remember information across sessions or recall specific facts from deep in a conversation history, vector databases provide long-term memory.

The pattern:

  1. After each exchange, embed key information (facts, preferences, decisions) and store the embeddings in a vector database alongside the original text.
  2. Before each new response, embed the current query and search the vector store for relevant past information.
  3. Include the retrieved memories in the prompt as additional context.

async def remember(text, user_id):
    embedding = await embed(text)
    await vector_db.upsert(
        id=generate_id(),
        vector=embedding,
        metadata={"user_id": user_id, "timestamp": now(), "text": text}
    )

async def recall(query, user_id, top_k=5):
    embedding = await embed(query)
    results = await vector_db.query(
        vector=embedding,
        filter={"user_id": user_id},
        top_k=top_k
    )
    return [r.metadata["text"] for r in results]

This scales to unlimited history. The agent can recall a user preference mentioned months ago if it’s semantically relevant to the current query. Storage cost is minimal compared to LLM API costs.

The challenge is deciding what to remember. Storing every message creates noise. Storing too little misses important context. A practical middle ground: extract and store structured facts (“User prefers metric units”, “Budget is $5,000”, “Deployment target is AWS”) rather than raw conversation turns.
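A minimal sketch of that middle ground: key each stored fact by a subject so an updated value replaces the stale one instead of accumulating beside it. The tuple-key scheme here is illustrative, not a fixed schema:

```python
# Sketch: store facts keyed by (user, subject) so updates overwrite
# stale values instead of piling up alongside them.
facts = {}  # (user_id, subject) -> fact text

def store_fact(user_id, subject, text):
    facts[(user_id, subject)] = text  # newest value wins

store_fact("u1", "units", "User prefers metric units")
store_fact("u1", "budget", "Budget is $5,000")
store_fact("u1", "budget", "Budget is $10,000")  # replaces the old value
```

In a vector-backed store, the same idea applies: make the subject part of the record's ID so an upsert overwrites rather than duplicates.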

Use when: agents need cross-session memory, the information to recall is factual and specific, and the history is too large for any context-window-based approach.

Entity Memory

A specialized form of structured memory that tracks information about specific entities (people, projects, products) mentioned in conversations.

Instead of storing raw text, maintain a key-value store of entity profiles:

entities = {
    "Alex": {
        "role": "Engineering Manager",
        "team_size": 12,
        "preferences": ["prefers Slack over email", "uses metric units"],
        "last_updated": "2026-03-15"
    }
}

Update these profiles after each conversation. Include relevant entity profiles in the prompt when those entities come up. This gives the agent structured, always-current knowledge about the entities it works with.
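A sketch of such a profile update, assuming the field names from the example above; merge logic like this keeps scalar fields current while accumulating list-valued ones:

```python
# Sketch of a merge-style profile update. Field names are
# illustrative, matching the Alex example, not a fixed schema.
def update_entity(entities, name, updates):
    profile = entities.setdefault(name, {"preferences": []})
    for key, value in updates.items():
        if key == "preferences":
            # Append new preferences without duplicating existing ones.
            for pref in value:
                if pref not in profile["preferences"]:
                    profile["preferences"].append(pref)
        else:
            profile[key] = value  # scalar fields: newest value wins
    return profile

entities = {}
update_entity(entities, "Alex", {"role": "Engineering Manager"})
update_entity(entities, "Alex", {"preferences": ["prefers Slack over email"]})
```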

Use when: the agent interacts with a defined set of entities (team members, customers, projects) and needs consistent, up-to-date knowledge about each one.

Choosing the Right Memory Type

Most production agents combine multiple memory types:

| Memory type | Recall quality | Token cost | Complexity | Best for |
| --- | --- | --- | --- | --- |
| Buffer | Perfect | Grows linearly | Trivial | Short conversations |
| Sliding window | Recent only | Fixed | Low | Support bots, task-focused agents |
| Summary | Approximate | Bounded | Medium | Long conversations, general assistants |
| Vector-backed | Semantic match | Per-query retrieval cost | High | Cross-session recall, knowledge-heavy agents |
| Entity | Structured facts | Fixed per entity | Medium | CRM-style agents, personalization |

A common production pattern: sliding window for the current conversation, vector-backed memory for cross-session recall, and entity memory for key profiles. The system prompt assembles context from all three before each model call.

Assembling Memory in Practice

When you combine multiple memory types, the prompt assembly step is where everything comes together. A typical pattern looks like this:

async def build_messages(user_message, user_id, conversation_history):
    system_parts = [BASE_SYSTEM_PROMPT]

    # Entity memory: include relevant profiles
    mentioned_entities = extract_entities(user_message)
    for entity in mentioned_entities:
        if entity in entity_store:
            system_parts.append(f"Known info about {entity}: {entity_store[entity]}")

    # Vector memory: retrieve relevant past context
    memories = await recall(user_message, user_id, top_k=3)
    if memories:
        system_parts.append("Relevant past context:\n" + "\n".join(memories))

    return [
        {"role": "system", "content": "\n\n".join(system_parts)},
        *conversation_history[-WINDOW_SIZE:],  # Sliding window
        {"role": "user", "content": user_message}
    ]

The order matters. System prompt first, then retrieved memories, then recent conversation, then the current message. This puts the most immediately relevant context closest to the generation point, which is where the model pays the most attention.

Memory Decay and Maintenance

Long-term memory accumulates noise over time. Preferences change, facts become outdated, and old context stops being useful. Without maintenance, your vector store fills with stale information that competes with current context during retrieval.

Strategies for keeping memory useful:

  • TTL (time-to-live): Automatically expire memories older than a threshold. Good for facts that change frequently (project status, active priorities).
  • Recency weighting: Blend semantic similarity with timestamp recency when ranking retrieved memories. A slightly less similar but recent memory often beats a highly similar but year-old one.
  • Explicit overwrite: When the agent detects updated information (“Actually, my budget is now $10,000”), replace the old memory rather than adding a new one alongside it.
  • Usage tracking: Track which memories the agent actually uses. Memories that are never retrieved after 90 days are candidates for archival.
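Recency weighting, for example, can be sketched as similarity multiplied by an exponential age decay; the 30-day half-life is an arbitrary assumption, and the record shape here is illustrative:

```python
import time

# Sketch of recency weighting: blend the vector store's similarity
# score with an exponential age decay. A memory's score halves every
# HALF_LIFE_DAYS, so old-but-similar memories lose to fresh ones.
HALF_LIFE_DAYS = 30.0

def rank_memories(results, now=None):
    now = now if now is not None else time.time()

    def score(r):
        age_days = (now - r["timestamp"]) / 86400
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
        return r["similarity"] * decay

    return sorted(results, key=score, reverse=True)
```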

For more on how these memory patterns fit into the broader agent architecture, see What Are AI Agents and How Do They Work? and the framework comparison in AI Agent Frameworks Compared.

The right memory system depends on your agent’s job. A coding assistant that works within a single session needs a buffer. A personal assistant that remembers your preferences across months needs vector-backed memory. Match the memory type to the recall pattern your users actually need.
