Learning Objectives

By the end of this chapter, you will be able to:

Explain the agentic AI concept behind Memory Systems.
Apply Memory Systems to design reliable, production-grade agent systems.
Recognize operational trade-offs in tool use, orchestration, safety, and cost.

Section 2 — Core Building Blocks

Chapter 7: Memory Systems

Four memory types, temporal knowledge graphs, and hybrid retrieval

Memory is Not the Context Window

The most common misconception in agent design is treating the context window as memory. The context window is a fixed-size working buffer. Memory is a set of external storage systems that selectively surface relevant information into that buffer on demand.

Why context ≠ memory

Context windows are expensive: GPT-4o bills per token, so a turn that carries 10K tokens of conversation history costs about five times as much in input tokens as one that carries only the 2K tokens that are relevant.
Context windows are ephemeral: their contents disappear when the session ends.
Context windows degrade at long range: models recall facts from the middle of a very long context measurably worse than facts near the start or end (the "lost in the middle" problem).

The goal of a memory system is to decide what is worth putting in the context window for a given query, and to persist information across sessions.
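The selection step described above can be pictured as packing the most relevant memories into a fixed token budget. The following is a minimal sketch under stated assumptions: `Candidate`, `rough_token_count`, and `select_for_context` are hypothetical names, relevance scores are assumed to come from an upstream retriever, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # assumed to be produced by an upstream retriever


def rough_token_count(text: str) -> int:
    # Crude approximation: one token per whitespace-separated word.
    return len(text.split())


def select_for_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit in the budget."""
    chosen: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = rough_token_count(cand.text)
        if used + cost <= budget:
            chosen.append(cand)
            used += cost
    return chosen


memories = [
    Candidate("Alice prefers Python for backend work", 0.92),
    Candidate("The weather was nice during the kickoff call", 0.10),
    Candidate("API Refactor project completed on 2026-04-01", 0.85),
]
picked = select_for_context(memories, budget=12)
for cand in picked:
    print(cand.text)
```

With a budget of 12 approximate tokens, the two high-relevance entries fit and the low-relevance one is left out. A production system would likely swap the word count for the model's own tokenizer.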

Four Memory Types

1. Working Memory: In-context State
   The active conversation messages. Sub-10ms access. Used for the current task. Cleared when the session ends.

2. Episodic Memory: What Happened
   Timestamped log of past interactions, tool calls, and outcomes. Supports temporal queries: "what did we do last Tuesday?" Stored in a relational DB or time-series store.

3. Semantic Memory: What I Know
   Facts and knowledge stored as vector embeddings. Supports similarity search: "what do I know about topic X?" Stored in a vector DB (pgvector, Chroma, Qdrant).

4. Procedural Memory: How to Do Things
   Stored tool schemas, system prompt templates, workflow patterns, and learned skills. Consulted when the agent needs to know its own capabilities.
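To make the episodic type concrete, here is a minimal sketch of a timestamped log that answers a "what did we do last Tuesday?" style query. It assumes an in-memory SQLite database from the standard library; the table name `episodes`, its columns, and the helper functions are illustrative choices, not a prescribed schema.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE episodes (
        id INTEGER PRIMARY KEY,
        ts TEXT NOT NULL,      -- ISO 8601 UTC timestamp; sorts correctly as text
        kind TEXT NOT NULL,    -- e.g. 'message', 'tool_call', 'outcome'
        content TEXT NOT NULL
    )
    """
)


def log_episode(ts: datetime, kind: str, content: str) -> None:
    conn.execute(
        "INSERT INTO episodes (ts, kind, content) VALUES (?, ?, ?)",
        (ts.isoformat(timespec="seconds"), kind, content),
    )


def episodes_between(start: datetime, end: datetime) -> list[tuple[str, str]]:
    """Return (kind, content) for episodes with start <= ts < end, oldest first."""
    rows = conn.execute(
        "SELECT kind, content FROM episodes WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds")),
    )
    return rows.fetchall()


utc = timezone.utc
log_episode(datetime(2026, 3, 30, 15, 0, tzinfo=utc), "message", "Alice asked about the auth module")
log_episode(datetime(2026, 3, 31, 9, 0, tzinfo=utc), "tool_call", "run_tests(project='api-refactor')")
log_episode(datetime(2026, 3, 31, 9, 5, tzinfo=utc), "outcome", "2 tests failed")
log_episode(datetime(2026, 4, 1, 10, 0, tzinfo=utc), "message", "Refactor marked complete")

# Everything that happened on Tuesday 2026-03-31 (UTC).
tuesday = episodes_between(datetime(2026, 3, 31, tzinfo=utc), datetime(2026, 4, 1, tzinfo=utc))
for kind, content in tuesday:
    print(kind, content)
```

Storing every timestamp in UTC with the same ISO 8601 layout is what lets a plain text comparison serve as a time-range filter here; a dedicated time-series store would offer the same query shape with better scaling.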
                            

Temporal Knowledge Graphs

Standard vector databases only answer "what is similar to this query?" A temporal knowledge graph (e.g. Zep/Graphiti) also answers "how did things change over time?" and "what relationships exist between entities?" It combines semantic search with entity extraction, relationship modeling, and time-range filtering.

Temporal Knowledge Graph Structure

User: Alice --prefers--> Python
  ↓ worked_on
Project: API Refactor --completed--> 2026-04-01
  ↓ related_to
FastAPI migration --blocked_by--> Auth module

Edges carry timestamps, which supports queries such as "what was true before/after date X?"

Zep reports 94.8% accuracy for Graphiti on the Deep Memory Retrieval (DMR) benchmark. The reported improvement over flat vector search baselines is attributed to graph traversal, which chains related facts together rather than returning independent document chunks.

Retrieval Strategies

Query ("Tell me about Alice's recent work") → Embed (dense vector, e.g., text-embedding-3) → Hybrid Search (dense + BM25 + rerank) → Top-K Chunks (injected into context)

Dense vs Sparse vs Hybrid

Strategy	Mechanism	Best For	Weakness
Dense (vector)	Cosine similarity on embeddings	Semantic similarity, paraphrases	Misses exact keyword matches
Sparse (BM25)	Keyword scoring from term frequency and inverse document frequency	Exact terms, codes, IDs	Misses semantic similarity
Hybrid	Dense + sparse, score fusion	General purpose — best recall	More complex pipeline, higher latency
Reranking	Cross-encoder re-scores top-K	Precision on top-1 result	Added latency; requires 2nd model call
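The fusion step in the hybrid row can be done in several ways. One common option is reciprocal rank fusion (RRF), which combines the rank positions from each retriever rather than their raw scores, so dense and sparse scores never need to be put on the same scale. The sketch below assumes two already-ranked lists of document IDs; the IDs are hypothetical and `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each hit adds 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search order
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, which is the recall benefit the table attributes to hybrid search. A cross-encoder reranker, if used, would then re-score only the first few fused results.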

Delta compression for multi-turn agents

RetainDB reports that delta compression of episodic memory achieves 50–90% token savings in multi-turn scenarios by storing only what changed between turns rather than the full conversation state. This is especially valuable for long-running agent sessions, where memory accumulates quickly.
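The general idea of storing only what changed between turns can be sketched with plain dictionaries. This is an illustration of delta encoding in general, not RetainDB's implementation: `diff_state`, `apply_delta`, and the `_DELETED` tombstone are hypothetical, and the agent state is assumed to be a flat dictionary.

```python
from typing import Any

_DELETED = "__deleted__"  # hypothetical tombstone marking a removed key


def diff_state(prev: dict[str, Any], curr: dict[str, Any]) -> dict[str, Any]:
    """Return only the keys that were added, changed, or removed since `prev`."""
    delta = {k: v for k, v in curr.items() if k not in prev or prev[k] != v}
    delta.update({k: _DELETED for k in prev if k not in curr})
    return delta


def apply_delta(state: dict[str, Any], delta: dict[str, Any]) -> dict[str, Any]:
    """Rebuild the next state from the previous state plus a delta."""
    nxt = dict(state)
    for key, value in delta.items():
        if value == _DELETED:
            nxt.pop(key, None)
        else:
            nxt[key] = value
    return nxt


turns = [
    {"user": "Alice", "project": "API Refactor", "status": "planning"},
    {"user": "Alice", "project": "API Refactor", "status": "blocked", "blocker": "Auth module"},
    {"user": "Alice", "project": "API Refactor", "status": "done"},
]

# Keep the first state in full, then one delta per later turn.
deltas = [diff_state(a, b) for a, b in zip(turns, turns[1:])]

# Replaying the deltas in order reproduces the final state.
state = turns[0]
for delta in deltas:
    state = apply_delta(state, delta)
print(deltas)
```

Each delta carries only the fields that moved, while the unchanged `user` and `project` fields are stored once. How much this saves in practice depends on how much of the state actually changes per turn.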

Memory System Implementation

python — pluggable memory backend for an agent

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import os
import uuid


def _utc_now() -> datetime:
    """Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)."""
    return datetime.now(timezone.utc)


@dataclass
class MemoryEntry:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=_utc_now)
    memory_type: str = "episodic"   # episodic | semantic | procedural


class MemoryBackend(ABC):
    @abstractmethod
    def store(self, entry: MemoryEntry) -> str:
        """Store a memory entry; return its ID."""

    @abstractmethod
    def search(self, query: str, top_k: int = 5, memory_type: str | None = None) -> list[MemoryEntry]:
        """Retrieve the most relevant memories for the query."""


class ChromaMemoryBackend(MemoryBackend):
    """Semantic memory using ChromaDB and OpenAI embeddings."""

    def __init__(self, collection_name: str = "agent_memory") -> None:
        # Imported lazily so this module loads even when chromadb is not installed.
        import chromadb
        from chromadb.utils import embedding_functions

        self._client = chromadb.PersistentClient(path="./chroma_store")
        self._ef = embedding_functions.OpenAIEmbeddingFunction(
            # Assumption: the key is supplied via the OPENAI_API_KEY environment variable.
            api_key=os.environ.get("OPENAI_API_KEY"),
            model_name="text-embedding-3-small",
        )
        self._collection = self._client.get_or_create_collection(
            name=collection_name, embedding_function=self._ef
        )

    def store(self, entry: MemoryEntry) -> str:
        entry_id = str(uuid.uuid4())
        # Chroma metadata values must be scalars (str, int, float, bool).
        # The reserved keys "type" and "ts" carry the memory type and timestamp.
        self._collection.add(
            documents=[entry.content],
            metadatas=[{**entry.metadata, "type": entry.memory_type, "ts": entry.timestamp.isoformat()}],
            ids=[entry_id],
        )
        return entry_id

    def search(self, query: str, top_k: int = 5, memory_type: str | None = None) -> list[MemoryEntry]:
        # Restrict the search to one memory type when requested.
        where = {"type": memory_type} if memory_type else None
        results = self._collection.query(
            query_texts=[query], n_results=top_k, where=where
        )
        entries = []
        # Results are nested per query; index 0 is the single query sent above.
        for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
            meta = dict(meta or {})
            # Pull the reserved keys back out so metadata round-trips unchanged.
            ts = meta.pop("ts", None)
            stored_type = meta.pop("type", "semantic")
            entries.append(MemoryEntry(
                content=doc,
                metadata=meta,
                timestamp=datetime.fromisoformat(ts) if ts else _utc_now(),
                memory_type=stored_type,
            ))
        return entries