Chapter 7: Memory Systems
Memory Systems in Building Agentic AI Systems.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the agentic AI concept behind Memory Systems.
- Apply Memory Systems to design reliable, production-grade agent systems.
- Recognize operational trade-offs in tool use, orchestration, safety, and cost.
Four memory types, temporal knowledge graphs, and hybrid retrieval
Memory is Not the Context Window
The most common misconception in agent design: treating the context window as memory. The context window is a fixed-size working buffer. Memory is a set of external storage systems that selectively surface relevant information into that buffer on demand.
Why context ≠ memory
- Context windows are expensive: models such as GPT-4o charge per token, so a turn that carries 10K tokens of conversation history costs about five times as much in input tokens for that history as a turn that carries only the 2K relevant tokens
- Context windows are ephemeral: their contents disappear when the session ends
- Context windows degrade at long range: model performance on facts in the middle of a very long context is measurably worse than on facts at the start or end (the "lost in the middle" problem)
The goal of a memory system is to decide what is worth putting in the context window for a given query, and to persist information across sessions.
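The selection step described above can be sketched as a greedy fit against a token budget. This is a minimal illustration, not a prescribed design: the `Candidate` class, the `relevance` score, and the four-characters-per-token estimate are all assumptions introduced here, and a real system would use the model's tokenizer and a real retriever score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical retriever score; higher means more relevant


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token for English text.
    return max(1, len(text) // 4)


def select_for_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily keep the most relevant candidates that still fit in the budget."""
    selected: list[Candidate] = []
    used = 0
    for cand in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        cost = estimate_tokens(cand.text)
        if used + cost <= budget:
            selected.append(cand)
            used += cost
    return selected


memories = [
    Candidate("Alice prefers email over phone calls.", 0.91),
    Candidate("Full transcript of last week's call ... " * 50, 0.80),
    Candidate("Alice's account is on the annual plan.", 0.55),
]
for m in select_for_context(memories, budget=30):
    print(m.text)
```

The long transcript is skipped because it alone exceeds the budget, while the two short facts fit. A greedy pass is simple but not optimal; it is shown here only to make the "decide what is worth putting in the context window" step concrete.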
Four Memory Types
The implementation later in this chapter distinguishes episodic, semantic, and procedural memory; together with working memory (the context window itself), these make up the four types.
Temporal Knowledge Graphs
Standard vector databases only answer "what is similar to this query?" A temporal knowledge graph (e.g., Zep's Graphiti) also answers "how did things change over time?" and "what relationships exist between entities?" It combines semantic search with entity extraction, relationship modeling, and time-range filtering.
Figure: Temporal knowledge graph structure. Edges carry timestamps, which supports queries such as "what was true before or after date X?"
Zep reports 94.8% accuracy for Graphiti on the Deep Memory Retrieval (DMR) benchmark, ahead of flat vector search baselines. The improvement is attributed to graph traversal, which chains related facts instead of returning independent document chunks.
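Retrieval Strategies
Figure: Retrieval pipeline. A query such as "Tell me about Alice's recent work" is embedded as a dense vector (e.g., text-embedding-3), matched with dense + BM25 search and reranked, and the results are injected into context.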
Dense vs Sparse vs Hybrid
| Strategy | Mechanism | Best For | Weakness |
|---|---|---|---|
| Dense (vector) | Cosine similarity on embeddings | Semantic similarity, paraphrases | Misses exact keyword matches |
| Sparse (BM25) | Keyword matching with TF-IDF-style term weighting | Exact terms, codes, IDs | Misses semantic similarity |
| Hybrid | Dense + sparse, score fusion | General purpose; best recall | More complex pipeline, higher latency |
| Reranking | Cross-encoder re-scores top-K | Precision on top-1 result | Added latency; requires 2nd model call |
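The table does not say which fusion method to use for the hybrid row. One widely used option is reciprocal rank fusion (RRF), which combines ranks rather than raw scores, so dense and sparse results do not need to share a score scale. The sketch below assumes each retriever has already returned a best-first list of document IDs; the function name and the example IDs are illustrative, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse best-first ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found by
            # several retrievers accumulate a higher total.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense_hits = ["d1", "d2", "d3"]   # hypothetical vector-search results
sparse_hits = ["d3", "d1", "d4"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (`d1`, `d3`) rise above documents found by only one retriever, which is the recall benefit the hybrid row describes. A cross-encoder reranker would then re-score only the top few fused results.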
Delta compression for multi-turn agents
RetainDB reports that delta compression of episodic memory achieves 50–90% token savings in multi-turn scenarios by storing only what changed between turns rather than the full conversation state. This is especially valuable for long-running agent sessions where memory accumulates quickly.
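The general idea of storing only what changed can be sketched with plain dictionaries. This is an illustration under stated assumptions, not RetainDB's method: the state is assumed to be a flat dict per turn, and `compute_delta`, `apply_delta`, and `replay` are hypothetical helper names.

```python
from typing import Any

State = dict[str, Any]


def compute_delta(prev: State, curr: State) -> dict[str, Any]:
    """Record only the keys that were added, changed, or removed."""
    changed = {k: v for k, v in curr.items() if k not in prev or prev[k] != v}
    removed = [k for k in prev if k not in curr]
    return {"set": changed, "removed": removed}


def apply_delta(state: State, delta: dict[str, Any]) -> State:
    """Rebuild the next state from the previous state and a delta."""
    new_state = {k: v for k, v in state.items() if k not in delta["removed"]}
    new_state.update(delta["set"])
    return new_state


def replay(base: State, deltas: list[dict[str, Any]]) -> State:
    """Reconstruct the latest state from the first full state plus deltas."""
    state = base
    for delta in deltas:
        state = apply_delta(state, delta)
    return state


turns = [
    {"user": "Alice", "topic": "billing", "open_ticket": "T-1"},
    {"user": "Alice", "topic": "billing", "open_ticket": "T-1", "mood": "frustrated"},
    {"user": "Alice", "topic": "refund", "mood": "frustrated"},
]
deltas = [compute_delta(a, b) for a, b in zip(turns, turns[1:])]
print(deltas[0])  # {'set': {'mood': 'frustrated'}, 'removed': []}
print(replay(turns[0], deltas) == turns[-1])  # True
```

Only the first turn is stored in full; each later turn costs as much as its changes. The trade-off is that reading the latest state requires replaying the chain, so long sessions may want periodic full snapshots.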
Memory System Implementation
import os
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


def _utc_now() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


@dataclass
class MemoryEntry:
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=_utc_now)
    memory_type: str = "episodic"  # episodic | semantic | procedural


class MemoryBackend(ABC):
    @abstractmethod
    def store(self, entry: MemoryEntry) -> str:
        """Store a memory entry; return its ID."""

    @abstractmethod
    def search(self, query: str, top_k: int = 5, memory_type: str | None = None) -> list[MemoryEntry]:
        """Retrieve the most relevant memories for the query."""


class ChromaMemoryBackend(MemoryBackend):
    """Semantic memory using ChromaDB and OpenAI embeddings."""

    def __init__(self, collection_name: str = "agent_memory") -> None:
        # Imported lazily so this module loads even without chromadb installed.
        import chromadb
        from chromadb.utils import embedding_functions

        self._client = chromadb.PersistentClient(path="./chroma_store")
        self._ef = embedding_functions.OpenAIEmbeddingFunction(
            api_key=os.environ["OPENAI_API_KEY"],  # assumes the key is set in the environment
            model_name="text-embedding-3-small",
        )
        self._collection = self._client.get_or_create_collection(
            name=collection_name, embedding_function=self._ef
        )

    def store(self, entry: MemoryEntry) -> str:
        entry_id = str(uuid.uuid4())
        self._collection.add(
            documents=[entry.content],
            # Chroma metadata values must be scalars, so the timestamp is
            # stored as an ISO 8601 string under the reserved key "ts".
            metadatas=[{**entry.metadata, "type": entry.memory_type, "ts": entry.timestamp.isoformat()}],
            ids=[entry_id],
        )
        return entry_id

    def search(self, query: str, top_k: int = 5, memory_type: str | None = None) -> list[MemoryEntry]:
        # Restrict results to one memory type when requested.
        where = {"type": memory_type} if memory_type else None
        results = self._collection.query(
            query_texts=[query], n_results=top_k, where=where
        )
        entries = []
        # Results are nested per query; index 0 is our single query.
        for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
            # Pull the reserved keys back out so the returned metadata
            # matches what the caller originally stored.
            meta = dict(meta or {})
            ts = meta.pop("ts", None)
            stored_type = meta.pop("type", "semantic")
            entries.append(MemoryEntry(
                content=doc,
                metadata=meta,
                timestamp=datetime.fromisoformat(ts) if ts else _utc_now(),
                memory_type=stored_type,
            ))
        return entries
Chapter 7 Quiz
1. What is the "lost in the middle" problem?
2. What advantage does a temporal knowledge graph have over a flat vector database?
3. Why does hybrid search (dense + BM25) outperform either approach alone?