Chapter 4: Vector Databases
Storing and Searching Embeddings
Learning Objectives
- Understand vector database fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Vector Databases
Why Vector Databases are Essential for RAG
Once you've created embeddings for your documents, you need to store them somewhere and search through them efficiently. A traditional database (like PostgreSQL or MySQL) is designed for exact matches and structured queries, not for finding "similar" vectors. Vector databases are specialized databases optimized for storing and searching high-dimensional vectors using similarity metrics like cosine similarity.
The scale problem: In production RAG systems, you might have millions or billions of document chunks, each with a 384-1536 dimensional embedding vector. Brute-force search compares the query against every stored vector, so per-query cost grows linearly with collection size: already seconds for millions of vectors and far longer for billions. Vector databases use sophisticated indexing algorithms (like HNSW, IVF, or LSH) to keep similarity search at millisecond scale even across billions of vectors.
What Vector Databases Provide
- Fast Similarity Search: Find top-k most similar vectors in milliseconds, even with millions of documents
- Scalable Storage: Efficiently store and index billions of high-dimensional vectors
- Metadata Filtering: Combine vector similarity search with traditional filters (date, category, author, etc.)
- Real-time Updates: Add, update, or delete vectors without rebuilding entire indexes
- Approximate Nearest Neighbor (ANN): Trade a small amount of accuracy for massive speed improvements (often orders of magnitude faster than exact search)
Performance Comparison:
Brute-force search (NumPy): To find the top-5 most similar vectors among 1 million documents:
- ❌ Compute 1 million cosine similarities and sort: on the order of seconds to tens of seconds per query, depending on hardware
- ❌ Cost grows linearly with collection size
- ❌ Too slow for interactive use (a minimal brute-force baseline is sketched in the code below)
Vector database (HNSW index): Same task:
- ✅ Find top-5 similar vectors: 50-200 milliseconds
- ✅ Speedup: roughly 100-300x at this scale, and the gap widens as the collection grows
- ✅ Enables real-time RAG systems
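To make the brute-force baseline concrete, here is a minimal NumPy sketch with synthetic data (the corpus size, dimension, and any timings you observe are illustrative): exact search is one large matrix-vector product plus a partial sort, so its cost grows linearly with the number of stored vectors.
import numpy as np

# Synthetic corpus: 100,000 embeddings of dimension 384 (e.g., the all-MiniLM-L6-v2 output size)
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100_000, 384)).astype("float32")
query = rng.normal(size=384).astype("float32")

def brute_force_top_k(query_vec, corpus, k=5):
    """Exact top-k by cosine similarity: normalize, dot product, partial sort."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    sims = corpus_norm @ query_norm              # one similarity score per document
    top_idx = np.argpartition(-sims, k)[:k]      # unordered top-k in O(N)
    return top_idx[np.argsort(-sims[top_idx])], sims

top_ids, sims = brute_force_top_k(query, doc_embeddings)
print(top_ids, sims[top_ids])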
Key Concepts You'll Learn
- Indexing Algorithms: HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), Product Quantization, and LSH - how they work and when to use each
- Vector Database Options: Pinecone, Weaviate, Chroma, Qdrant, FAISS, Milvus - comparing features, performance, and use cases
- Metadata Filtering: Combining vector similarity search with traditional database filters for precise retrieval
- Exact vs Approximate Search: Understanding the trade-offs between accuracy and speed
- Scaling Strategies: Sharding, distributed indexing, and optimization techniques for billion-scale systems
- Production Considerations: Update strategies, index maintenance, and performance monitoring
Why this matters: Vector databases are the infrastructure that makes RAG systems practical at scale. Without them, you're limited to small document collections or unacceptably slow query times. Choosing the right vector database and indexing strategy directly impacts your RAG system's performance, cost, and scalability.
Key Concepts
Vector Database Indexing Strategies: How Fast Retrieval Works
Vector databases use sophisticated indexing algorithms to enable fast similarity search across millions of vectors. Understanding these indexing strategies is crucial for choosing the right database and optimizing performance.
1. HNSW (Hierarchical Navigable Small World)
What it is: HNSW is one of the most popular and effective indexing algorithms for approximate nearest neighbor (ANN) search. It builds a multi-layer graph where each layer is a subset of the previous layer, creating a "small world" network that allows efficient navigation.
How it works:
- Multi-layer structure: Creates multiple layers (levels) of graphs, with the top layer having few nodes and bottom layer having all nodes
- Greedy search: Starts at the top layer, finds the nearest neighbor, then moves to the next layer and continues
- Small world property: Each node is connected to a small number of "long-range" connections, allowing fast navigation across the graph
- Dynamic insertion: New vectors can be added without rebuilding the entire index
Advantages:
- ✅ Very fast: Sub-millisecond search times even for millions of vectors
- ✅ High accuracy: Can achieve 95%+ recall (finds 95% of true nearest neighbors)
- ✅ Scalable: Works well with billions of vectors
- ✅ Used by major databases: Pinecone, Weaviate, Qdrant all use HNSW variants
Trade-offs:
- ⚠️ Memory intensive: Stores the graph structure in memory (though can be optimized)
- ⚠️ Indexing time: Building the index takes time, though it's a one-time cost
When to use: For production RAG systems where speed and accuracy are critical. This is the default choice for most vector databases.
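As a concrete illustration of the HNSW parameters discussed above, here is a small sketch using the hnswlib library with synthetic vectors; the M, ef_construction, and ef values are the usual tuning knobs, and the numbers chosen here are illustrative rather than recommendations.
import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(num_elements, dim)).astype("float32")

# Build an HNSW index over cosine distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)  # graph connectivity and build quality
index.add_items(data, np.arange(num_elements))

# Query: ef controls the search-time accuracy/speed trade-off (must be >= k)
index.set_ef(50)
labels, distances = index.knn_query(data[:1], k=5)
print(labels, distances)
Raising M and ef improves recall at the cost of memory and query latency, which is the accuracy/speed trade-off described above.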
2. IVF (Inverted File Index)
What it is: IVF partitions the vector space into clusters (Voronoi cells) and creates an inverted index mapping each cluster to its vectors.
How it works:
- Clustering: Uses k-means or similar to partition vectors into clusters
- Inverted index: For each cluster, stores a list of vectors belonging to that cluster
- Search: Finds the nearest cluster(s) to the query vector, then searches only within those clusters
Advantages:
- ✅ Memory efficient: Lower memory footprint than HNSW
- ✅ Fast for large datasets: Only searches relevant clusters, not all vectors
- ✅ Used by FAISS: FAISS's IVF-Flat and IVF-PQ use this approach
Trade-offs:
- ⚠️ Lower accuracy: May miss vectors near cluster boundaries
- ⚠️ Requires tuning: Number of clusters needs careful selection
When to use: For very large datasets (billions of vectors) where memory is a constraint, or when using FAISS.
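For reference, this is roughly what an IVF index looks like in FAISS (IndexIVFFlat) on synthetic data; nlist and nprobe correspond to the cluster count and search breadth described above, and the values are illustrative.
import faiss
import numpy as np

d, n = 384, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")
xq = rng.normal(size=(1, d)).astype("float32")

nlist = 1024                              # number of Voronoi cells (clusters)
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer that assigns vectors to cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                           # runs k-means to learn the cluster centroids
index.add(xb)

index.nprobe = 8                          # how many nearby cells to scan per query
D, I = index.search(xq, 5)
print(I, D)
Increasing nprobe scans more clusters, recovering some of the vectors near cluster boundaries at the cost of speed.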
3. Product Quantization (PQ)
What it is: A compression technique that reduces vector storage by quantizing (discretizing) vector components into a smaller number of values.
How it works:
- Vector splitting: Splits each vector into multiple sub-vectors
- Quantization: Each sub-vector is mapped to a "codebook" (set of representative vectors)
- Compression: Stores only the codebook indices, not full vectors
- Fast distance: Uses lookup tables for fast approximate distance calculations
Advantages:
- ✅ Massive storage reduction: Can reduce storage by 10-100x
- ✅ Fast search: Approximate distances computed quickly using lookup tables
- ✅ Enables billion-scale: Makes it feasible to store billions of vectors
Trade-offs:
- ⚠️ Accuracy loss: Compression introduces approximation errors
- ⚠️ Training required: Codebooks need to be trained on representative data
When to use: For extremely large datasets where storage is a primary concern. Often combined with IVF (IVF-PQ).
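A short sketch of IVF combined with PQ in FAISS (IndexIVFPQ): each 384-dimensional vector is stored as m = 8 one-byte codes, which matches the compression arithmetic later in this chapter. The parameter values are illustrative.
import faiss
import numpy as np

d, n = 384, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")

nlist, m, nbits = 1024, 8, 8              # 1024 clusters; 8 sub-vectors; 2^8 = 256 centroids each
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                           # learns the coarse clusters and the PQ codebooks
index.add(xb)                             # each vector is stored as m bytes of codes

index.nprobe = 8
D, I = index.search(xb[:1], 5)
print(I, D)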
4. LSH (Locality-Sensitive Hashing)
What it is: Uses hash functions that map similar vectors to the same hash buckets, enabling fast approximate search.
How it works:
- Hash functions: Creates multiple hash functions that preserve similarity (similar vectors hash to same bucket)
- Bucketing: Vectors are placed into hash buckets
- Search: Query vector is hashed, then only vectors in the same bucket(s) are searched
Advantages:
- ✅ Very fast: Constant-time hash lookup
- ✅ Simple: Easy to understand and implement
Trade-offs:
- ⚠️ Lower accuracy: May miss some similar vectors
- ⚠️ Parameter tuning: Number of hash functions and buckets needs tuning
When to use: For very fast, approximate search when some accuracy loss is acceptable. Less common in modern RAG systems.
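To show the idea without a library, here is a toy random-hyperplane LSH sketch in plain NumPy. It uses a single hash table; real systems use several tables and more bits, and all sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_bits = 384, 10_000, 16
vectors = rng.normal(size=(n, d)).astype("float32")
planes = rng.normal(size=(n_bits, d))            # random hyperplanes define the hash

def lsh_key(v):
    """Sign of the projection onto each hyperplane gives an n_bits-bit bucket key."""
    return ((planes @ v) > 0).astype(np.uint8).tobytes()

# Bucket the corpus once
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_key(v), []).append(i)

# At query time, score only the vectors that share the query's bucket
query = rng.normal(size=d).astype("float32")
candidates = buckets.get(lsh_key(query), [])
print(f"candidates to score: {len(candidates)} of {n}")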
Metadata Filtering: Combining Vector Search with Traditional Filters
Real-world RAG systems often need to filter documents by metadata (date, author, category, etc.) in addition to semantic similarity. Vector databases support this through metadata filtering.
How Metadata Filtering Works
Two-stage process:
- Filter first: Apply metadata filters to reduce the search space (e.g., "only documents from 2023")
- Search in filtered set: Perform vector similarity search only on the filtered documents
Example:
Query with Metadata Filtering
Query: "machine learning best practices"
Metadata filters:
- Category = "Technical Blog"
- Date >= "2023-01-01"
- Author = "John Doe"
Process:
- Filter 1M documents → 50K documents matching metadata
- Vector search in 50K documents → Top 5 most similar
✅ Much faster than searching all 1M documents!
Types of Metadata Filters (a Chroma-style sketch follows this list):
- Equality filters: author = "John Doe"
- Range filters: date >= "2023-01-01" AND date <= "2023-12-31"
- In filters: category IN ["Tech", "Science"]
- Boolean combinations: (category = "Tech" OR category = "Science") AND date >= "2023"
Benefits of Metadata Filtering
- ✅ Faster search: Reduces the number of vectors to search
- ✅ More relevant results: Ensures results match business constraints
- ✅ Better user experience: Users can narrow down by date, source, etc.
- ✅ Compliance: Can filter by access permissions, data retention policies
Vector Database Operations: Indexing, Querying, and Maintenance
1. Indexing (One-Time Setup)
What happens: When you add documents to a vector database, they go through an indexing process:
- Embedding generation: Each document chunk is converted to a vector using an embedding model
- Metadata extraction: Extract and store metadata (title, date, author, etc.)
- Index building: Vector is inserted into the index structure (HNSW, IVF, etc.)
- Storage: Vector, metadata, and original text are stored
Performance considerations:
- ⚠️ Indexing is slow: Building indexes takes time (minutes to hours for large datasets)
- ⚠️ Batch processing: More efficient to index in batches rather than one-by-one
- ⚠️ Incremental updates: Some databases support adding vectors without rebuilding (HNSW), others require full rebuild (IVF)
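As a sketch of batched indexing, here is one way it might look with Chroma and a sentence-transformers model; the batch size, collection name, and synthetic documents are arbitrary choices for illustration. Grouping documents into batches keeps both embedding generation and index insertion efficient.
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("batch_demo", metadata={"hnsw:space": "cosine"})
embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [f"Document number {i} about machine learning." for i in range(1_000)]

# One encode() call and one add() call per batch is far cheaper than 1,000 single inserts
batch_size = 256
for start in range(0, len(documents), batch_size):
    batch_docs = documents[start:start + batch_size]
    batch_ids = [f"doc_{start + i}" for i in range(len(batch_docs))]
    batch_embeddings = embedder.encode(batch_docs).tolist()
    collection.add(documents=batch_docs, embeddings=batch_embeddings, ids=batch_ids)

print(f"Indexed {collection.count()} documents")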
2. Querying (Per-Request)
What happens: When a user query arrives:
- Query embedding: Convert query to vector using the same embedding model
- Metadata filtering (optional): Apply metadata filters to reduce search space
- Vector search: Use the index to find top-k most similar vectors
- Result retrieval: Return document IDs, metadata, and similarity scores
- Document fetching: Retrieve actual document text using IDs
Performance considerations:
- ✅ Very fast: Sub-100ms for millions of vectors with good indexes
- ✅ Scalable: Query time grows slowly with dataset size (logarithmic for HNSW)
- ⚠️ First query slower: May need to load index into memory
3. Maintenance Operations
Index updates: When documents are added, updated, or deleted:
- Add: Insert new vector into index (fast for HNSW, may require rebuild for IVF)
- Update: Delete old vector, insert new one (or update in-place if supported)
- Delete: Remove vector from index (mark as deleted or physically remove)
Index optimization:
- Rebuilding: Periodically rebuild index to optimize structure (especially for IVF)
- Compaction: Remove deleted vectors and optimize storage
- Monitoring: Track index size, query performance, accuracy metrics
Mathematical Formulations
Vector Database Performance Metrics
Vector databases use sophisticated indexing algorithms to enable fast similarity search. Understanding the mathematical foundations helps you choose the right database, configure indexes, and optimize performance. These formulas describe how vector databases achieve sub-millisecond search times even with millions of vectors.
1. HNSW Search Complexity
\[ T_{\text{search}} = O(\log N) \]
What This Represents:
HNSW (Hierarchical Navigable Small World) achieves logarithmic search time complexity, meaning search time grows very slowly as the number of vectors increases. This is why it can search millions of vectors in milliseconds.
Breaking It Down:
- \(T_{\text{search}}\): Time complexity of search operation
- \(O(\log N)\): Big-O notation indicating logarithmic time complexity
- \(N\): Number of vectors in the database
What Logarithmic Means:
If you double the number of vectors, search time increases by a constant amount (not doubled). For example:
- 1,000 vectors: ~10 operations
- 10,000 vectors: ~13 operations (only 30% more!)
- 1,000,000 vectors: ~20 operations (only 100% more for 1000x more data!)
Comparison to Brute-Force:
Brute-force: \(O(N)\) - linear time. To search 1M vectors, you must compare query with all 1M vectors.
HNSW: \(O(\log N)\) - logarithmic time. To search 1M vectors, you only need ~20 comparisons by navigating the graph structure.
Why HNSW is Fast:
HNSW builds a multi-layer graph where each layer has fewer nodes. Search starts at the top (few nodes), finds approximate location, then refines in lower layers. This hierarchical approach dramatically reduces comparisons needed.
2. IVF Cluster Search Reduction
\[ T_{\text{search}} = O(\sqrt{N}) + O(k) \]
What This Represents:
IVF (Inverted File Index) partitions vectors into clusters. Instead of searching all \(N\) vectors, you only search within the nearest cluster(s), dramatically reducing search space.
Breaking It Down:
- \(O(\sqrt{N})\): Time to find the nearest cluster(s) - grows with square root of N
- \(O(k)\): Time to search within the cluster(s) - constant or linear in cluster size
- Total: Much faster than \(O(N)\) brute-force search
How It Works:
- Partition all vectors into \(\sqrt{N}\) clusters using k-means
- For a query, find the nearest cluster(s): \(O(\sqrt{N})\) operations
- Search only within those clusters: \(O(k)\) where k is cluster size
- Total: \(O(\sqrt{N} + k)\) instead of \(O(N)\)
Example:
1,000,000 vectors partitioned into 1,000 clusters (1,000 vectors per cluster):
- Brute-force: Compare with all 1,000,000 vectors
- IVF: Find nearest cluster (1,000 comparisons) + search within cluster (1,000 comparisons) = 2,000 total comparisons
- Speedup: 500x faster!
Trade-off:
IVF is faster than brute-force but may miss vectors near cluster boundaries. HNSW is more accurate but uses more memory. Choose based on your accuracy vs speed requirements.
3. Product Quantization Compression Ratio
\[ \text{Compression ratio} = \frac{d \times 4 \text{ bytes}}{m \times \frac{\log_2(k)}{8} \text{ bytes}} \]
What This Measures:
Product Quantization (PQ) compresses vectors by quantizing sub-vectors into codebooks. This formula calculates the compression ratio achieved.
Breaking It Down:
- \(d\): Original vector dimension (e.g., 384)
- \(4 \text{ bytes}\): Size per float32 value (original storage)
- \(m\): Number of sub-vectors (e.g., 8 sub-vectors of 48 dimensions each)
- \(k\): Codebook size (number of quantization levels, e.g., 256)
- \(\log_2(k) \text{ bits}\): Bits needed to store codebook index (e.g., \(\log_2(256) = 8\) bits = 1 byte)
Example:
Original: 384-dimensional vector = 384 × 4 bytes = 1,536 bytes
PQ compressed: 8 sub-vectors × 1 byte = 8 bytes
Compression ratio: \(\frac{1536}{8} = 192x\) reduction!
Trade-off:
Higher compression = less storage but some accuracy loss. Typical PQ achieves 10-100x compression with minimal accuracy degradation (95%+ recall maintained).
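The compression example above, checked in a few lines of Python:
import math

d, m, k = 384, 8, 256                  # dimension, sub-vectors, codebook size
original_bytes = d * 4                 # float32 storage: 1,536 bytes
code_bytes = m * math.log2(k) / 8      # 8 bits (1 byte) per sub-vector code: 8 bytes
print(original_bytes / code_bytes)     # -> 192.0, matching the 192x figure above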
4. Vector Database Query Time
\[ T_{\text{query}} = T_{\text{embed}} + T_{\text{search}} + T_{\text{retrieve}} \]
What This Represents:
Total query time in a RAG system is the sum of embedding generation, vector search, and document retrieval times. Understanding this breakdown helps you optimize each component.
Breaking It Down:
- \(T_{\text{embed}}\): Time to convert query text to embedding vector (typically 10-50ms for local models, 50-200ms for API calls)
- \(T_{\text{search}}\): Time to find top-k similar vectors in the database (typically 10-100ms with HNSW for millions of vectors)
- \(T_{\text{retrieve}}\): Time to fetch actual document text using retrieved IDs (typically 1-10ms if documents are cached)
Typical Breakdown:
For a query in a system with 1M documents:
- Embedding: 20ms (local model) or 100ms (API)
- Vector search: 50ms (HNSW index)
- Document retrieval: 5ms (cached)
- Total: 75ms (local) or 155ms (API)
Optimization Strategies:
- Reduce \(T_{\text{embed}}\): Cache query embeddings for common queries, use faster embedding models
- Reduce \(T_{\text{search}}\): Use efficient indexes (HNSW), limit search space with metadata filters
- Reduce \(T_{\text{retrieve}}\): Cache documents in memory, use fast storage (SSD, in-memory cache)
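A minimal sketch of measuring these three components in your own pipeline; embedder, collection, and fetch_documents are placeholders for whatever embedding model, vector store, and document store you use, and the collection.query call assumes a Chroma-style interface.
import time

def timed_query(query_text, embedder, collection, fetch_documents, k=5):
    """Report T_embed, T_search, and T_retrieve for one query."""
    t0 = time.perf_counter()
    query_embedding = embedder.encode([query_text]).tolist()
    t1 = time.perf_counter()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    t2 = time.perf_counter()
    documents = fetch_documents(results["ids"][0])
    t3 = time.perf_counter()
    print(f"T_embed={1000*(t1-t0):.1f}ms  T_search={1000*(t2-t1):.1f}ms  "
          f"T_retrieve={1000*(t3-t2):.1f}ms  total={1000*(t3-t0):.1f}ms")
    return documents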
5. Index Build Time
\[ T_{\text{build}} = O(N \log N) \]
What This Represents:
Building an HNSW index takes \(O(N \log N)\) time, where \(N\) is the number of vectors. This is a one-time cost when indexing documents, but it's important to understand for planning indexing operations.
Breaking It Down:
- \(N\): Number of vectors to index
- \(O(N \log N)\): Time complexity - grows faster than linear but slower than quadratic
- For each vector, the algorithm needs to find its position in the graph structure
Practical Times:
- 10,000 vectors: ~1-5 seconds
- 100,000 vectors: ~30-120 seconds
- 1,000,000 vectors: ~10-30 minutes
- 10,000,000 vectors: ~2-8 hours
Strategies for Large Datasets:
- Batch indexing: Index in batches rather than one-by-one
- Incremental updates: Use databases that support adding vectors without full rebuild (HNSW supports this)
- Parallel indexing: Use multiple CPU cores to speed up index building
- Background indexing: Build index in background while serving queries from old index
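As a sketch of batch plus parallel index building, here is how it might look with hnswlib and synthetic data; hnswlib parallelizes add_items across CPU cores, and the batch size and parameters here are arbitrary.
import hnswlib
import numpy as np

dim, n = 384, 200_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)

batch_size = 50_000
for start in range(0, n, batch_size):
    end = min(start + batch_size, n)
    index.add_items(data[start:end], np.arange(start, end), num_threads=-1)  # -1 = use all CPU cores
    print(f"indexed {end} / {n} vectors")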
Detailed Examples
Example 1: Vector Database Indexing and Querying - Complete Workflow
Scenario: Setting up a vector database for a RAG system with 10,000 document chunks.
Step 1: Document Preparation
You have 10,000 document chunks, each already embedded:
- Chunk 1: "Machine learning is..." → Embedding: [0.45, -0.23, 0.67, ...] (384-dim)
- Chunk 2: "Deep learning uses..." → Embedding: [0.48, -0.25, 0.65, ...]
- ... (9,998 more chunks)
Step 2: Index Building
- All 10,000 embeddings are inserted into the vector database
- HNSW index is built (typically a few seconds for 10K vectors)
- Index creates a multi-layer graph structure for fast search
- Total storage: 10,000 × 384 × 4 bytes = ~15 MB (just for embeddings)
Step 3: Query Processing
User query: "What is neural network training?"
- Query embedding: [0.46, -0.24, 0.66, ...]
- Vector database searches the HNSW index
- Finds top-5 most similar vectors in ~10-50 milliseconds
- Returns document IDs: [doc_123, doc_456, doc_789, doc_234, doc_567]
Step 4: Document Retrieval
- Using document IDs, fetch actual text from storage
- Returns: ["Neural networks are trained using...", "Training involves...", ...]
- Total query time: ~50-100ms (embedding + search + retrieval)
Performance: At 10,000 chunks, brute-force search already takes on the order of 150-200ms per query, and that cost grows linearly with collection size; with an HNSW index, search stays in the tens of milliseconds even as the collection grows to millions of chunks (see Example 3 below).
Example 2: Metadata Filtering in Action
Scenario: A knowledge base with documents from different years, and you want to filter by date before similarity search.
Knowledge Base: 1,000,000 documents
- 500,000 documents from 2023
- 300,000 documents from 2024
- 200,000 documents from 2025
Query: "Latest machine learning trends"
Without Metadata Filtering:
- Search all 1,000,000 documents
- Time: ~200ms
- Results might include outdated 2023 documents
With Metadata Filtering (date >= 2024):
- Filter to 500,000 documents (2024 + 2025)
- Search only in filtered set
- Time: ~100ms (about half the time)
- Results are more recent and relevant
Combined Filter Example:
Query: "Python machine learning tutorials from 2024"
- Metadata filters: category="tutorial", year=2024, language="Python"
- Filter from 1M → 50,000 documents
- Vector search in 50K documents: ~30ms
- ✅ Much faster and more precise than searching all 1M documents
Example 3: HNSW vs Brute-Force Performance Comparison
Scenario: Comparing search performance for different database sizes.
Test Setup: Find top-5 most similar vectors to a query
1,000 Documents:
- Brute-force: Compare query with all 1,000 vectors = 1,000 comparisons = ~16ms
- HNSW: Navigate graph structure = ~10 comparisons = ~2ms
- Speedup: 8x faster
100,000 Documents:
- Brute-force: 100,000 comparisons = ~1.6 seconds
- HNSW: ~15 comparisons = ~5ms
- Speedup: 320x faster!
1,000,000 Documents:
- Brute-force: 1,000,000 comparisons = ~16 seconds (unacceptable!)
- HNSW: ~20 comparisons = ~50ms
- Speedup: 320x faster!
10,000,000 Documents:
- Brute-force: 10,000,000 comparisons = ~2.7 minutes (completely impractical!)
- HNSW: ~25 comparisons = ~100ms
- Speedup: 1,600x faster!
Key Insight: As database size grows, brute-force search slows down linearly (in proportion to the number of vectors), while HNSW search time grows only logarithmically. This is why vector databases are essential for production RAG systems.
Example 4: Incremental Updates to Vector Database
Scenario: Adding new documents to an existing vector database without rebuilding the entire index.
Initial State:
- Database has 100,000 documents indexed
- HNSW index is built and optimized
- Query time: ~50ms
New Documents Arrive:
- 1,000 new documents need to be added
- Each document is embedded: [0.45, -0.23, 0.67, ...]
Incremental Update Process:
Step 1: Embed new documents
- Generate embeddings for 1,000 new documents
- Time: ~10-30 seconds (depending on embedding model)
Step 2: Insert into HNSW index
- For each new vector, find its position in the graph
- Connect it to nearest neighbors in each layer
- Time: ~5-10 seconds for 1,000 vectors
- ✅ No need to rebuild entire index!
Step 3: Verify
- Database now has 101,000 documents
- Query time: ~52ms (slightly slower, but still fast)
- New documents are immediately searchable
Comparison to Full Rebuild:
- Incremental update: ~15-40 seconds total
- Full rebuild: ~10-30 minutes (would require rebuilding entire index)
- ✅ Incremental updates are 15-120x faster!
Example 5: Choosing Between Vector Database Options
Scenario: You need to choose a vector database for your RAG system. Here's how different options compare for a specific use case.
Use Case: 5 million documents, need sub-100ms query time, cloud deployment preferred
Option 1: Pinecone (Managed Cloud)
- ✅ Setup: 5 minutes (just create account, API key)
- ✅ Scaling: Automatic, handles 5M+ documents easily
- ✅ Performance: ~50ms query time
- ✅ Maintenance: Zero (fully managed)
- ❌ Cost: ~$70-200/month for 5M vectors
- ❌ Vendor lock-in: Data stored in Pinecone's cloud
- Best for: Quick deployment, teams without DevOps resources
Option 2: Chroma (Self-Hosted)
- ✅ Setup: 30-60 minutes (install, configure, deploy)
- ✅ Scaling: Manual (need to set up infrastructure)
- ✅ Performance: ~60-80ms query time
- ✅ Cost: ~$20-50/month (server costs only)
- ✅ Control: Full control over data and infrastructure
- ❌ Maintenance: You manage servers, updates, backups
- Best for: Cost-sensitive, need data control, have DevOps team
Option 3: FAISS (Library, Self-Hosted)
- ✅ Setup: 1-2 hours (integrate into your application)
- ✅ Performance: ~40-60ms query time (very fast)
- ✅ Cost: Server costs only
- ✅ Flexibility: Full control, can customize
- ❌ Scaling: Manual, need to implement sharding yourself
- ❌ Features: No built-in metadata filtering, need to implement yourself
- Best for: Research, maximum performance, full customization needed
Recommendation for this use case: Pinecone for fastest deployment, Chroma for cost savings, FAISS for maximum performance and control.
Implementation
Implementation Overview
This section provides practical Python code examples for working with vector databases in RAG systems. The examples demonstrate how to set up, index documents, query, and manage vector databases using popular options like Chroma, Pinecone, and FAISS.
1. Chroma Vector Database - Complete Setup
What this does: Sets up Chroma, indexes documents with embeddings, and performs similarity search. Chroma is easy to use and good for self-hosted solutions.
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
import numpy as np
class ChromaRAG:
"""
Complete RAG implementation using Chroma vector database.
This class handles:
1. Document indexing with embeddings
2. Query processing and retrieval
3. Metadata filtering
"""
def __init__(self, collection_name="documents", persist_directory="./chroma_db"):
"""
Initialize Chroma client and collection.
Args:
collection_name: Name of the collection to create/use
persist_directory: Directory to persist data (for PersistentClient)
"""
# Use PersistentClient for production (saves to disk)
self.client = chromadb.PersistentClient(path=persist_directory)
# Get or create collection
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
# Initialize embedding model
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Initialized Chroma collection: {collection_name}")
def add_documents(self, documents, ids=None, metadatas=None):
"""
Add documents to the vector database.
Args:
documents: List of document text strings
ids: Optional list of document IDs (auto-generated if None)
metadatas: Optional list of metadata dictionaries
"""
if not documents:
raise ValueError("Documents list cannot be empty")
# Generate IDs if not provided
if ids is None:
ids = [f"doc_{i}" for i in range(len(documents))]
# Generate embeddings
print(f"Generating embeddings for {len(documents)} documents...")
embeddings = self.embedder.encode(documents, show_progress_bar=True)
embeddings_list = embeddings.tolist()
# Prepare metadatas (ensure all have same keys)
if metadatas is None:
metadatas = [{}] * len(documents)
# Add to collection
self.collection.add(
documents=documents,
embeddings=embeddings_list,
ids=ids,
metadatas=metadatas
)
print(f"Added {len(documents)} documents to collection")
print(f"Collection now has {self.collection.count()} total documents")
def query(self, query_text, n_results=5, where_filter=None):
"""
Query the vector database for similar documents.
Args:
query_text: User query string
n_results: Number of results to return
where_filter: Optional metadata filter (e.g., {"year": 2024})
Returns:
Dictionary with 'documents', 'ids', 'distances', 'metadatas'
"""
# Generate query embedding
query_embedding = self.embedder.encode([query_text])
# Build query
query_kwargs = {
"query_embeddings": query_embedding.tolist(),
"n_results": n_results
}
# Add metadata filter if provided
if where_filter:
query_kwargs["where"] = where_filter
# Execute query
results = self.collection.query(**query_kwargs)
return {
'documents': results['documents'][0],
'ids': results['ids'][0],
'distances': results['distances'][0],
'metadatas': results['metadatas'][0]
}
def update_document(self, doc_id, new_text, new_metadata=None):
"""
Update an existing document.
Args:
doc_id: ID of document to update
new_text: New document text
new_metadata: New metadata (optional)
"""
# Generate new embedding
new_embedding = self.embedder.encode([new_text])[0].tolist()
# Update in collection
self.collection.update(
ids=[doc_id],
documents=[new_text],
embeddings=[new_embedding],
metadatas=[new_metadata] if new_metadata else None
)
print(f"Updated document: {doc_id}")
def delete_documents(self, doc_ids):
"""Delete documents by IDs."""
self.collection.delete(ids=doc_ids)
print(f"Deleted {len(doc_ids)} documents")
# Example usage
rag = ChromaRAG(collection_name="knowledge_base")
# Add documents with metadata
documents = [
"Machine learning is a subset of AI that learns from data.",
"Deep learning uses neural networks with multiple layers.",
"Python is a popular programming language for data science."
]
metadatas = [
{"topic": "machine_learning", "year": 2024, "category": "AI"},
{"topic": "deep_learning", "year": 2024, "category": "AI"},
{"topic": "programming", "year": 2024, "category": "languages"}
]
ids = ["doc_ml", "doc_dl", "doc_python"]
rag.add_documents(documents, ids=ids, metadatas=metadatas)
# Query without filter
results = rag.query("What is machine learning?", n_results=2)
print("\nQuery results:")
for i, (doc, distance) in enumerate(zip(results['documents'], results['distances']), 1):
print(f"{i}. {doc} (distance: {distance:.3f})")
# Query with metadata filter
filtered_results = rag.query(
"What is machine learning?",
n_results=2,
where_filter={"category": "AI"} # Only search in AI category
)
print("\nFiltered results (AI category only):")
for i, doc in enumerate(filtered_results['documents'], 1):
print(f"{i}. {doc}")
Key Points:
- Persistent storage: Uses PersistentClient to save data to disk (survives restarts)
- Automatic indexing: Chroma automatically builds HNSW index for fast search
- Metadata filtering: Can filter by metadata before similarity search
- Update support: Can update documents without rebuilding entire index
- Distance vs similarity: Chroma returns distances (lower = more similar), not similarity scores
2. Pinecone Vector Database - Cloud Deployment
What this does: Sets up Pinecone (managed cloud service), indexes documents, and performs queries. Ideal for production deployments without infrastructure management. Note that this example uses the older pinecone-client v2 interface (pinecone.init, pinecone.create_index); newer releases of the Pinecone SDK expose a Pinecone client class with a slightly different API, so adapt the calls to the version you install.
import pinecone
from sentence_transformers import SentenceTransformer
import os
class PineconeRAG:
"""
RAG implementation using Pinecone (managed cloud vector database).
Pinecone handles infrastructure, scaling, and optimization automatically.
Good for production systems where you want minimal DevOps overhead.
"""
def __init__(self, api_key, environment, index_name="rag-index"):
"""
Initialize Pinecone connection.
Args:
api_key: Pinecone API key (get from pinecone.io)
environment: Pinecone environment (e.g., "us-west1-gcp")
index_name: Name of the index to create/use
"""
# Initialize Pinecone
pinecone.init(api_key=api_key, environment=environment)
# Get or create index
if index_name not in pinecone.list_indexes():
# Create index with specifications
pinecone.create_index(
index_name,
dimension=384, # Match embedding dimension
metric="cosine", # Use cosine similarity
metadata_config={"indexed": ["category", "year", "source"]} # Indexed metadata for filtering
)
print(f"Created new index: {index_name}")
else:
print(f"Using existing index: {index_name}")
self.index = pinecone.Index(index_name)
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Connected to Pinecone index: {index_name}")
def add_documents(self, documents, ids=None, metadatas=None):
"""
Add documents to Pinecone index.
Args:
documents: List of document strings
ids: Optional list of IDs
metadatas: Optional list of metadata dicts
"""
if ids is None:
ids = [f"doc_{i}" for i in range(len(documents))]
if metadatas is None:
metadatas = [{}] * len(documents)
# Generate embeddings
embeddings = self.embedder.encode(documents, show_progress_bar=True)
# Prepare vectors for upsert (Pinecone format)
vectors = []
for i, (doc_id, embedding, metadata) in enumerate(zip(ids, embeddings, metadatas)):
vectors.append({
"id": doc_id,
"values": embedding.tolist(),
"metadata": {**metadata, "text": documents[i]} # Store text in metadata
})
# Upsert in batches (Pinecone supports batch operations)
batch_size = 100
for i in range(0, len(vectors), batch_size):
batch = vectors[i:i + batch_size]
self.index.upsert(vectors=batch)
print(f"Added {len(documents)} documents to Pinecone")
print(f"Index stats: {self.index.describe_index_stats()}")
def query(self, query_text, top_k=5, filter_dict=None):
"""
Query Pinecone index.
Args:
query_text: User query
top_k: Number of results
filter_dict: Optional metadata filter (e.g., {"category": "AI"})
Returns:
Query results with documents, scores, and metadata
"""
# Generate query embedding
query_embedding = self.embedder.encode([query_text])[0].tolist()
# Build query
query_kwargs = {
"vector": query_embedding,
"top_k": top_k,
"include_metadata": True
}
# Add filter if provided
if filter_dict:
query_kwargs["filter"] = filter_dict
# Execute query
results = self.index.query(**query_kwargs)
# Format results
formatted_results = []
for match in results['matches']:
formatted_results.append({
'id': match['id'],
'score': match['score'], # Pinecone returns similarity scores (higher = more similar)
'text': match['metadata'].get('text', ''),
'metadata': {k: v for k, v in match['metadata'].items() if k != 'text'}
})
return formatted_results
# Example usage
# Initialize (requires Pinecone API key)
# rag = PineconeRAG(
# api_key=os.getenv("PINECONE_API_KEY"),
# environment="us-west1-gcp",
# index_name="rag-tutorial"
# )
#
# # Add documents
# documents = ["Machine learning is...", "Deep learning uses...", ...]
# rag.add_documents(documents, metadatas=[{"category": "AI"}, ...])
#
# # Query
# results = rag.query("What is machine learning?", top_k=3)
# for result in results:
# print(f"Score: {result['score']:.3f}")
# print(f"Text: {result['text']}")
# print(f"Metadata: {result['metadata']}\n")
Key Points:
- Managed service: No infrastructure to manage - Pinecone handles everything
- Automatic scaling: Handles millions of vectors without configuration
- Metadata indexing: Can index specific metadata fields for fast filtering
- Batch operations: Efficient batch upsert for large document sets
- Cost: Pay-per-use pricing, good for production but has ongoing costs
3. FAISS - High-Performance Self-Hosted Solution
What this does: Sets up FAISS (Facebook AI Similarity Search) for maximum performance and control. Best for research, self-hosted solutions, or when you need fine-grained control.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
class FAISSRAG:
"""
RAG implementation using FAISS for vector similarity search.
FAISS provides maximum performance and flexibility, but requires
more setup and infrastructure management than managed services.
"""
def __init__(self, dimension=384, index_type="L2"):
"""
Initialize FAISS index.
Args:
dimension: Embedding dimension (384 for all-MiniLM-L6-v2)
index_type: "L2" (Euclidean) or "IP" (Inner Product/Cosine for normalized)
"""
self.dimension = dimension
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
# Create FAISS index
# IndexFlatL2: Exact search (slower but accurate)
# IndexIVFFlat: Approximate search (faster, good for large datasets)
# For production, use IndexIVFFlat or IndexHNSWFlat for speed
# Simple exact search index (for small datasets)
if index_type == "L2":
self.index = faiss.IndexFlatL2(dimension)
else: # Inner product (for normalized embeddings = cosine similarity)
self.index = faiss.IndexFlatIP(dimension)
# Store document texts and metadata
self.documents = []
self.metadatas = []
print(f"Initialized FAISS index: {index_type}, dimension: {dimension}")
def add_documents(self, documents, metadatas=None):
"""
Add documents to FAISS index.
Args:
documents: List of document strings
metadatas: Optional list of metadata dictionaries
"""
if not documents:
raise ValueError("Documents list cannot be empty")
# Generate embeddings
embeddings = self.embedder.encode(documents, show_progress_bar=True)
# Normalize embeddings for cosine similarity (if using IP index)
if isinstance(self.index, faiss.IndexFlatIP):
faiss.normalize_L2(embeddings) # Normalize for cosine similarity
# Convert to numpy array (float32 for FAISS)
embeddings = np.array(embeddings).astype('float32')
# Add to index
self.index.add(embeddings)
# Store documents and metadata
self.documents.extend(documents)
if metadatas:
self.metadatas.extend(metadatas)
else:
self.metadatas.extend([{}] * len(documents))
print(f"Added {len(documents)} documents. Index now has {self.index.ntotal} vectors")
def search(self, query_text, top_k=5):
"""
Search for similar documents.
Args:
query_text: User query
top_k: Number of results
Returns:
List of (document, distance, metadata) tuples
"""
if self.index.ntotal == 0:
raise ValueError("Index is empty. Add documents first.")
# Generate query embedding
query_embedding = self.embedder.encode([query_text])
# Normalize if using IP index
if isinstance(self.index, faiss.IndexFlatIP):
query_embedding = query_embedding.astype('float32')
faiss.normalize_L2(query_embedding)
else:
query_embedding = query_embedding.astype('float32')
# Search
distances, indices = self.index.search(query_embedding, top_k)
# Format results
results = []
        for distance, idx in zip(distances[0], indices[0]):
            if 0 <= idx < len(self.documents):  # FAISS returns -1 when fewer than top_k results exist
                if isinstance(self.index, faiss.IndexFlatIP):
                    similarity = float(distance)  # inner product of normalized vectors = cosine similarity
                else:
                    similarity = 1.0 / (1.0 + float(distance))  # monotonic score from L2 distance (not a cosine value)
                results.append({
                    'document': self.documents[idx],
                    'distance': float(distance),
                    'similarity': similarity,
                    'metadata': self.metadatas[idx] if idx < len(self.metadatas) else {}
                })
return results
def save_index(self, filepath):
"""Save FAISS index to disk."""
faiss.write_index(self.index, filepath)
print(f"Saved index to {filepath}")
def load_index(self, filepath):
"""Load FAISS index from disk."""
self.index = faiss.read_index(filepath)
print(f"Loaded index from {filepath}")
# Example usage
rag = FAISSRAG(dimension=384, index_type="IP") # IP = Inner Product (cosine for normalized)
# Add documents
documents = [
"Machine learning algorithms learn from data.",
"Deep learning uses neural networks.",
"Python is a programming language."
]
rag.add_documents(documents, metadatas=[
{"topic": "ML"}, {"topic": "DL"}, {"topic": "programming"}
])
# Search
results = rag.search("What is machine learning?", top_k=2)
for result in results:
print(f"Similarity: {result['similarity']:.3f}")
print(f"Document: {result['document']}\n")
# Save for later use
rag.save_index("faiss_index.bin")
Key Points:
- Maximum performance: FAISS is highly optimized C++ code, very fast
- Full control: You control indexing, storage, and search parameters
- Index types: Choose between exact (IndexFlat) or approximate (IndexIVF, IndexHNSW) search
- Normalization: For cosine similarity with IP index, normalize embeddings first
- Persistence: Can save/load indexes to disk
- Use case: Best for research, maximum performance needs, or when you want full control
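For larger collections, here is a brief sketch of swapping the exact IndexFlatL2 used above for FAISS's approximate HNSW index (IndexHNSWFlat) on synthetic data; the M, efConstruction, and efSearch values are illustrative tuning knobs.
import faiss
import numpy as np

d = 384
rng = np.random.default_rng(0)
xb = rng.normal(size=(50_000, d)).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)        # 32 = M, the graph connectivity per node
index.hnsw.efConstruction = 200           # build-time accuracy/speed knob
index.add(xb)                             # HNSW needs no train() step

index.hnsw.efSearch = 64                  # query-time accuracy/speed knob
D, I = index.search(xb[:1], 5)
print(I, D)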
4. Metadata Filtering Implementation
What this does: Demonstrates how to combine vector similarity search with metadata filtering for precise retrieval.
import chromadb
from sentence_transformers import SentenceTransformer
class FilteredVectorSearch:
"""
Vector search with metadata filtering capabilities.
Filters documents by metadata before or after similarity search,
enabling precise retrieval based on document properties.
"""
def __init__(self):
self.client = chromadb.Client()
self.collection = self.client.create_collection(
name="filtered_docs",
metadata={"hnsw:space": "cosine"}
)
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
def add_documents_with_metadata(self, documents, metadatas):
"""
Add documents with rich metadata for filtering.
Args:
documents: List of document texts
metadatas: List of metadata dicts with filterable fields
"""
embeddings = self.embedder.encode(documents).tolist()
ids = [f"doc_{i}" for i in range(len(documents))]
self.collection.add(
documents=documents,
embeddings=embeddings,
ids=ids,
metadatas=metadatas
)
def search_with_filters(self, query, top_k=5, filters=None):
"""
Search with metadata filters.
Args:
query: Search query
top_k: Number of results
filters: Metadata filter dict (e.g., {"year": 2024, "category": "AI"})
Returns:
Filtered search results
"""
query_embedding = self.embedder.encode([query]).tolist()
query_kwargs = {
"query_embeddings": query_embedding,
"n_results": top_k
}
# Add metadata filters
if filters:
# Chroma supports where clauses
where_clause = {}
for key, value in filters.items():
if isinstance(value, list):
where_clause[key] = {"$in": value} # IN operator
                elif isinstance(value, dict):
                    where_clause[key] = value  # Operator dicts ($gte, $lte, $gt, $lt, ...) pass through
else:
where_clause[key] = value # Equality
query_kwargs["where"] = where_clause
results = self.collection.query(**query_kwargs)
return results
# Example: Filtered search
search = FilteredVectorSearch()
# Add documents with metadata
documents = [
"Machine learning tutorial for beginners",
"Advanced deep learning techniques",
"Python programming guide",
"Machine learning research paper 2024"
]
metadatas = [
{"category": "tutorial", "year": 2023, "difficulty": "beginner"},
{"category": "tutorial", "year": 2024, "difficulty": "advanced"},
{"category": "programming", "year": 2023, "difficulty": "beginner"},
{"category": "research", "year": 2024, "difficulty": "advanced"}
]
search.add_documents_with_metadata(documents, metadatas)
# Search with filters
results = search.search_with_filters(
"machine learning",
top_k=3,
filters={"year": 2024, "category": "tutorial"} # Only 2024 tutorials
)
print("Filtered results (2024 tutorials only):")
for doc in results['documents'][0]:
print(f"- {doc}")
Installation Requirements
Install the required packages based on your vector database choice:
# For Chroma
pip install chromadb sentence-transformers
# For Pinecone (the example above uses the legacy pinecone-client v2 API; newer versions ship as the pinecone package)
pip install pinecone-client sentence-transformers
# For FAISS
pip install faiss-cpu sentence-transformers # CPU version
# or: pip install faiss-gpu sentence-transformers # GPU version (faster)
Real-World Applications
Chunking in RAG Systems
- Document processing: Split large documents into manageable chunks for embedding and retrieval
- Context management: Ensure chunks fit within LLM context windows
- Retrieval optimization: Smaller, focused chunks improve retrieval precision
- Storage efficiency: Balance between chunk size and storage costs
Best Practices
- Chunk size: 200-1000 tokens (depends on the embedding model and LLM context)
- Overlap: 10-20% of chunk size
- Strategy: Use semantic chunking when possible, fall back to sentence-based, then fixed-size
- Testing: Evaluate retrieval quality with different chunk sizes