Chapter 4: Vector Databases

Storing and Searching Embeddings

Learning Objectives

  • Understand why RAG systems need vector databases
  • Master the mathematical foundations of approximate nearest neighbor indexing
  • Learn practical implementation with Chroma, Pinecone, and FAISS
  • Apply the concepts through worked examples
  • Recognize real-world applications and trade-offs

Vector Databases

Why Vector Databases are Essential for RAG

Once you've created embeddings for your documents, you need to store them somewhere and search through them efficiently. A traditional database (like PostgreSQL or MySQL) is designed for exact matches and structured queries, not for finding "similar" vectors. Vector databases are specialized databases optimized for storing and searching high-dimensional vectors using similarity metrics like cosine similarity.

The scale problem: In production RAG systems, you might have millions or billions of document chunks, each with a 384-1536 dimensional embedding vector. Searching through all of them to find the most similar to a query vector would take minutes to hours using brute-force methods. Vector databases use sophisticated indexing algorithms (like HNSW, IVF, or LSH) to bring similarity search down to milliseconds, even across billions of vectors.

What Vector Databases Provide

  • Fast Similarity Search: Find top-k most similar vectors in milliseconds, even with millions of documents
  • Scalable Storage: Efficiently store and index billions of high-dimensional vectors
  • Metadata Filtering: Combine vector similarity search with traditional filters (date, category, author, etc.)
  • Real-time Updates: Add, update, or delete vectors without rebuilding entire indexes
  • Approximate Nearest Neighbor (ANN): Trade some accuracy for massive speed improvements (1000-10000x faster than exact search)
Performance Comparison:

Brute-force search (linear scan): To find the top-5 most similar vectors among 1 million documents:

  • ❌ Compute 1 million similarity scores per query: roughly 16 seconds at the ~16 µs per comparison assumed in the worked examples later in this chapter
  • ❌ Select the top results: additional time
  • ❌ Cost grows linearly with collection size, so it becomes impractical well before billion scale

Vector database (HNSW index): Same task:

  • ✅ Find top-5 similar vectors: 50-200 milliseconds
  • ✅ Speedup: roughly 100-300x at this scale, and the gap keeps widening as the collection grows
  • ✅ Enables real-time RAG systems
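
For reference, here is a minimal brute-force baseline in NumPy (random stand-in data; a vectorized scan like this is faster than the per-comparison estimate above, but it still scales linearly with the number of documents, which is exactly the cost an index avoids):

import numpy as np

# Random stand-ins for document embeddings and one query (384-dim, unit-normalized).
# N is kept at 100,000 so the sketch runs quickly; increase it to see the linear cost grow.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100_000, 384)).astype("float32")
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query = rng.normal(size=384).astype("float32")
query /= np.linalg.norm(query)

# Brute-force cosine similarity: one dot product per document, then keep the best 5
scores = doc_vecs @ query                      # shape (100_000,)
top5 = np.argpartition(-scores, 5)[:5]         # unordered indices of the 5 highest scores
top5 = top5[np.argsort(-scores[top5])]         # sort those 5 by score, best first
print(top5, scores[top5])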

Key Concepts You'll Learn

  • Indexing Algorithms: HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), Product Quantization, and LSH - how they work and when to use each
  • Vector Database Options: Pinecone, Weaviate, Chroma, Qdrant, FAISS, Milvus - comparing features, performance, and use cases
  • Metadata Filtering: Combining vector similarity search with traditional database filters for precise retrieval
  • Exact vs Approximate Search: Understanding the trade-offs between accuracy and speed
  • Scaling Strategies: Sharding, distributed indexing, and optimization techniques for billion-scale systems
  • Production Considerations: Update strategies, index maintenance, and performance monitoring

Why this matters: Vector databases are the infrastructure that makes RAG systems practical at scale. Without them, you're limited to small document collections or unacceptably slow query times. Choosing the right vector database and indexing strategy directly impacts your RAG system's performance, cost, and scalability.

Key Concepts

Vector Database Indexing Strategies: How Fast Retrieval Works

Vector databases use sophisticated indexing algorithms to enable fast similarity search across millions of vectors. Understanding these indexing strategies is crucial for choosing the right database and optimizing performance.

1. HNSW (Hierarchical Navigable Small World)

What it is: HNSW is one of the most popular and effective indexing algorithms for approximate nearest neighbor (ANN) search. It builds a multi-layer graph in which each upper layer contains a subset of the nodes in the layer below, creating a "small world" network that allows efficient navigation.

How it works:

  • Multi-layer structure: Creates multiple layers (levels) of graphs, with the top layer having few nodes and bottom layer having all nodes
  • Greedy search: Starts at the top layer, finds the nearest neighbor, then moves to the next layer and continues
  • Small world property: Each node is connected to a small number of "long-range" connections, allowing fast navigation across the graph
  • Dynamic insertion: New vectors can be added without rebuilding the entire index

Advantages:

  • Very fast: Sub-millisecond search times even for millions of vectors
  • High accuracy: Can achieve 95%+ recall (finds 95% of true nearest neighbors)
  • Scalable: Works well with billions of vectors
  • Used by major databases: Pinecone, Weaviate, Qdrant all use HNSW variants

Trade-offs:

  • ⚠️ Memory intensive: Stores the graph structure in memory (though can be optimized)
  • ⚠️ Indexing time: Building the index takes time, though it's a one-time cost

When to use: For production RAG systems where speed and accuracy are critical. This is the default choice for most vector databases.
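
As a concrete illustration, here is a minimal HNSW sketch using FAISS's IndexHNSWFlat on random stand-in vectors (the parameter values are illustrative, not tuned recommendations):

import faiss
import numpy as np

d = 384                                                  # embedding dimension
rng = np.random.default_rng(0)
xb = rng.normal(size=(50_000, d)).astype("float32")      # random stand-in corpus
xq = rng.normal(size=(1, d)).astype("float32")           # one query vector

# M = 32 graph neighbors per node; pass faiss.METRIC_INNER_PRODUCT and normalize
# the vectors if you want cosine similarity instead of L2 distance
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200   # build-time search width (graph quality)
index.add(xb)                     # HNSW supports incremental adds, no training step

index.hnsw.efSearch = 64          # query-time search width (recall vs. latency knob)
distances, ids = index.search(xq, 5)   # top-5 approximate nearest neighbors
print(ids[0], distances[0])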

2. IVF (Inverted File Index)

What it is: IVF partitions the vector space into clusters (Voronoi cells) and creates an inverted index mapping each cluster to its vectors.

How it works:

  • Clustering: Uses k-means or similar to partition vectors into clusters
  • Inverted index: For each cluster, stores a list of vectors belonging to that cluster
  • Search: Finds the nearest cluster(s) to the query vector, then searches only within those clusters

Advantages:

  • Memory efficient: Lower memory footprint than HNSW
  • Fast for large datasets: Only searches relevant clusters, not all vectors
  • Used by FAISS: FAISS's IVF-Flat and IVF-PQ use this approach

Trade-offs:

  • ⚠️ Lower accuracy: May miss vectors near cluster boundaries
  • ⚠️ Requires tuning: Number of clusters needs careful selection

When to use: For very large datasets (billions of vectors) where memory is a constraint, or when using FAISS.
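
A minimal IVF sketch with FAISS follows (the cluster count and nprobe values are illustrative; in practice they need tuning for your data):

import faiss
import numpy as np

d, nlist = 384, 1024               # dimension, number of clusters (Voronoi cells)
rng = np.random.default_rng(0)
xb = rng.normal(size=(100_000, d)).astype("float32")
xq = rng.normal(size=(1, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)                 # assigns vectors to their nearest cluster
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                  # k-means clustering over the data
index.add(xb)

index.nprobe = 8                                 # clusters scanned per query (accuracy vs. speed)
distances, ids = index.search(xq, 5)
print(ids[0])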

3. Product Quantization (PQ)

What it is: A compression technique that reduces vector storage by splitting each vector into sub-vectors and quantizing (discretizing) them against small codebooks of representative values.

How it works:

  • Vector splitting: Splits each vector into multiple sub-vectors
  • Quantization: Each sub-vector is mapped to a "codebook" (set of representative vectors)
  • Compression: Stores only the codebook indices, not full vectors
  • Fast distance: Uses lookup tables for fast approximate distance calculations

Advantages:

  • Massive storage reduction: Can reduce storage by 10-100x
  • Fast search: Approximate distances computed quickly using lookup tables
  • Enables billion-scale: Makes it feasible to store billions of vectors

Trade-offs:

  • ⚠️ Accuracy loss: Compression introduces approximation errors
  • ⚠️ Training required: Codebooks need to be trained on representative data

When to use: For extremely large datasets where storage is a primary concern. Often combined with IVF (IVF-PQ).
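
A minimal IVF-PQ sketch with FAISS, matching the compression example later in this chapter (8 sub-vectors with 256-entry codebooks, i.e. 8 bytes per stored vector; all values are illustrative):

import faiss
import numpy as np

d, nlist = 384, 1024
m, nbits = 8, 8                    # 8 sub-vectors x 8-bit codes -> 8 bytes per vector
rng = np.random.default_rng(0)
xb = rng.normal(size=(100_000, d)).astype("float32")
xq = rng.normal(size=(1, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                    # trains both the IVF clustering and the PQ codebooks
index.add(xb)                      # stores 8-byte codes instead of 1,536-byte float vectors

index.nprobe = 16
distances, ids = index.search(xq, 5)
print(ids[0])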

4. LSH (Locality-Sensitive Hashing)

What it is: Uses hash functions that map similar vectors to the same hash buckets, enabling fast approximate search.

How it works:

  • Hash functions: Creates multiple hash functions that preserve similarity (similar vectors are likely to land in the same bucket)
  • Bucketing: Vectors are placed into hash buckets
  • Search: Query vector is hashed, then only vectors in the same bucket(s) are searched

Advantages:

  • Very fast: Constant-time hash lookup
  • Simple: Easy to understand and implement

Trade-offs:

  • ⚠️ Lower accuracy: May miss some similar vectors
  • ⚠️ Parameter tuning: Number of hash functions and buckets needs tuning

When to use: For very fast, approximate search when some accuracy loss is acceptable. Less common in modern RAG systems.

Metadata Filtering: Combining Vector Search with Traditional Filters

Real-world RAG systems often need to filter documents by metadata (date, author, category, etc.) in addition to semantic similarity. Vector databases support this through metadata filtering.

How Metadata Filtering Works

Two-stage process:

  1. Filter first: Apply metadata filters to reduce the search space (e.g., "only documents from 2023")
  2. Search in filtered set: Perform vector similarity search only on the filtered documents

Example:

Query with Metadata Filtering

Query: "machine learning best practices"

Metadata filters:

  • Category = "Technical Blog"
  • Date >= "2023-01-01"
  • Author = "John Doe"

Process:

  1. Filter 1M documents → 50K documents matching metadata
  2. Vector search in 50K documents → Top 5 most similar

✅ Much faster than searching all 1M documents!

Types of Metadata Filters

  • Equality filters: author = "John Doe"
  • Range filters: date >= "2023-01-01" AND date <= "2023-12-31"
  • In filters: category IN ["Tech", "Science"]
  • Boolean combinations: (category = "Tech" OR category = "Science") AND date >= "2023"
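
In Chroma, for example, these filter types are expressed as a where clause. Here is a minimal self-contained sketch (the collection name, fields, and documents are made up for illustration; the operator syntax follows Chroma's query language):

import chromadb

client = chromadb.Client()                      # in-memory client, for illustration
col = client.create_collection("filter_demo")
col.add(
    ids=["a", "b", "c", "d"],
    documents=[
        "Tech article from 2024",
        "Science article from 2023",
        "Tech article from 2021",
        "Science article from 2022",
    ],
    metadatas=[
        {"category": "Tech", "year": 2024, "author": "John Doe"},
        {"category": "Science", "year": 2023, "author": "Jane Roe"},
        {"category": "Tech", "year": 2021, "author": "John Doe"},
        {"category": "Science", "year": 2022, "author": "Jane Roe"},
    ],
)

# Equality, range, IN, and boolean combinations as a single where clause
results = col.query(
    query_texts=["machine learning best practices"],   # embedded with Chroma's default model
    n_results=2,
    where={"$and": [
        {"category": {"$in": ["Tech", "Science"]}},    # IN filter
        {"year": {"$gte": 2023}},                      # range filter
    ]},
)
print(results["documents"])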

Benefits of Metadata Filtering

  • Faster search: Reduces the number of vectors to search
  • More relevant results: Ensures results match business constraints
  • Better user experience: Users can narrow down by date, source, etc.
  • Compliance: Can filter by access permissions, data retention policies

Vector Database Operations: Indexing, Querying, and Maintenance

1. Indexing (One-Time Setup)

What happens: When you add documents to a vector database, they go through an indexing process:

  1. Embedding generation: Each document chunk is converted to a vector using an embedding model
  2. Metadata extraction: Extract and store metadata (title, date, author, etc.)
  3. Index building: Vector is inserted into the index structure (HNSW, IVF, etc.)
  4. Storage: Vector, metadata, and original text are stored

Performance considerations:

  • ⚠️ Indexing is slow: Building indexes takes time (minutes to hours for large datasets)
  • ⚠️ Batch processing: More efficient to index in batches rather than one-by-one (see the sketch after this list)
  • ⚠️ Incremental updates: Some databases support adding vectors without rebuilding (HNSW), others require full rebuild (IVF)
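
A minimal batched-indexing sketch with Chroma (the names and batch size are illustrative; the same pattern applies to other databases):

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
col = client.create_collection("batch_demo")

docs = [f"chunk {i}: placeholder text..." for i in range(5_000)]
batch_size = 500
for start in range(0, len(docs), batch_size):
    batch = docs[start:start + batch_size]
    col.add(
        ids=[str(start + j) for j in range(len(batch))],
        documents=batch,
        embeddings=embedder.encode(batch).tolist(),   # embed one batch at a time
    )
print("indexed:", col.count())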

2. Querying (Per-Request)

What happens: When a user query arrives:

  1. Query embedding: Convert query to vector using the same embedding model
  2. Metadata filtering (optional): Apply metadata filters to reduce search space
  3. Vector search: Use the index to find top-k most similar vectors
  4. Result retrieval: Return document IDs, metadata, and similarity scores
  5. Document fetching: Retrieve actual document text using IDs

Performance considerations:

  • Very fast: Sub-100ms for millions of vectors with good indexes
  • Scalable: Query time grows slowly with dataset size (logarithmic for HNSW)
  • ⚠️ First query slower: May need to load index into memory

3. Maintenance Operations

Index updates: When documents are added, updated, or deleted:

  • Add: Insert new vector into index (fast for HNSW, may require rebuild for IVF)
  • Update: Delete old vector, insert new one (or update in-place if supported)
  • Delete: Remove vector from index (mark as deleted or physically remove)

Index optimization:

  • Rebuilding: Periodically rebuild index to optimize structure (especially for IVF)
  • Compaction: Remove deleted vectors and optimize storage
  • Monitoring: Track index size, query performance, accuracy metrics

Mathematical Formulations

Vector Database Performance Metrics

Vector databases use sophisticated indexing algorithms to enable fast similarity search. Understanding the mathematical foundations helps you choose the right database, configure indexes, and optimize performance. These formulas describe how vector databases achieve sub-millisecond search times even with millions of vectors.

1. HNSW Search Complexity

\[T_{\text{search}} = O(\log N)\]
What This Represents:

HNSW (Hierarchical Navigable Small World) achieves logarithmic search time complexity, meaning search time grows very slowly as the number of vectors increases. This is why it can search millions of vectors in milliseconds.

Breaking It Down:
  • \(T_{\text{search}}\): Time complexity of search operation
  • \(O(\log N)\): Big-O notation indicating logarithmic time complexity
  • \(N\): Number of vectors in the database
What Logarithmic Means:

If you double the number of vectors, search time increases by a constant amount (not doubled). For example:

  • 1,000 vectors: ~10 operations
  • 10,000 vectors: ~13 operations (only 30% more!)
  • 1,000,000 vectors: ~20 operations (only 100% more for 1000x more data!)
Comparison to Brute-Force:

Brute-force: \(O(N)\) - linear time. To search 1M vectors, you must compare query with all 1M vectors.

HNSW: \(O(\log N)\) - logarithmic time. To search 1M vectors, you only need ~20 comparisons by navigating the graph structure.

Why HNSW is Fast:

HNSW builds a multi-layer graph where each layer has fewer nodes. Search starts at the top (few nodes), finds approximate location, then refines in lower layers. This hierarchical approach dramatically reduces comparisons needed.

2. IVF Cluster Search Reduction

\[T_{\text{search}} = O(\sqrt{N} + k)\]
What This Represents:

IVF (Inverted File Index) partitions vectors into clusters. Instead of searching all \(N\) vectors, you only search within the nearest cluster(s), dramatically reducing search space.

Breaking It Down:
  • \(O(\sqrt{N})\): Time to find the nearest cluster(s) - grows with square root of N
  • \(O(k)\): Time to search within the cluster(s) - constant or linear in cluster size
  • Total: Much faster than \(O(N)\) brute-force search
How It Works:
  1. Partition all vectors into \(\sqrt{N}\) clusters using k-means
  2. For a query, find the nearest cluster(s): \(O(\sqrt{N})\) operations
  3. Search only within those clusters: \(O(k)\) where k is cluster size
  4. Total: \(O(\sqrt{N} + k)\) instead of \(O(N)\)
Example:

1,000,000 vectors partitioned into 1,000 clusters (1,000 vectors per cluster):

  • Brute-force: Compare with all 1,000,000 vectors
  • IVF: Find nearest cluster (1,000 comparisons) + search within cluster (1,000 comparisons) = 2,000 total comparisons
  • Speedup: 500x faster!
Trade-off:

IVF is faster than brute-force but may miss vectors near cluster boundaries. HNSW is more accurate but uses more memory. Choose based on your accuracy vs speed requirements.

3. Product Quantization Compression Ratio

\[\text{compression\_ratio} = \frac{\text{original\_size}}{\text{compressed\_size}} = \frac{d \times 32 \text{ bits}}{m \times \log_2(k) \text{ bits}}\]
What This Measures:

Product Quantization (PQ) compresses vectors by quantizing sub-vectors into codebooks. This formula calculates the compression ratio achieved.

Breaking It Down:
  • \(d\): Original vector dimension (e.g., 384)
  • \(32 \text{ bits}\) (4 bytes): Size per float32 value (original storage)
  • \(m\): Number of sub-vectors (e.g., 8 sub-vectors of 48 dimensions each)
  • \(k\): Codebook size (number of quantization levels, e.g., 256)
  • \(\log_2(k) \text{ bits}\): Bits needed to store codebook index (e.g., \(\log_2(256) = 8\) bits = 1 byte)
Example:

Original: 384-dimensional vector = 384 × 4 bytes = 1,536 bytes
PQ compressed: 8 sub-vectors × 1 byte = 8 bytes
Compression ratio: \(\frac{1536}{8} = 192x\) reduction!

Trade-off:

Higher compression = less storage but some accuracy loss. Typical PQ achieves 10-100x compression with minimal accuracy degradation (95%+ recall maintained).
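
The arithmetic above is easy to verify directly (a few lines of Python; the values match the worked example):

import math

d, m, k = 384, 8, 256                       # dimension, sub-vectors, codebook size
original_bytes = d * 4                      # float32 storage: 1,536 bytes
compressed_bytes = m * math.log2(k) / 8     # 8 codes x 8 bits = 8 bytes
print(original_bytes, compressed_bytes, original_bytes / compressed_bytes)   # 1536 8.0 192.0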

4. Vector Database Query Time

\[T_{\text{query}} = T_{\text{embed}} + T_{\text{search}} + T_{\text{retrieve}}\]
What This Represents:

Total query time in a RAG system is the sum of embedding generation, vector search, and document retrieval times. Understanding this breakdown helps you optimize each component.

Breaking It Down:
  • \(T_{\text{embed}}\): Time to convert query text to embedding vector (typically 10-50ms for local models, 50-200ms for API calls)
  • \(T_{\text{search}}\): Time to find top-k similar vectors in the database (typically 10-100ms with HNSW for millions of vectors)
  • \(T_{\text{retrieve}}\): Time to fetch actual document text using retrieved IDs (typically 1-10ms if documents are cached)
Typical Breakdown:

For a query in a system with 1M documents:

  • Embedding: 20ms (local model) or 100ms (API)
  • Vector search: 50ms (HNSW index)
  • Document retrieval: 5ms (cached)
  • Total: 75ms (local) or 155ms (API)
Optimization Strategies:
  • Reduce \(T_{\text{embed}}\): Cache query embeddings for common queries, use faster embedding models
  • Reduce \(T_{\text{search}}\): Use efficient indexes (HNSW), limit search space with metadata filters
  • Reduce \(T_{\text{retrieve}}\): Cache documents in memory, use fast storage (SSD, in-memory cache)
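
A simple way to see where query time goes is to time each stage separately. The sketch below uses Chroma and a sentence-transformers model as stand-ins (in Chroma, document retrieval happens inside query(), so \(T_{\text{search}}\) and \(T_{\text{retrieve}}\) are measured together):

import time
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
col = client.create_collection("timing_demo")

docs = [f"placeholder document number {i}" for i in range(1_000)]
col.add(ids=[str(i) for i in range(len(docs))],
        documents=docs,
        embeddings=embedder.encode(docs).tolist())

t0 = time.perf_counter()
q_emb = embedder.encode(["example query"]).tolist()        # T_embed
t1 = time.perf_counter()
hits = col.query(query_embeddings=q_emb, n_results=5)      # T_search + T_retrieve
t2 = time.perf_counter()

print(f"T_embed          : {(t1 - t0) * 1000:.1f} ms")
print(f"T_search+retrieve: {(t2 - t1) * 1000:.1f} ms")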

5. Index Build Time

\[T_{\text{build}} = O(N \log N)\]
What This Represents:

Building an HNSW index takes \(O(N \log N)\) time, where \(N\) is the number of vectors. This is a one-time cost when indexing documents, but it's important to understand for planning indexing operations.

Breaking It Down:
  • \(N\): Number of vectors to index
  • \(O(N \log N)\): Time complexity - grows faster than linear but slower than quadratic
  • For each vector, the algorithm needs to find its position in the graph structure
Practical Times:
  • 10,000 vectors: ~1-5 seconds
  • 100,000 vectors: ~30-120 seconds (1-2 minutes)
  • 1,000,000 vectors: ~10-30 minutes
  • 10,000,000 vectors: ~2-8 hours
Strategies for Large Datasets:
  • Batch indexing: Index in batches rather than one-by-one
  • Incremental updates: Use databases that support adding vectors without full rebuild (HNSW supports this)
  • Parallel indexing: Use multiple CPU cores to speed up index building
  • Background indexing: Build index in background while serving queries from old index

Detailed Examples

Example 1: Vector Database Indexing and Querying - Complete Workflow

Scenario: Setting up a vector database for a RAG system with 10,000 document chunks.

Step 1: Document Preparation

You have 10,000 document chunks, each already embedded:

  • Chunk 1: "Machine learning is..." → Embedding: [0.45, -0.23, 0.67, ...] (384-dim)
  • Chunk 2: "Deep learning uses..." → Embedding: [0.48, -0.25, 0.65, ...]
  • ... (9,998 more chunks)

Step 2: Index Building

  • All 10,000 embeddings are inserted into the vector database
  • HNSW index is built (typically a few seconds for 10K vectors, in line with the build-time estimates above)
  • Index creates a multi-layer graph structure for fast search
  • Total storage: 10,000 × 384 × 4 bytes = ~15 MB (just for embeddings)

Step 3: Query Processing

User query: "What is neural network training?"

  • Query embedding: [0.46, -0.24, 0.66, ...]
  • Vector database searches the HNSW index
  • Finds top-5 most similar vectors in ~10-50 milliseconds
  • Returns document IDs: [doc_123, doc_456, doc_789, doc_234, doc_567]

Step 4: Document Retrieval

  • Using document IDs, fetch actual text from storage
  • Returns: ["Neural networks are trained using...", "Training involves...", ...]
  • Total query time: ~50-100ms (embedding + search + retrieval)

Performance: At 10,000 documents a brute-force scan is still workable (roughly 160 ms per query at the ~16 µs per comparison assumed in Example 3), but the HNSW index already answers in ~10-50 ms, and its advantage grows to hundreds of times once the collection reaches millions of documents (see Example 3).

Example 2: Metadata Filtering in Action

Scenario: A knowledge base with documents from different years, and you want to filter by date before similarity search.

Knowledge Base: 1,000,000 documents

  • 500,000 documents from 2023
  • 300,000 documents from 2024
  • 200,000 documents from 2025

Query: "Latest machine learning trends"

Without Metadata Filtering:

  • Search all 1,000,000 documents
  • Time: ~200ms
  • Results might include outdated 2023 documents

With Metadata Filtering (date >= 2024):

  • Filter to 500,000 documents (2024 + 2025)
  • Search only in filtered set
  • Time: ~100ms (about half the time)
  • Results are more recent and relevant

Combined Filter Example:

Query: "Python machine learning tutorials from 2024"

  • Metadata filters: category="tutorial", year=2024, language="Python"
  • Filter from 1M → 50,000 documents
  • Vector search in 50K documents: ~30ms
  • ✅ Much faster and more precise than searching all 1M documents

Example 3: HNSW vs Brute-Force Performance Comparison

Scenario: Comparing search performance for different database sizes.

Test Setup: Find top-5 most similar vectors to a query

1,000 Documents:

  • Brute-force: Compare query with all 1,000 vectors = 1,000 comparisons = ~16ms
  • HNSW: Navigate graph structure = ~10 comparisons = ~2ms
  • Speedup: 8x faster

100,000 Documents:

  • Brute-force: 100,000 comparisons = ~1.6 seconds
  • HNSW: ~15 comparisons = ~5ms
  • Speedup: 320x faster!

1,000,000 Documents:

  • Brute-force: 1,000,000 comparisons = ~16 seconds (unacceptable!)
  • HNSW: ~20 comparisons = ~50ms
  • Speedup: 320x faster!

10,000,000 Documents:

  • Brute-force: 10,000,000 comparisons = ~2.7 minutes (completely impractical!)
  • HNSW: ~25 comparisons = ~100ms
  • Speedup: 1,600x faster!

Key Insight: As database size grows, brute-force search slows down linearly (in direct proportion to the number of vectors), while HNSW search time grows only logarithmically. This is why vector databases are essential for production RAG systems.
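
You can reproduce the qualitative trend with FAISS (a rough benchmark sketch on random vectors; absolute numbers depend on hardware, and FAISS's flat index is heavily SIMD-optimized, so the measured gap will be smaller than the illustrative per-comparison figures above):

import time
import numpy as np
import faiss

d, n, k = 384, 100_000, 5
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")
xq = rng.normal(size=(1, d)).astype("float32")

flat = faiss.IndexFlatL2(d); flat.add(xb)           # exact brute-force scan
hnsw = faiss.IndexHNSWFlat(d, 32); hnsw.add(xb)     # approximate graph search

for name, index in [("brute-force", flat), ("HNSW", hnsw)]:
    start = time.perf_counter()
    for _ in range(100):                            # average over repeated queries
        index.search(xq, k)
    print(f"{name}: {(time.perf_counter() - start) / 100 * 1000:.2f} ms/query")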

Example 4: Incremental Updates to Vector Database

Scenario: Adding new documents to an existing vector database without rebuilding the entire index.

Initial State:

  • Database has 100,000 documents indexed
  • HNSW index is built and optimized
  • Query time: ~50ms

New Documents Arrive:

  • 1,000 new documents need to be added
  • Each document is embedded: [0.45, -0.23, 0.67, ...]

Incremental Update Process:

Step 1: Embed new documents

  • Generate embeddings for 1,000 new documents
  • Time: ~10-30 seconds (depending on embedding model)

Step 2: Insert into HNSW index

  • For each new vector, find its position in the graph
  • Connect it to nearest neighbors in each layer
  • Time: ~5-10 seconds for 1,000 vectors
  • ✅ No need to rebuild entire index!

Step 3: Verify

  • Database now has 101,000 documents
  • Query time: ~52ms (slightly slower, but still fast)
  • New documents are immediately searchable

Comparison to Full Rebuild:

  • Incremental update: ~15-40 seconds total
  • Full rebuild: ~10-30 minutes (would require rebuilding entire index)
  • ✅ Incremental updates are 15-120x faster!
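
With a graph index such as HNSW, the incremental path really is just "add more vectors". A minimal FAISS sketch on random stand-in data (note that FAISS's IndexHNSWFlat supports incremental adds but not in-place deletion, which hosted databases typically handle via tombstoning):

import numpy as np
import faiss

d = 384
rng = np.random.default_rng(0)
index = faiss.IndexHNSWFlat(d, 32)

index.add(rng.normal(size=(100_000, d)).astype("float32"))   # initial build
print("indexed:", index.ntotal)                               # 100000

index.add(rng.normal(size=(1_000, d)).astype("float32"))      # incremental insert, no rebuild
print("indexed:", index.ntotal)                               # 101000

# New vectors are immediately searchable
distances, ids = index.search(rng.normal(size=(1, d)).astype("float32"), 5)
print(ids[0])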

Example 5: Choosing Between Vector Database Options

Scenario: You need to choose a vector database for your RAG system. Here's how different options compare for a specific use case.

Use Case: 5 million documents, need sub-100ms query time, cloud deployment preferred

Option 1: Pinecone (Managed Cloud)

  • ✅ Setup: 5 minutes (just create account, API key)
  • ✅ Scaling: Automatic, handles 5M+ documents easily
  • ✅ Performance: ~50ms query time
  • ✅ Maintenance: Zero (fully managed)
  • ❌ Cost: ~$70-200/month for 5M vectors
  • ❌ Vendor lock-in: Data stored in Pinecone's cloud
  • Best for: Quick deployment, teams without DevOps resources

Option 2: Chroma (Self-Hosted)

  • ✅ Setup: 30-60 minutes (install, configure, deploy)
  • ✅ Scaling: Manual (need to set up infrastructure)
  • ✅ Performance: ~60-80ms query time
  • ✅ Cost: ~$20-50/month (server costs only)
  • ✅ Control: Full control over data and infrastructure
  • ❌ Maintenance: You manage servers, updates, backups
  • Best for: Cost-sensitive, need data control, have DevOps team

Option 3: FAISS (Library, Self-Hosted)

  • ✅ Setup: 1-2 hours (integrate into your application)
  • ✅ Performance: ~40-60ms query time (very fast)
  • ✅ Cost: Server costs only
  • ✅ Flexibility: Full control, can customize
  • ❌ Scaling: Manual, need to implement sharding yourself
  • ❌ Features: No built-in metadata filtering, need to implement yourself
  • Best for: Research, maximum performance, full customization needed

Recommendation for this use case: Pinecone for fastest deployment, Chroma for cost savings, FAISS for maximum performance and control.

Implementation

Implementation Overview

This section provides practical Python code examples for working with vector databases in RAG systems. The examples demonstrate how to set up, index documents, query, and manage vector databases using popular options like Chroma, Pinecone, and FAISS.

1. Chroma Vector Database - Complete Setup

What this does: Sets up Chroma, indexes documents with embeddings, and performs similarity search. Chroma is easy to use and good for self-hosted solutions.

import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
import numpy as np

class ChromaRAG:
    """
    Complete RAG implementation using Chroma vector database.
    
    This class handles:
    1. Document indexing with embeddings
    2. Query processing and retrieval
    3. Metadata filtering
    """
    
    def __init__(self, collection_name="documents", persist_directory="./chroma_db"):
        """
        Initialize Chroma client and collection.
        
        Args:
            collection_name: Name of the collection to create/use
            persist_directory: Directory to persist data (for PersistentClient)
        """
        # Use PersistentClient for production (saves to disk)
        self.client = chromadb.PersistentClient(path=persist_directory)
        
        # Get or create collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}  # Use cosine similarity
        )
        
        # Initialize embedding model
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        print(f"Initialized Chroma collection: {collection_name}")
    
    def add_documents(self, documents, ids=None, metadatas=None):
        """
        Add documents to the vector database.
        
        Args:
            documents: List of document text strings
            ids: Optional list of document IDs (auto-generated if None)
            metadatas: Optional list of metadata dictionaries
        """
        if not documents:
            raise ValueError("Documents list cannot be empty")
        
        # Generate IDs if not provided
        if ids is None:
            ids = [f"doc_{i}" for i in range(len(documents))]
        
        # Generate embeddings
        print(f"Generating embeddings for {len(documents)} documents...")
        embeddings = self.embedder.encode(documents, show_progress_bar=True)
        embeddings_list = embeddings.tolist()
        
        # Leave metadatas as None when not supplied; some Chroma versions
        # reject empty metadata dicts like {}
        
        # Add to collection
        self.collection.add(
            documents=documents,
            embeddings=embeddings_list,
            ids=ids,
            metadatas=metadatas
        )
        
        print(f"Added {len(documents)} documents to collection")
        print(f"Collection now has {self.collection.count()} total documents")
    
    def query(self, query_text, n_results=5, where_filter=None):
        """
        Query the vector database for similar documents.
        
        Args:
            query_text: User query string
            n_results: Number of results to return
            where_filter: Optional metadata filter (e.g., {"year": 2024})
            
        Returns:
            Dictionary with 'documents', 'ids', 'distances', 'metadatas'
        """
        # Generate query embedding
        query_embedding = self.embedder.encode([query_text])
        
        # Build query
        query_kwargs = {
            "query_embeddings": query_embedding.tolist(),
            "n_results": n_results
        }
        
        # Add metadata filter if provided
        if where_filter:
            query_kwargs["where"] = where_filter
        
        # Execute query
        results = self.collection.query(**query_kwargs)
        
        return {
            'documents': results['documents'][0],
            'ids': results['ids'][0],
            'distances': results['distances'][0],
            'metadatas': results['metadatas'][0]
        }
    
    def update_document(self, doc_id, new_text, new_metadata=None):
        """
        Update an existing document.
        
        Args:
            doc_id: ID of document to update
            new_text: New document text
            new_metadata: New metadata (optional)
        """
        # Generate new embedding
        new_embedding = self.embedder.encode([new_text])[0].tolist()
        
        # Update in collection
        self.collection.update(
            ids=[doc_id],
            documents=[new_text],
            embeddings=[new_embedding],
            metadatas=[new_metadata] if new_metadata else None
        )
        print(f"Updated document: {doc_id}")
    
    def delete_documents(self, doc_ids):
        """Delete documents by IDs."""
        self.collection.delete(ids=doc_ids)
        print(f"Deleted {len(doc_ids)} documents")

# Example usage
rag = ChromaRAG(collection_name="knowledge_base")

# Add documents with metadata
documents = [
    "Machine learning is a subset of AI that learns from data.",
    "Deep learning uses neural networks with multiple layers.",
    "Python is a popular programming language for data science."
]

metadatas = [
    {"topic": "machine_learning", "year": 2024, "category": "AI"},
    {"topic": "deep_learning", "year": 2024, "category": "AI"},
    {"topic": "programming", "year": 2024, "category": "languages"}
]

ids = ["doc_ml", "doc_dl", "doc_python"]

rag.add_documents(documents, ids=ids, metadatas=metadatas)

# Query without filter
results = rag.query("What is machine learning?", n_results=2)
print("\nQuery results:")
for i, (doc, distance) in enumerate(zip(results['documents'], results['distances']), 1):
    print(f"{i}. {doc} (distance: {distance:.3f})")

# Query with metadata filter
filtered_results = rag.query(
    "What is machine learning?",
    n_results=2,
    where_filter={"category": "AI"}  # Only search in AI category
)
print("\nFiltered results (AI category only):")
for i, doc in enumerate(filtered_results['documents'], 1):
    print(f"{i}. {doc}")
Key Points:
  • Persistent storage: Uses PersistentClient to save data to disk (survives restarts)
  • Automatic indexing: Chroma automatically builds HNSW index for fast search
  • Metadata filtering: Can filter by metadata before similarity search
  • Update support: Can update documents without rebuilding entire index
  • Distance vs similarity: Chroma returns distances (lower = more similar), not similarity scores

2. Pinecone Vector Database - Cloud Deployment

What this does: Sets up Pinecone (managed cloud service), indexes documents, and performs queries. Ideal for production deployments without infrastructure management.

import pinecone
from sentence_transformers import SentenceTransformer
import os

class PineconeRAG:
    """
    RAG implementation using Pinecone (managed cloud vector database).
    
    Pinecone handles infrastructure, scaling, and optimization automatically.
    Good for production systems where you want minimal DevOps overhead.
    """
    
    def __init__(self, api_key, environment, index_name="rag-index"):
        """
        Initialize Pinecone connection.
        
        Args:
            api_key: Pinecone API key (get from pinecone.io)
            environment: Pinecone environment (e.g., "us-west1-gcp")
            index_name: Name of the index to create/use
        """
        # Initialize Pinecone
        # NOTE: this example targets the legacy pinecone-client 2.x API (pinecone.init);
        # newer releases of the pinecone package expose a Pinecone client class instead
        pinecone.init(api_key=api_key, environment=environment)
        
        # Get or create index
        if index_name not in pinecone.list_indexes():
            # Create index with specifications
            pinecone.create_index(
                index_name,
                dimension=384,  # Match embedding dimension
                metric="cosine",  # Use cosine similarity
                metadata_config={"indexed": ["category", "year", "source"]}  # Indexed metadata for filtering
            )
            print(f"Created new index: {index_name}")
        else:
            print(f"Using existing index: {index_name}")
        
        self.index = pinecone.Index(index_name)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        print(f"Connected to Pinecone index: {index_name}")
    
    def add_documents(self, documents, ids=None, metadatas=None):
        """
        Add documents to Pinecone index.
        
        Args:
            documents: List of document strings
            ids: Optional list of IDs
            metadatas: Optional list of metadata dicts
        """
        if ids is None:
            ids = [f"doc_{i}" for i in range(len(documents))]
        
        if metadatas is None:
            metadatas = [{}] * len(documents)
        
        # Generate embeddings
        embeddings = self.embedder.encode(documents, show_progress_bar=True)
        
        # Prepare vectors for upsert (Pinecone format)
        vectors = []
        for i, (doc_id, embedding, metadata) in enumerate(zip(ids, embeddings, metadatas)):
            vectors.append({
                "id": doc_id,
                "values": embedding.tolist(),
                "metadata": {**metadata, "text": documents[i]}  # Store text in metadata
            })
        
        # Upsert in batches (Pinecone supports batch operations)
        batch_size = 100
        for i in range(0, len(vectors), batch_size):
            batch = vectors[i:i + batch_size]
            self.index.upsert(vectors=batch)
        
        print(f"Added {len(documents)} documents to Pinecone")
        print(f"Index stats: {self.index.describe_index_stats()}")
    
    def query(self, query_text, top_k=5, filter_dict=None):
        """
        Query Pinecone index.
        
        Args:
            query_text: User query
            top_k: Number of results
            filter_dict: Optional metadata filter (e.g., {"category": "AI"})
            
        Returns:
            Query results with documents, scores, and metadata
        """
        # Generate query embedding
        query_embedding = self.embedder.encode([query_text])[0].tolist()
        
        # Build query
        query_kwargs = {
            "vector": query_embedding,
            "top_k": top_k,
            "include_metadata": True
        }
        
        # Add filter if provided
        if filter_dict:
            query_kwargs["filter"] = filter_dict
        
        # Execute query
        results = self.index.query(**query_kwargs)
        
        # Format results
        formatted_results = []
        for match in results['matches']:
            formatted_results.append({
                'id': match['id'],
                'score': match['score'],  # Pinecone returns similarity scores (higher = more similar)
                'text': match['metadata'].get('text', ''),
                'metadata': {k: v for k, v in match['metadata'].items() if k != 'text'}
            })
        
        return formatted_results

# Example usage
# Initialize (requires Pinecone API key)
# rag = PineconeRAG(
#     api_key=os.getenv("PINECONE_API_KEY"),
#     environment="us-west1-gcp",
#     index_name="rag-tutorial"
# )
#
# # Add documents
# documents = ["Machine learning is...", "Deep learning uses...", ...]
# rag.add_documents(documents, metadatas=[{"category": "AI"}, ...])
#
# # Query
# results = rag.query("What is machine learning?", top_k=3)
# for result in results:
#     print(f"Score: {result['score']:.3f}")
#     print(f"Text: {result['text']}")
#     print(f"Metadata: {result['metadata']}\n")
Key Points:
  • Managed service: No infrastructure to manage - Pinecone handles everything
  • Automatic scaling: Handles millions of vectors without configuration
  • Metadata indexing: Can index specific metadata fields for fast filtering
  • Batch operations: Efficient batch upsert for large document sets
  • Cost: Pay-per-use pricing, good for production but has ongoing costs

3. FAISS - High-Performance Self-Hosted Solution

What this does: Sets up FAISS (Facebook AI Similarity Search) for maximum performance and control. Best for research, self-hosted solutions, or when you need fine-grained control.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class FAISSRAG:
    """
    RAG implementation using FAISS for vector similarity search.
    
    FAISS provides maximum performance and flexibility, but requires
    more setup and infrastructure management than managed services.
    """
    
    def __init__(self, dimension=384, index_type="L2"):
        """
        Initialize FAISS index.
        
        Args:
            dimension: Embedding dimension (384 for all-MiniLM-L6-v2)
            index_type: "L2" (Euclidean) or "IP" (Inner Product/Cosine for normalized)
        """
        self.dimension = dimension
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        
        # Create FAISS index
        # IndexFlatL2: Exact search (slower but accurate)
        # IndexIVFFlat: Approximate search (faster, good for large datasets)
        # For production, use IndexIVFFlat or IndexHNSWFlat for speed
        
        # Simple exact search index (for small datasets)
        if index_type == "L2":
            self.index = faiss.IndexFlatL2(dimension)
        else:  # Inner product (for normalized embeddings = cosine similarity)
            self.index = faiss.IndexFlatIP(dimension)
        
        # Store document texts and metadata
        self.documents = []
        self.metadatas = []
        print(f"Initialized FAISS index: {index_type}, dimension: {dimension}")
    
    def add_documents(self, documents, metadatas=None):
        """
        Add documents to FAISS index.
        
        Args:
            documents: List of document strings
            metadatas: Optional list of metadata dictionaries
        """
        if not documents:
            raise ValueError("Documents list cannot be empty")
        
        # Generate embeddings and convert to float32 (FAISS requires float32 arrays)
        embeddings = self.embedder.encode(documents, show_progress_bar=True)
        embeddings = np.array(embeddings, dtype='float32')
        
        # Normalize embeddings for cosine similarity (if using IP index)
        if isinstance(self.index, faiss.IndexFlatIP):
            faiss.normalize_L2(embeddings)  # in-place L2 normalization
        
        # Add to index
        self.index.add(embeddings)
        
        # Store documents and metadata
        self.documents.extend(documents)
        if metadatas:
            self.metadatas.extend(metadatas)
        else:
            self.metadatas.extend([{}] * len(documents))
        
        print(f"Added {len(documents)} documents. Index now has {self.index.ntotal} vectors")
    
    def search(self, query_text, top_k=5):
        """
        Search for similar documents.
        
        Args:
            query_text: User query
            top_k: Number of results
            
        Returns:
            List of (document, distance, metadata) tuples
        """
        if self.index.ntotal == 0:
            raise ValueError("Index is empty. Add documents first.")
        
        # Generate query embedding
        query_embedding = self.embedder.encode([query_text])
        
        # Normalize if using IP index
        if isinstance(self.index, faiss.IndexFlatIP):
            query_embedding = query_embedding.astype('float32')
            faiss.normalize_L2(query_embedding)
        else:
            query_embedding = query_embedding.astype('float32')
        
        # Search
        distances, indices = self.index.search(query_embedding, top_k)
        
        # Format results
        results = []
        for i, (distance, idx) in enumerate(zip(distances[0], indices[0])):
            if 0 <= idx < len(self.documents):  # FAISS returns -1 when fewer than top_k results exist
                results.append({
                    'document': self.documents[idx],
                    'distance': float(distance),
                    # For the IP index on normalized vectors, FAISS's "distance" is already the
                    # cosine similarity; L2 distance has no direct cosine equivalent here, so a
                    # monotonic transform of the distance is reported instead
                    'similarity': float(distance) if isinstance(self.index, faiss.IndexFlatIP)
                    else 1.0 / (1.0 + float(distance)),
                    'metadata': self.metadatas[idx] if idx < len(self.metadatas) else {}
                })
        
        return results
    
    def save_index(self, filepath):
        """Save FAISS index to disk."""
        faiss.write_index(self.index, filepath)
        print(f"Saved index to {filepath}")
    
    def load_index(self, filepath):
        """Load FAISS index from disk."""
        self.index = faiss.read_index(filepath)
        print(f"Loaded index from {filepath}")

# Example usage
rag = FAISSRAG(dimension=384, index_type="IP")  # IP = Inner Product (cosine for normalized)

# Add documents
documents = [
    "Machine learning algorithms learn from data.",
    "Deep learning uses neural networks.",
    "Python is a programming language."
]
rag.add_documents(documents, metadatas=[
    {"topic": "ML"}, {"topic": "DL"}, {"topic": "programming"}
])

# Search
results = rag.search("What is machine learning?", top_k=2)
for result in results:
    print(f"Similarity: {result['similarity']:.3f}")
    print(f"Document: {result['document']}\n")

# Save for later use
rag.save_index("faiss_index.bin")
Key Points:
  • Maximum performance: FAISS is highly optimized C++ code, very fast
  • Full control: You control indexing, storage, and search parameters
  • Index types: Choose between exact (IndexFlat) or approximate (IndexIVF, IndexHNSW) search
  • Normalization: For cosine similarity with IP index, normalize embeddings first
  • Persistence: Can save/load indexes to disk
  • Use case: Best for research, maximum performance needs, or when you want full control

4. Metadata Filtering Implementation

What this does: Demonstrates how to combine vector similarity search with metadata filtering for precise retrieval.

import chromadb
from sentence_transformers import SentenceTransformer

class FilteredVectorSearch:
    """
    Vector search with metadata filtering capabilities.
    
    Filters documents by metadata before or after similarity search,
    enabling precise retrieval based on document properties.
    """
    
    def __init__(self):
        self.client = chromadb.Client()
        self.collection = self.client.create_collection(
            name="filtered_docs",
            metadata={"hnsw:space": "cosine"}
        )
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
    
    def add_documents_with_metadata(self, documents, metadatas):
        """
        Add documents with rich metadata for filtering.
        
        Args:
            documents: List of document texts
            metadatas: List of metadata dicts with filterable fields
        """
        embeddings = self.embedder.encode(documents).tolist()
        ids = [f"doc_{i}" for i in range(len(documents))]
        
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=ids,
            metadatas=metadatas
        )
    
    def search_with_filters(self, query, top_k=5, filters=None):
        """
        Search with metadata filters.
        
        Args:
            query: Search query
            top_k: Number of results
            filters: Metadata filter dict (e.g., {"year": 2024, "category": "AI"})
            
        Returns:
            Filtered search results
        """
        query_embedding = self.embedder.encode([query]).tolist()
        
        query_kwargs = {
            "query_embeddings": query_embedding,
            "n_results": top_k
        }
        
        # Add metadata filters
        if filters:
            # Build one Chroma condition per field
            conditions = []
            for key, value in filters.items():
                if isinstance(value, list):
                    conditions.append({key: {"$in": value}})  # IN operator
                elif isinstance(value, dict):
                    conditions.append({key: value})  # Range/operator clauses, e.g. {"$gte": 2024}
                else:
                    conditions.append({key: value})  # Equality
            
            # Recent Chroma versions require multiple conditions to be wrapped in $and
            query_kwargs["where"] = conditions[0] if len(conditions) == 1 else {"$and": conditions}
        
        results = self.collection.query(**query_kwargs)
        return results

# Example: Filtered search
search = FilteredVectorSearch()

# Add documents with metadata
documents = [
    "Machine learning tutorial for beginners",
    "Advanced deep learning techniques",
    "Python programming guide",
    "Machine learning research paper 2024"
]

metadatas = [
    {"category": "tutorial", "year": 2023, "difficulty": "beginner"},
    {"category": "tutorial", "year": 2024, "difficulty": "advanced"},
    {"category": "programming", "year": 2023, "difficulty": "beginner"},
    {"category": "research", "year": 2024, "difficulty": "advanced"}
]

search.add_documents_with_metadata(documents, metadatas)

# Search with filters
results = search.search_with_filters(
    "machine learning",
    top_k=3,
    filters={"year": 2024, "category": "tutorial"}  # Only 2024 tutorials
)

print("Filtered results (2024 tutorials only):")
for doc in results['documents'][0]:
    print(f"- {doc}")

Installation Requirements

Install the required packages based on your vector database choice:

# For Chroma
pip install chromadb sentence-transformers

# For Pinecone (the example above uses the legacy 2.x client API)
pip install pinecone-client sentence-transformers

# For FAISS
pip install faiss-cpu sentence-transformers  # CPU version
# or: pip install faiss-gpu sentence-transformers  # GPU version (faster)

Real-World Applications

Chunking in RAG Systems

Document processing: Split large documents into manageable chunks for embedding and retrieval

Context management: Ensure chunks fit within LLM context windows

Retrieval optimization: Smaller, focused chunks improve retrieval precision

Storage efficiency: Balance between chunk size and storage costs

Best Practices

Chunk size: 200-1000 tokens (depends on embedding model and LLM context)

Overlap: 10-20% of chunk size

Strategy: Use semantic chunking when possible, fallback to sentence-based, then fixed-size

Testing: Evaluate retrieval quality with different chunk sizes

Test Your Understanding

Question 1: What is a vector database?

A) A regular SQL database
B) While vector databases do store data like traditional databases, they're specifically designed for high-dimensional vector similarity search using algorithms like HNSW and IVF, which is fundamentally different from SQL query-based retrieval
C) Vector databases are caching layers that temporarily store frequently accessed data to improve response times in applications
D) A specialized database optimized for storing and efficiently searching high-dimensional vector embeddings using similarity search algorithms

Question 2: Interview question: "What are the key features of vector databases for RAG?"

A) A vector database is primarily a text indexing system that stores documents as plain text files with basic search capabilities
B) A file system
C) Although vector databases handle storage similar to file systems, their core functionality is optimized approximate nearest neighbor search for embeddings, not just file organization or basic text storage
D) Fast approximate nearest neighbor (ANN) search, scalable to millions of vectors, metadata filtering, real-time updates, and support for similarity metrics (cosine, Euclidean, dot product)

Question 3: What are popular vector databases used in RAG systems?

A) While vector databases do store data like traditional databases, they're specifically designed for high-dimensional vector similarity search using algorithms like HNSW and IVF, which is fundamentally different from SQL query-based retrieval
B) Vector databases are essentially the same as traditional relational databases, just with a different storage format for data organization
C) Pinecone, Weaviate, Chroma, Qdrant, FAISS, Milvus, and pgvector (PostgreSQL extension)
D) A regular SQL database

Question 4: Interview question: "What is HNSW indexing and why is it used in vector databases?"

A) A text storage system
B) Hierarchical Navigable Small World - a graph-based ANN algorithm that provides fast approximate search with good accuracy, commonly used in production vector databases
C) Although vector databases handle storage similar to file systems, their core functionality is optimized approximate nearest neighbor search for embeddings, not just file organization or basic text storage
D) Vector databases are file systems optimized for storing large amounts of data, similar to cloud storage services but with faster access

Question 5: What is IVF (Inverted File Index) in vector databases?

A) Vector databases share some characteristics with caching systems in terms of fast access, but they're specialized for semantic similarity search using vector embeddings, which requires completely different indexing and query mechanisms
B) Vector databases are file systems optimized for storing large amounts of data, similar to cloud storage services but with faster access
C) An indexing method that partitions vector space into clusters (Voronoi cells), enabling faster search by only searching relevant clusters
D) A file system

Question 6: Interview question: "How do you choose between different vector databases?"

A) Vector databases are essentially the same as traditional relational databases, just with a different storage format for data organization
B) Consider scale (millions vs billions), deployment (cloud vs self-hosted), features (metadata filtering, real-time updates), cost, ease of use, and integration with your stack
C) A file system
D) Although vector databases handle storage similar to file systems, their core functionality is optimized approximate nearest neighbor search for embeddings, not just file organization or basic text storage

Question 7: What is metadata filtering in vector databases?

A) Filtering search results by document metadata (date, author, category) before or after similarity search, enabling precise retrieval
B) Vector databases share some characteristics with caching systems in terms of fast access, but they're specialized for semantic similarity search using vector embeddings, which requires completely different indexing and query mechanisms
C) Vector databases are essentially the same as traditional relational databases, just with a different storage format for data organization
D) A caching system

Question 8: Interview question: "What is the difference between exact and approximate nearest neighbor search?"

A) Exact and approximate search always return identical results; the only difference is the API used to call them
B) Exact search finds true nearest neighbors but is slow for large datasets. Approximate (ANN) is much faster with high accuracy, using indexing (HNSW, IVF) to trade some accuracy for speed
C) Approximate search only works for low-dimensional vectors, while exact search is required for high-dimensional embeddings
D) Exact search uses cosine similarity, while approximate search can only use Euclidean distance

Question 9: What is FAISS and when would you use it?

A) Facebook AI Similarity Search - a library for efficient similarity search, good for self-hosted solutions, research, and when you need fine-grained control over indexing
B) A managed cloud vector database service that requires a paid subscription and cannot be run locally
C) An embedding model family used to convert text into vectors before they are stored
D) A metadata filtering language used to narrow down results before similarity search

Question 10: Interview question: "How do you handle vector database updates in a production RAG system?"

A) Vector databases are file systems optimized for storing large amounts of data, similar to cloud storage services but with faster access
B) A file system
C) While vector databases do store data like traditional databases, they're specifically designed for high-dimensional vector similarity search using algorithms like HNSW and IVF, which is fundamentally different from SQL query-based retrieval
D) Implement incremental updates, batch processing for large changes, re-indexing strategies, versioning for document updates, and ensure consistency between embeddings and metadata

Question 11: What is the difference between Pinecone and self-hosted solutions like Chroma?

A) Vector databases are caching layers that temporarily store frequently accessed data to improve response times in applications
B) Pinecone is managed cloud service (easy setup, scaling, but cost). Chroma is self-hosted (more control, lower cost, but requires infrastructure management)
C) A regular SQL database
D) While vector databases do store data like traditional databases, they're specifically designed for high-dimensional vector similarity search using algorithms like HNSW and IVF, which is fundamentally different from SQL query-based retrieval

Question 12: Interview question: "How do you scale a vector database for millions of documents?"

A) Vector databases share some characteristics with caching systems in terms of fast access, but they're specialized for semantic similarity search using vector embeddings, which requires completely different indexing and query mechanisms
B) A regular SQL database
C) Vector databases are file systems optimized for storing large amounts of data, similar to cloud storage services but with faster access
D) Use distributed indexing, sharding by metadata or hash, horizontal scaling with multiple nodes, efficient indexing algorithms (HNSW), and consider approximate search for speed