Chapter 5: Retrieval Strategies

Finding Relevant Information

Learning Objectives

  • Understand retrieval strategies fundamentals
  • Master the mathematical foundations
  • Learn practical implementation
  • Apply knowledge through examples
  • Recognize real-world applications

Retrieval Strategies

The Heart of RAG: Finding the Right Documents

Retrieval is arguably the most critical component of a RAG system. If you retrieve irrelevant documents, even the best LLM can't generate a good answer. Retrieval strategies determine how you find the most relevant documents (or chunks) for a user's query from your knowledge base.

The retrieval challenge: Users ask questions in natural language, but your knowledge base contains documents written in different styles, using different terminology, and covering various topics. A query like "How do I train a neural network?" needs to find documents about "neural network training," "model training," "deep learning training," or "backpropagation" - even if those exact phrases don't appear in the documents.

Two Fundamental Approaches

There are two fundamentally different ways to find relevant documents:

  1. Dense Retrieval (Semantic Search): Uses vector embeddings to find documents with similar meaning, even if they use different words. Handles synonyms, paraphrasing, and conceptual queries well.
  2. Sparse Retrieval (Keyword Search): Uses keyword matching (BM25, TF-IDF) to find documents containing specific terms. Fast and good for exact term matching, but misses semantic relationships.

Hybrid Retrieval combines both approaches, often achieving better results than either method alone by leveraging the strengths of both semantic understanding and exact keyword matching.

Example: Why Retrieval Strategy Matters

Query: "Python machine learning library"

Dense retrieval finds: Documents about "scikit-learn," "TensorFlow," "PyTorch" (semantic similarity - these are ML libraries for Python)

Sparse retrieval finds: Documents containing exact phrase "Python machine learning library"

Hybrid retrieval finds: Both semantically similar libraries AND documents with exact phrase matches, providing comprehensive coverage

Result: Hybrid retrieval gives you the best of both worlds - you are far less likely to miss relevant documents.

Key Concepts You'll Learn

  • Dense vs Sparse Retrieval: Understanding when to use semantic search vs keyword search, and the trade-offs between them
  • Hybrid Retrieval: Combining dense and sparse methods with weighted scoring for optimal results
  • BM25 Algorithm: The most popular sparse retrieval algorithm - how it works and why it's better than simple TF-IDF
  • Reranking: Using more sophisticated models (cross-encoders, LLMs) to refine initial retrieval results for better accuracy
  • Top-k Selection: Choosing how many documents to retrieve - balancing context richness with computational cost
  • Similarity Thresholds: Filtering out low-quality matches to ensure only relevant documents are used
  • Retrieval Evaluation: Measuring retrieval quality using precision@k, recall@k, and other metrics

Why this matters: Retrieval quality directly determines RAG system performance. Poor retrieval means the LLM receives irrelevant context, leading to inaccurate or hallucinated answers. Good retrieval ensures the LLM has the right information to generate accurate, grounded responses. Most RAG system improvements come from better retrieval, not just better LLMs.

Key Concepts

Retrieval Strategies: Dense, Sparse, and Hybrid Approaches

Retrieval is the heart of RAG systems—it's the process of finding the most relevant documents for a user's query. There are three main approaches, each with distinct characteristics and use cases.

1. Dense Retrieval (Semantic Search)

What it is: Dense retrieval uses vector embeddings to represent both queries and documents in a high-dimensional space, then finds documents whose embeddings are "close" to the query embedding using similarity metrics like cosine similarity.

How it works (see the sketch after these steps):

  1. Convert the query into a dense vector using an embedding model (e.g., SentenceTransformer, OpenAI embeddings)
  2. Compare this query vector with all document vectors in the vector database
  3. Calculate similarity scores (cosine similarity, dot product, or Euclidean distance)
  4. Return the top-k documents with highest similarity scores
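Below is a minimal sketch of these four steps, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model also used in the Implementation section; the example documents and query are illustrative.

# Minimal dense-retrieval sketch (illustrative documents and model choice)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    "Neural networks are trained using backpropagation.",
    "Training deep learning models requires tuning the learning rate.",
    "Bananas are rich in potassium.",
]

# Step 1: embed the query and the documents (normalized vectors)
doc_vecs = model.encode(docs, normalize_embeddings=True)            # shape: (n_docs, dim)
query_vec = model.encode(["How do I train a neural network?"],
                         normalize_embeddings=True)                 # shape: (1, dim)

# Steps 2-3: cosine similarity = dot product of normalized vectors
scores = (doc_vecs @ query_vec.T).flatten()

# Step 4: return the top-k documents by similarity
top_k = 2
for idx in np.argsort(scores)[::-1][:top_k]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")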

Key Advantages:

  • Semantic understanding: Understands meaning, not just keywords. A query about "automobile" will find documents about "car," "vehicle," or "auto" even if those exact words aren't present.
  • Handles synonyms and paraphrasing: "Machine learning" and "ML" are treated as similar concepts. "How do I train a model?" finds documents about "model training" even if they use different wording.
  • Context-aware: Understands that "Python programming" is different from "python snake" based on context.
  • Language-agnostic: Works across languages if using multilingual embeddings (e.g., multilingual-MiniLM)
  • Most common in modern RAG: This is the default approach for most production RAG systems

Limitations:

  • More computationally expensive: Requires embedding models and vector similarity calculations
  • May miss exact keyword matches: If a document uses very specific terminology not well-represented in the embedding model's training data, it might be missed
  • Requires good embeddings: Quality depends heavily on the embedding model used
  • Storage overhead: Need to store vector embeddings for all documents

Example: Dense Retrieval in Action

Query: "How do I train a neural network?"

Document 1: "Training deep learning models requires adjusting hyperparameters like learning rate..." (No exact match for "neural network")

Document 2: "Neural networks are trained using backpropagation..." (Exact match)

Result: Dense retrieval finds BOTH documents because it understands that "neural network" and "deep learning model" are semantically similar concepts, even though Document 1 doesn't contain the exact phrase.

2. Sparse Retrieval (Keyword Search)

What it is: Sparse retrieval uses keyword matching based on term frequency and document frequency. The most common algorithms are BM25 (Best Matching 25) and TF-IDF (Term Frequency-Inverse Document Frequency).

How it works (see the sketch after these steps):

  1. Extract keywords from the query (tokenization, stemming, stop word removal)
  2. Count how often each keyword appears in each document (term frequency)
  3. Weight keywords by their rarity across all documents (inverse document frequency - rare words are more important)
  4. Calculate a relevance score for each document based on keyword matches
  5. Return the top-k documents with highest scores
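A minimal sketch of these steps using the rank-bm25 library (the same library the Implementation section uses). The documents and query are illustrative, and the whitespace tokenizer stands in for proper tokenization, stemming, and stop-word removal.

# Minimal sparse-retrieval sketch with BM25 (illustrative corpus)
from rank_bm25 import BM25Okapi

docs = [
    "machine learning algorithms learn from data",
    "deep learning uses neural networks",
    "python is popular for data science",
]

# Steps 1-2: tokenize documents (whitespace split only, for brevity)
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

# Steps 3-5: score every document for the query and take the top-k
query_tokens = "machine learning".lower().split()
scores = bm25.get_scores(query_tokens)
top_k = 2
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
for i in ranked:
    print(f"{scores[i]:.3f}  {docs[i]}")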

Key Advantages:

  • Fast: Very efficient, especially with inverted indexes (traditional search engine technology used by Google, Elasticsearch)
  • Good for exact matches: Excellent when you need to find documents containing specific terms or phrases
  • Interpretable: You can see exactly which keywords matched and why a document was retrieved
  • No embedding model needed: Works with raw text, no neural networks required
  • Low storage: Only stores keyword indexes, not full vectors

Limitations:

  • No semantic understanding: "automobile" won't find "car" unless both terms are present in the document
  • Requires exact term matches: Misses synonyms, paraphrases, and related concepts
  • Language-specific: Needs language-specific tokenization and stemming
  • Poor for conceptual queries: Struggles with queries like "things that help you learn" (no specific keywords)

Example: Sparse Retrieval Limitation

Query: "automobile safety features"

Document: "Modern cars have advanced safety systems including airbags and ABS brakes."

Result: ❌ Sparse retrieval may MISS or underrank this document because it doesn't contain "automobile" or "features" as exact terms (only "safety" matches), even though the document is highly relevant (it mentions "cars" and "safety systems").

3. Hybrid Retrieval: Best of Both Worlds

What it is: Hybrid retrieval combines dense and sparse retrieval, taking advantage of both approaches' strengths. This is often the best approach for production RAG systems.

How it works (see the sketch after these steps):

  1. Run both dense retrieval and sparse retrieval separately
  2. Get top-k results from each method (e.g., top 50 from each)
  3. Combine the results (union or intersection of result sets)
  4. Rerank the combined results using a weighted score: final_score = α × dense_score + (1-α) × sparse_score where α is typically 0.3-0.7
  5. Return the top-k documents from the reranked list
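A short sketch of steps 3-4, using hypothetical dense and sparse score arrays for four candidate documents; min-max normalization (covered in the Mathematical Formulations section) puts both on a [0, 1] scale before the weighted combination.

# Hybrid score combination sketch (hypothetical scores for four candidates)
import numpy as np

def min_max(x):
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return np.ones_like(x) if rng < 1e-8 else (x - x.min()) / rng

dense_scores  = np.array([0.65, 0.72, 0.58, 0.80])   # cosine similarities (hypothetical)
sparse_scores = np.array([12.5, 45.3, 8.2, 67.8])    # raw BM25 scores (hypothetical)

alpha = 0.7  # weight on the dense score, matching the formula later in this chapter
hybrid = alpha * min_max(dense_scores) + (1 - alpha) * min_max(sparse_scores)

print(hybrid.round(3))            # one combined score per candidate document
print(np.argsort(hybrid)[::-1])   # candidate indices ranked best-first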

Why Hybrid Works Better:

  • Catches both semantic and exact matches: Dense finds semantically similar docs, sparse finds exact keyword matches
  • More comprehensive: Less likely to miss relevant documents
  • Better for diverse queries: Some queries benefit from semantic search (conceptual questions), others from keyword search (specific term lookups)
  • Production-proven: Widely used in production search systems that combine lexical and semantic signals

Example: Hybrid Retrieval

Query: "Python machine learning library"

Dense retrieval finds: Documents about "scikit-learn," "TensorFlow," "PyTorch" (semantic similarity - these are ML libraries for Python)

Sparse retrieval finds: Documents containing exact phrase "Python machine learning library"

Hybrid result: Combines both, ensuring you get both semantically similar libraries AND documents that explicitly mention the exact phrase. This gives you comprehensive coverage.

When to use:

  • ✅ When you have diverse query types (some need semantic understanding, others need exact matches)
  • ✅ When retrieval quality is critical and you can afford the computational cost
  • ✅ For production systems where you want maximum coverage and accuracy

Reranking: Refining Initial Retrieval Results

Why Reranking is Necessary

Initial retrieval (whether dense, sparse, or hybrid) is fast but not perfect. It uses relatively simple similarity calculations that may not capture the nuanced relationship between a query and a document. Reranking is a second-stage refinement that improves relevance by using more sophisticated (but slower) models.

The Problem with Initial Retrieval:

  • Embedding similarity is approximate: Two documents might have similar embeddings but different relevance to a specific query. For example, two documents about "Python" might be similar (both about programming), but one might be about "Python for data science" and the other about "Python web development" - only one is relevant to a query about "data science."
  • Keyword matching is surface-level: A document might contain all query keywords but not actually answer the question. For example, a document might mention "machine learning" and "tutorial" but be about a different topic entirely.
  • Context matters: The same words in different contexts have different meanings. "Bank" could mean financial institution or river bank.
  • Query intent: Initial retrieval doesn't deeply understand what the user is really asking. A query "how to install" might retrieve documents about installation in general, but the user might specifically want installation on Windows.

The Solution: Reranking

Reranking takes the top-k candidates from initial retrieval (e.g., top 100) and uses a more sophisticated model to re-score and reorder them, producing a final top-k list (e.g., top 5-10) that better matches query intent.

Reranking Methods

1. Cross-Encoder Reranking

What it is: A neural network model (typically BERT-based) that processes the query and document together in a single forward pass, allowing the model to see the full interaction between query and document through attention mechanisms.

How it works:

  • Input: [CLS] query [SEP] document [SEP] (concatenated query and document)
  • The model processes both simultaneously, allowing attention mechanisms to identify which parts of the document are most relevant to the query
  • Output: A relevance score (typically 0-1 or a similarity score)
  • Documents are reranked by these scores

Advantages:

  • More accurate: Significantly better than embedding-based similarity because it sees the full query-document interaction
  • Context-aware: Understands how query and document relate in context
  • Better for nuanced queries: Handles complex questions that require deep understanding
  • Production-ready models: Pre-trained models available (e.g., ms-marco-MiniLM, cross-encoder models from SentenceTransformers)

Disadvantages:

  • Slower: Every (query, document) pair requires a full forward pass at query time; document representations can't be precomputed offline the way embeddings can
  • More expensive: Requires running a neural network for each candidate
  • Limited to top-k: Too slow to use on entire document collection (typically used on top 50-200 candidates)

When to use: When you have a small candidate set (50-200 documents) from initial retrieval and want maximum accuracy. This is the most common reranking approach in production RAG systems.

2. LLM-Based Reranking

What it is: Uses a large language model (like GPT-4, Claude) to score or rank documents based on their relevance to the query.

How it works (see the sketch after this list):

  • For each candidate document, create a prompt asking the LLM to score relevance
  • Example prompt: "Rate how relevant this document is to the query 'X' on a scale of 1-10. Document: [document text]"
  • Use LLM's output to rerank documents
  • Alternatively, ask LLM to directly rank documents in order of relevance
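A hedged sketch of this approach using the OpenAI Python SDK. The model name, prompt wording, and 1-10 scale are illustrative assumptions, not requirements; any capable LLM could be substituted.

# LLM-based reranking sketch (model name and prompt are illustrative)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_relevance_score(query: str, document: str) -> float:
    prompt = (
        f"Rate how relevant this document is to the query '{query}' "
        f"on a scale of 1-10. Respond with only the number.\n\nDocument: {document}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                # reduces (but does not eliminate) non-determinism
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0                    # fall back if the model returns non-numeric text

def llm_rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    scored = [(llm_relevance_score(query, d), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]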

Advantages:

  • Most accurate: LLMs have deep understanding of language and context
  • Handles complex queries: Can understand nuanced questions and multi-part queries
  • Flexible: Can be prompted to consider specific factors (relevance, completeness, accuracy)

Disadvantages:

  • Very slow: LLM inference is much slower than specialized reranking models
  • Expensive: API costs can be high if reranking many documents
  • Inconsistent: LLM outputs can be non-deterministic
  • Limited scalability: Not practical for high-volume systems

When to use: For high-value queries where accuracy is critical and you can afford the latency and cost. Often used in research or low-volume production systems.

Reranking Best Practices

  • Use two-stage retrieval: Fast initial retrieval (dense/sparse) → slower reranking on top-k candidates
  • Typical pipeline: Retrieve top 100-200 → Rerank to top 5-10
  • Balance accuracy vs speed: More accurate reranking is slower; choose based on your latency requirements
  • Cache reranking results: For common queries, cache reranked results to avoid recomputation (see the sketch after this list)
  • Monitor performance: Track reranking accuracy, latency, and cost
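A minimal caching sketch for the practice above, using functools.lru_cache. The keyword-overlap scorer is a stand-in so the sketch runs standalone; in a real pipeline the expensive call would be a cross-encoder rerank like the one in the Implementation section.

# Caching reranked results for repeated queries (stand-in scorer for illustration)
from functools import lru_cache
from typing import Tuple

def expensive_rerank(query: str, documents: Tuple[str, ...], top_k: int) -> Tuple[str, ...]:
    # Placeholder for a cross-encoder call; sorts by simple keyword overlap
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return tuple(sorted(documents, key=overlap, reverse=True)[:top_k])

@lru_cache(maxsize=1024)
def cached_rerank(query: str, documents: Tuple[str, ...], top_k: int = 5) -> Tuple[str, ...]:
    # Documents are passed as a tuple so the arguments are hashable for the cache
    return expensive_rerank(query, documents, top_k)

# Usage: cached_rerank("python ml library", tuple(candidate_docs), top_k=5)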

Mathematical Formulations

Retrieval Scoring Formulas Overview

Retrieval strategies use mathematical formulas to score and rank documents. Understanding these formulas helps you choose the right retrieval method, tune parameters, and optimize performance. This section covers the key formulas used in dense, sparse, hybrid, and reranking approaches.

1. Hybrid Retrieval Score

\[\text{hybrid\_score}(q, d) = \alpha \cdot \text{cosine}(E(q), E(d)) + (1 - \alpha) \cdot \text{BM25}(q, d)\]
What This Formula Combines:

Hybrid retrieval combines dense (semantic) and sparse (keyword) retrieval scores using a weighted average. This leverages both semantic understanding and exact keyword matching for better results than either method alone.

Breaking It Down:
  • \(\text{cosine}(E(q), E(d))\): Dense retrieval score - cosine similarity between query embedding \(E(q)\) and document embedding \(E(d)\). Range: [-1, 1], typically [0, 1] for normalized embeddings.
  • \(\text{BM25}(q, d)\): Sparse retrieval score - BM25 keyword matching score. Range: [0, ∞), but typically normalized to [0, 1] before combination.
  • \(\alpha\): Weighting factor controlling the balance between dense and sparse scores. Range: [0, 1]
  • \((1 - \alpha)\): Weight for sparse score (ensures weights sum to 1)
Choosing \(\alpha\):
  • \(\alpha = 0.7\) (70% dense, 30% sparse): Emphasizes semantic similarity. Good when synonyms and paraphrasing are important.
  • \(\alpha = 0.5\) (50/50): Balanced approach, most common in production systems
  • \(\alpha = 0.3\) (30% dense, 70% sparse): Emphasizes exact keyword matching. Good when specific terminology is critical.
Why Hybrid Works Better:

Dense retrieval finds semantically similar documents (handles synonyms), while sparse retrieval finds documents with exact keyword matches. Combining both ensures you don't miss relevant documents that use different terminology OR exact phrases.

Example:

Query: "Python machine learning library"

Document 1: "scikit-learn is a Python ML library" (semantic match, no exact phrase)
Dense score: 0.85, BM25 score: 0.60
Hybrid score (α=0.7): \(0.7 \times 0.85 + 0.3 \times 0.60 = 0.775\)

Document 2: "Python machine learning library tutorial" (exact phrase match)
Dense score: 0.80, BM25 score: 0.95
Hybrid score (α=0.7): \(0.7 \times 0.80 + 0.3 \times 0.95 = 0.845\)

✅ Both documents are retrieved, with Document 2 ranked higher due to exact phrase match.

2. BM25 (Best Matching 25) Score

\[\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{\text{avgdl}})}\]
What This Formula Measures:

BM25 is a probabilistic ranking function that scores documents based on how well they match query keywords. It's an improvement over TF-IDF that handles document length normalization better and is widely used in search engines (including Elasticsearch, Solr).

Breaking It Down:
  • \(\sum_{t \in q}\): Sum over all terms \(t\) in the query \(q\)
  • \(\text{IDF}(t)\): Inverse Document Frequency - measures how rare/important a term is. Formula: \(\text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}\) where \(N\) is total documents and \(n(t)\) is documents containing term \(t\)
  • \(f(t, d)\): Term frequency - how many times term \(t\) appears in document \(d\)
  • \(\frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{\text{avgdl}})}\): Term frequency saturation function - prevents very frequent terms from dominating
  • \(k_1\): Term frequency saturation parameter (typically 1.2-2.0). Controls how quickly term frequency saturates
  • \(b\): Length normalization parameter (typically 0.75). Controls how much document length affects the score
  • \(|d|\): Document length (number of words/tokens)
  • \(\text{avgdl}\): Average document length in the collection
Key Components Explained:

IDF Component: Rare terms (low document frequency) get higher IDF scores. This means documents containing rare, specific terms score higher than documents with common terms.

Term Frequency Saturation: The fraction \(\frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \ldots}\) ensures that:

  • First occurrence of a term: high contribution
  • Additional occurrences: diminishing returns (saturation)
  • Prevents documents with excessive term repetition from scoring too high

Length Normalization: The term \(1 - b + b \times \frac{|d|}{\text{avgdl}}\) penalizes very long documents. Longer documents naturally have more term matches, so this normalization prevents them from always ranking highest.

Typical Parameter Values:
  • \(k_1 = 1.2\): Standard value, good for most use cases
  • \(b = 0.75\): Standard value, moderate length normalization
  • \(b = 0\): No length normalization (rarely used)
  • \(b = 1\): Full length normalization (strong penalty for long docs)
Example:

Query: "machine learning"
Document 1: "Machine learning is a subset of AI. Machine learning uses algorithms." (2 occurrences, short doc)
Document 2: "Machine learning machine learning machine learning..." (10 occurrences, very long doc)

BM25 gives Document 1 a higher score because:

  • Length normalization penalizes Document 2's excessive length
  • Term frequency saturation means the extra occurrences in Document 2 don't help much
  • Document 1 has a better balance of term frequency and document length
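To make the saturation and length-normalization effects concrete, the sketch below plugs illustrative numbers (k1 = 1.2, b = 0.75, an assumed average length of 100 tokens) into the term-frequency component of the formula, ignoring IDF.

# Per-term BM25 contribution (without IDF) for growing term frequency
k1, b, avgdl = 1.2, 0.75, 100.0

def tf_component(tf: int, doc_len: float) -> float:
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

for tf in (1, 2, 5, 10):
    short = tf_component(tf, doc_len=50)    # shorter than average
    long_ = tf_component(tf, doc_len=400)   # much longer than average
    print(f"tf={tf:2d}  short doc: {short:.2f}   long doc: {long_:.2f}")
# The contributions flatten as tf grows (saturation) and stay consistently
# lower for the long document (length normalization).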

Why BM25 is Better Than TF-IDF:
  • Better length normalization: Handles document length more intelligently
  • Term frequency saturation: Prevents excessive term repetition from dominating scores
  • Proven in practice: Used by major search engines and information retrieval systems

3. Inverse Document Frequency (IDF)

\[\text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}\]
What This Measures:

IDF measures how "rare" or "important" a term is across the entire document collection. Rare terms (appearing in few documents) get higher IDF scores, making them more important for ranking.

Breaking It Down:
  • \(N\): Total number of documents in the collection
  • \(n(t)\): Number of documents containing term \(t\)
  • \(N - n(t)\): Number of documents NOT containing term \(t\)
  • \(+ 0.5\): Smoothing factor to avoid division by zero and handle edge cases
  • \(\log\): Logarithm (typically natural log or base 2) - compresses the range
Intuition:

If a term appears in many documents (common word like "the", "is"), it's not very informative - low IDF. If a term appears in few documents (rare word like "quantum", "neural"), it's very informative - high IDF.

Example:

Collection: 10,000 documents

Term "the": appears in 9,500 documents
IDF: \(\log \frac{10000 - 9500 + 0.5}{9500 + 0.5} = \log \frac{500.5}{9500.5} \approx -2.94\) (very low; common terms can even go negative, which is why BM25 implementations typically clamp IDF at 0, as the code later in this chapter does)

Term "quantum": appears in 50 documents
IDF: \(\log \frac{10000 - 50 + 0.5}{50 + 0.5} = \log \frac{9950.5}{50.5} \approx 5.3\) (very high, rare and informative)

When query contains "quantum computing", documents with "quantum" score much higher because it's a rare, informative term.
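These numbers can be checked directly with a short sketch that applies the IDF formula above (natural log):

# Verifying the IDF example values
import math

def idf(N: int, n_t: int) -> float:
    # BM25 IDF: log((N - n(t) + 0.5) / (n(t) + 0.5))
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

print(round(idf(10_000, 9_500), 2))  # -2.94 for the common term "the"
print(round(idf(10_000, 50), 2))     # 5.28 for the rare term "quantum"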

4. Cross-Encoder Reranking Score

\[\text{rerank\_score}(q, d) = \text{CrossEncoder}([q; d])\]
What This Represents:

A cross-encoder is a neural network model (typically BERT-based) that processes the query and document together in a single forward pass. The notation \([q; d]\) means the query and document are concatenated: [CLS] query [SEP] document [SEP].

Breaking It Down:
  • \(q\): Query text
  • \(d\): Document text
  • \([q; d]\): Concatenated input: query and document together
  • \(\text{CrossEncoder}\): Neural network model that processes the concatenated input
  • Output: Relevance score (typically 0-1 or similarity score)
Why Cross-Encoder is More Accurate:

Bi-encoder (embedding-based): Processes query and document separately, then compares embeddings. Fast but loses interaction information.

Cross-encoder: Processes query and document together, allowing attention mechanisms to identify which parts of the document are most relevant to the query. More accurate but slower.

Attention Mechanism:

The cross-encoder's attention mechanism allows the model to "look" at specific parts of the document when processing the query. For example, when the query is "capital of France", the model can focus attention on the part of the document that mentions "Paris" or "capital".

Performance Trade-off:
  • Accuracy: Cross-encoder typically achieves 10-30% better accuracy than bi-encoder
  • Speed: Cross-encoder is 10-100x slower because it must process each (query, document) pair separately
  • Use case: Rerank top-50 to top-200 candidates from initial retrieval (too slow for full collection)
Example:

Query: "How to train a neural network?"
Document: "Training deep learning models requires adjusting hyperparameters..."

Bi-encoder: Embed query and document separately, compare embeddings → score: 0.75

Cross-encoder: Process "[CLS] How to train a neural network? [SEP] Training deep learning models requires adjusting hyperparameters... [SEP]" → score: 0.92

✅ Cross-encoder sees the full interaction and understands that "neural network" and "deep learning models" are related in this context.
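A sketch contrasting the two scoring paths with the sentence-transformers library. The model names are illustrative, and the actual scores will differ from the numbers above (the ms-marco cross-encoder, for instance, returns unbounded logits rather than a 0-1 score).

# Bi-encoder vs cross-encoder scoring for the same (query, document) pair
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How to train a neural network?"
doc = "Training deep learning models requires adjusting hyperparameters..."

# Bi-encoder: embed separately, then compare embeddings
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
q_vec, d_vec = bi_encoder.encode([query, doc])
bi_score = util.cos_sim(q_vec, d_vec).item()

# Cross-encoder: score the concatenated pair jointly
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"bi-encoder cosine: {bi_score:.3f}   cross-encoder score: {cross_score:.3f}")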

5. Normalized Hybrid Score

\[\text{hybrid\_score}(q, d) = \alpha \cdot \frac{\text{cosine}(E(q), E(d)) - \min_{\text{cosine}}}{\max_{\text{cosine}} - \min_{\text{cosine}}} + (1 - \alpha) \cdot \frac{\text{BM25}(q, d) - \min_{\text{BM25}}}{\max_{\text{BM25}} - \min_{\text{BM25}}}\]
What This Does:

Before combining dense and sparse scores, they should be normalized to the same scale (typically [0, 1]). This ensures that one method doesn't dominate simply because its scores are larger.

Breaking It Down:
  • Min-max normalization: \(\frac{x - \min}{\max - \min}\) scales values to [0, 1] range
  • \(\min_{\text{cosine}}, \max_{\text{cosine}}\): Minimum and maximum cosine similarity scores across all documents for this query
  • \(\min_{\text{BM25}}, \max_{\text{BM25}}\): Minimum and maximum BM25 scores across all documents for this query
  • After normalization, both scores are in [0, 1] range, making them comparable
Why Normalization is Important:

Without normalization, if BM25 scores range from 0-100 and cosine scores range from 0-1, the BM25 component would dominate the hybrid score regardless of the \(\alpha\) value. Normalization ensures both components contribute proportionally.

Example:

For a query, you get:

  • Cosine scores: [0.65, 0.72, 0.58, 0.80] → normalized: [0.32, 0.64, 0.0, 1.0]
  • BM25 scores: [12.5, 45.3, 8.2, 67.8] → normalized: [0.07, 0.62, 0.0, 1.0]

Now both are on the same [0, 1] scale and can be meaningfully combined with \(\alpha = 0.7\).

Detailed Examples

Step-by-Step Examples

Example: Hybrid Search

Query: "Python machine learning library"

Vector search: Finds documents semantically similar (e.g., "scikit-learn", "TensorFlow")

Keyword search: Finds documents with exact terms ("Python", "machine learning", "library")

Hybrid: Combines both, retrieves documents that are both semantically relevant AND contain keywords

Result: Better retrieval quality than either method alone.

Example: Reranking

Initial retrieval: Top-100 documents from vector search

Reranking: Use cross-encoder to score each (query, document) pair

Result: Top-5 most relevant documents after reranking

Benefit: Cross-encoder sees full query and document, more accurate than embedding similarity alone.

Implementation

Implementation Overview

This section provides practical Python code examples for implementing retrieval strategies in RAG systems. The examples demonstrate dense retrieval, sparse retrieval (BM25), hybrid retrieval, and reranking with cross-encoders. These implementations form the core of production RAG systems.

1. Complete Hybrid Retrieval System

What this does: Implements a production-ready hybrid retrieval system that combines dense (semantic) and sparse (keyword) retrieval with proper score normalization and weighting.

from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Dict

class HybridRetriever:
    """
    Production-ready hybrid retrieval system.
    
    Combines dense (semantic) and sparse (keyword) retrieval
    for optimal results in RAG systems.
    """
    
    def __init__(self, documents: List[str], alpha: float = 0.7):
        """
        Initialize hybrid retriever.
        
        Args:
            documents: List of document strings to index
            alpha: Weight for dense (semantic) retrieval; (1 - alpha) weights sparse.
                  alpha=0.7 means 70% dense, 30% sparse
        """
        if not documents:
            raise ValueError("Documents list cannot be empty")
        
        self.documents = documents
        self.alpha = alpha
        
        # Initialize dense retriever (embedding-based)
        print("Initializing embedding model...")
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # Normalize embeddings so the dot product used later equals cosine similarity
        self.embeddings = self.embedder.encode(documents, show_progress_bar=True,
                                               normalize_embeddings=True)
        print(f"Generated {len(self.embeddings)} embeddings (dimension: {self.embeddings.shape[1]})")
        
        # Initialize sparse retriever (BM25)
        print("Building BM25 index...")
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        self.bm25 = BM25Okapi(tokenized_docs)
        print("BM25 index built")
    
    def _tokenize(self, text: str) -> List[str]:
        """Tokenize text for BM25 (simple word splitting)."""
        # In production, use proper tokenization (NLTK, spaCy)
        return text.lower().split()
    
    def retrieve(self, query: str, top_k: int = 5, normalize: bool = True) -> List[Dict]:
        """
        Retrieve documents using hybrid approach.
        
        Args:
            query: User query string
            top_k: Number of documents to retrieve
            normalize: Whether to normalize scores before combining
            
        Returns:
            List of dicts with 'document', 'dense_score', 'sparse_score', 'hybrid_score'
        """
        # Step 1: Dense retrieval (semantic search)
        query_embedding = self.embedder.encode([query], normalize_embeddings=True)
        # Dot product of normalized embeddings = cosine similarity
        dense_scores = np.dot(self.embeddings, query_embedding.T).flatten()
        
        # Step 2: Sparse retrieval (BM25 keyword search)
        tokenized_query = self._tokenize(query)
        sparse_scores = self.bm25.get_scores(tokenized_query)
        
        # Step 3: Normalize scores to [0, 1] range for fair combination
        if normalize:
            dense_scores = self._normalize(dense_scores)
            sparse_scores = self._normalize(sparse_scores)
        
        # Step 4: Combine scores
        # alpha * dense + (1 - alpha) * sparse  (alpha weights the semantic score,
        # matching the hybrid formula earlier in this chapter)
        hybrid_scores = self.alpha * dense_scores + (1 - self.alpha) * sparse_scores
        
        # Step 5: Get top-k
        top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]
        
        # Format results
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'dense_score': float(dense_scores[idx]),
                'sparse_score': float(sparse_scores[idx]),
                'hybrid_score': float(hybrid_scores[idx]),
                'rank': len(results) + 1
            })
        
        return results
    
    def _normalize(self, scores: np.ndarray) -> np.ndarray:
        """Normalize scores to [0, 1] range using min-max normalization."""
        min_score = scores.min()
        max_score = scores.max()
        if max_score - min_score < 1e-8:  # Avoid division by zero
            return np.ones_like(scores)
        return (scores - min_score) / (max_score - min_score)

# Example usage
documents = [
    "Python machine learning library scikit-learn provides tools for data analysis",
    "TensorFlow is a deep learning framework for neural networks",
    "Machine learning algorithms learn patterns from data automatically",
    "Natural language processing helps computers understand human language",
    "Python programming language is popular for data science"
]

retriever = HybridRetriever(documents, alpha=0.6)  # 60% dense, 40% sparse

# Query
query = "Python machine learning"
results = retriever.retrieve(query, top_k=3)

print(f"\nQuery: '{query}'")
print(f"\nTop-3 Results:")
for result in results:
    print(f"\nRank {result['rank']}:")
    print(f"  Document: {result['document']}")
    print(f"  Dense score: {result['dense_score']:.3f}")
    print(f"  Sparse score: {result['sparse_score']:.3f}")
    print(f"  Hybrid score: {result['hybrid_score']:.3f}")

Key Points:
  • Score normalization: Critical for combining dense and sparse scores fairly
  • Alpha parameter: Tune alpha (0.3-0.7) based on your use case
  • Performance: Hybrid is slightly slower than either method alone, but provides better results
  • Production use: This is the standard approach in most production RAG systems

2. BM25 Implementation from Scratch

What this does: Implements BM25 algorithm from scratch to understand how it works. For production, use libraries like `rank-bm25`, but this shows the math.

import math
from collections import Counter
from typing import List, Dict

class BM25:
    """
    BM25 (Best Matching 25) ranking function implementation.
    
    BM25 is a probabilistic ranking function that scores documents
    based on term frequency, inverse document frequency, and document length.
    """
    
    def __init__(self, documents: List[str], k1: float = 1.2, b: float = 0.75):
        """
        Initialize BM25 index.
        
        Args:
            documents: List of document strings
            k1: Term frequency saturation parameter (typically 1.2)
            b: Length normalization parameter (typically 0.75)
        """
        self.k1 = k1
        self.b = b
        self.documents = documents
        
        # Tokenize all documents
        self.tokenized_docs = [self._tokenize(doc) for doc in documents]
        
        # Calculate document frequencies and average document length
        self.doc_freqs = self._calculate_document_frequencies()
        self.avg_doc_length = sum(len(doc) for doc in self.tokenized_docs) / len(documents)
        self.total_docs = len(documents)
    
    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization (in production, use proper tokenizer)."""
        return text.lower().split()
    
    def _calculate_document_frequencies(self) -> Dict[str, int]:
        """Calculate how many documents contain each term."""
        doc_freqs = Counter()
        for doc_tokens in self.tokenized_docs:
            unique_terms = set(doc_tokens)
            for term in unique_terms:
                doc_freqs[term] += 1
        return doc_freqs
    
    def _idf(self, term: str) -> float:
        """
        Calculate Inverse Document Frequency (IDF) for a term.
        
        Formula: log((N - n(t) + 0.5) / (n(t) + 0.5))
        where N = total docs, n(t) = docs containing term
        """
        if term not in self.doc_freqs:
            return 0.0
        
        n_t = self.doc_freqs[term]  # Number of documents containing term
        # BM25 IDF formula
        idf = math.log((self.total_docs - n_t + 0.5) / (n_t + 0.5))
        return max(0.0, idf)  # Ensure non-negative
    
    def score(self, query: str, doc_index: int) -> float:
        """
        Calculate BM25 score for a query-document pair.
        
        Args:
            query: Query string
            doc_index: Index of document in documents list
            
        Returns:
            BM25 score
        """
        query_terms = self._tokenize(query)
        doc_tokens = self.tokenized_docs[doc_index]
        doc_length = len(doc_tokens)
        
        score = 0.0
        
        for term in query_terms:
            # Term frequency in document
            tf = doc_tokens.count(term)
            
            if tf == 0:
                continue  # Term not in document, skip
            
            # IDF component
            idf = self._idf(term)
            
            # Term frequency saturation component
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * (doc_length / self.avg_doc_length))
            
            # BM25 score for this term
            term_score = idf * (numerator / denominator)
            score += term_score
        
        return score
    
    def get_scores(self, query: str) -> List[float]:
        """Get BM25 scores for query against all documents."""
        return [self.score(query, i) for i in range(len(self.documents))]
    
    def get_top_k(self, query: str, k: int = 5) -> List[tuple]:
        """
        Get top-k documents for query.
        
        Returns:
            List of (document_index, score) tuples, sorted by score descending
        """
        scores = self.get_scores(query)
        # Get indices sorted by score (descending)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [(i, scores[i]) for i in top_indices]

# Example usage
documents = [
    "machine learning algorithms learn from data",
    "deep learning uses neural networks",
    "machine learning is a subset of artificial intelligence",
    "neural networks are used in deep learning"
]

bm25 = BM25(documents)

query = "machine learning"
scores = bm25.get_scores(query)
print(f"BM25 scores for query '{query}':")
for i, (doc, score) in enumerate(zip(documents, scores)):
    print(f"  Doc {i+1}: {score:.3f} - {doc}")

top_k = bm25.get_top_k(query, k=2)
print(f"\nTop-2 documents:")
for doc_idx, score in top_k:
    print(f"  Score: {score:.3f}")
    print(f"  Document: {documents[doc_idx]}\n")

Key Points:
  • IDF calculation: Rare terms get higher IDF (more informative)
  • Term frequency saturation: Prevents excessive term repetition from dominating
  • Length normalization: Prevents long documents from always ranking highest
  • Parameters: k1=1.2, b=0.75 are standard values, but can be tuned

3. Cross-Encoder Reranking Implementation

What this does: Implements reranking using cross-encoder models to refine initial retrieval results for better accuracy.

from sentence_transformers import CrossEncoder
import numpy as np
from typing import List, Dict

class Reranker:
    """
    Cross-encoder based reranker for improving retrieval accuracy.
    
    Takes top-k candidates from initial retrieval and reranks them
    using a more sophisticated model that sees query-document interaction.
    """
    
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        """
        Initialize reranker with cross-encoder model.
        
        Args:
            model_name: Pre-trained cross-encoder model name
        """
        print(f"Loading cross-encoder model: {model_name}...")
        self.model = CrossEncoder(model_name)
        print("Model loaded")
    
    def rerank(self, query: str, documents: List[str], top_k: int = 3) -> List[Dict]:
        """
        Rerank documents for a query.
        
        Args:
            query: User query
            documents: List of candidate documents to rerank
            top_k: Number of top documents to return
            
        Returns:
            List of dicts with 'document', 'score', 'rank'
        """
        if not documents:
            return []
        
        # Create query-document pairs
        # Format: [query, document] for each document
        pairs = [[query, doc] for doc in documents]
        
        # Get relevance scores from cross-encoder
        # Higher score = more relevant
        scores = self.model.predict(pairs)
        
        # Sort by score (descending) and get top-k
        ranked_indices = np.argsort(scores)[::-1][:top_k]
        
        # Format results
        results = []
        for rank, idx in enumerate(ranked_indices, 1):
            results.append({
                'document': documents[idx],
                'score': float(scores[idx]),
                'rank': rank
            })
        
        return results
    
    def rerank_batch(self, queries: List[str], documents: List[str], top_k: int = 3) -> List[List[Dict]]:
        """
        Rerank documents for multiple queries (batch processing).
        
        Args:
            queries: List of queries
            documents: List of candidate documents (same for all queries)
            top_k: Number of top documents per query
            
        Returns:
            List of results (one list per query)
        """
        all_results = []
        for query in queries:
            results = self.rerank(query, documents, top_k)
            all_results.append(results)
        return all_results

# Example: Two-stage retrieval pipeline
class TwoStageRetriever:
    """
    Complete two-stage retrieval: fast initial retrieval + accurate reranking.
    
    This is the standard production approach:
    1. Fast initial retrieval (dense/sparse/hybrid) gets top-100 candidates
    2. Reranker refines to top-5 most relevant
    """
    
    def __init__(self, initial_retriever, reranker):
        self.initial_retriever = initial_retriever  # HybridRetriever or similar
        self.reranker = reranker
    
    def retrieve(self, query: str, initial_k: int = 100, final_k: int = 5):
        """
        Two-stage retrieval pipeline.
        
        Args:
            query: User query
            initial_k: Number of candidates from initial retrieval
            final_k: Number of final results after reranking
        """
        # Stage 1: Fast initial retrieval
        print(f"Stage 1: Initial retrieval (top-{initial_k})...")
        initial_results = self.initial_retriever.retrieve(query, top_k=initial_k)
        initial_docs = [r['document'] for r in initial_results]
        
        # Stage 2: Rerank top candidates
        print(f"Stage 2: Reranking top-{initial_k} to get top-{final_k}...")
        reranked = self.reranker.rerank(query, initial_docs, top_k=final_k)
        
        return reranked

# Example usage
# Initialize components (reuses `documents` and HybridRetriever from the earlier example)
initial_retriever = HybridRetriever(documents, alpha=0.6)
reranker = Reranker()

# Two-stage retrieval
two_stage = TwoStageRetriever(initial_retriever, reranker)

query = "Python machine learning"
final_results = two_stage.retrieve(query, initial_k=10, final_k=3)

print(f"\nFinal top-3 after reranking:")
for result in final_results:
    print(f"Rank {result['rank']}: {result['document']} (score: {result['score']:.3f})")

Key Points:
  • Two-stage approach: Fast initial retrieval (100-200 docs) → accurate reranking (top 5-10)
  • Accuracy improvement: Reranking typically improves precision@5 by 10-30%
  • Latency trade-off: Adds 50-200ms but significantly improves quality
  • Production standard: Most production RAG systems use this approach

Installation Requirements

Install the required packages:

pip install sentence-transformers rank-bm25 numpy

Note: For production, consider using vector databases (Pinecone, Chroma) for initial retrieval instead of in-memory implementations shown here.

Real-World Applications

Where This Is Used

Retrieval Strategies in Practice

When to use vector search: Semantic similarity is important, synonyms matter, multilingual content

When to use keyword search: Exact term matching needed, domain-specific terminology

When to use hybrid: Best of both worlds, production systems often use this

When to use reranking: Need highest accuracy, can afford extra latency

Performance Trade-offs

Vector search: Fast, good semantic understanding, but may miss exact matches

Keyword search: Very fast, exact matches, but misses synonyms

Hybrid: Balanced, combines strengths, slightly slower

Reranking: Most accurate, but slowest (processes each candidate)
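
Measuring these trade-offs requires the retrieval metrics mentioned earlier (precision@k, recall@k). A minimal sketch with hypothetical document IDs and relevance labels:

# precision@k and recall@k from a ranked result list and labeled relevant docs
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / len(relevant)

retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]   # system ranking (hypothetical)
relevant = {"doc1", "doc2", "doc5"}                    # labeled relevant docs (hypothetical)

print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ≈ 0.667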

Test Your Understanding

Question 1: What are the main retrieval strategies in RAG?

A) Only keyword search
B) Retrieval in RAG systems relies solely on semantic embeddings without any keyword-based methods, which provides good results for all use cases
C) Dense retrieval (semantic search), sparse retrieval (keyword/BM25), and hybrid retrieval (combining both)
D) Random or sequential document selection might seem straightforward, but effective RAG retrieval requires similarity-based ranking using embeddings or keyword scores to find the most relevant documents, not arbitrary selection methods

Question 2: Interview question: "What is the difference between dense and sparse retrieval?"

A) Dense uses embeddings for semantic similarity (handles synonyms, paraphrasing). Sparse uses keyword matching (BM25, TF-IDF) for exact term matching. Dense is better for meaning, sparse for keywords
B) While keyword-only search is fast and effective for exact term matching, RAG systems benefit from combining it with semantic search to handle synonyms and paraphrasing, making pure keyword search insufficient for comprehensive retrieval
C) The primary retrieval approach is to randomly select documents from the knowledge base, which ensures fair distribution of results
D) Sequential search

Question 3: What is hybrid retrieval and why is it effective?

A) Combining dense and sparse retrieval scores, leveraging semantic understanding and keyword matching together for better results than either alone
B) Only keyword search
C) Retrieval strategies focus on sequential scanning of all documents in order, which guarantees comprehensive coverage of the knowledge base
D) Random or sequential document selection might seem straightforward, but effective RAG retrieval requires similarity-based ranking using embeddings or keyword scores to find the most relevant documents, not arbitrary selection methods

Question 4: In the hybrid score formula \(\text{score}(q, d) = \alpha \times \text{cosine}(E(q), E(d)) + (1 - \alpha) \times \text{BM25}(q, d)\), what does \(\alpha\) control?

A) The weight between dense (cosine similarity) and sparse (BM25) scores. Higher α emphasizes semantics, lower α emphasizes keywords
B) Random selection
C) The primary retrieval approach is to randomly select documents from the knowledge base, which ensures fair distribution of results
D) While keyword-only search is fast and effective for exact term matching, RAG systems benefit from combining it with semantic search to handle synonyms and paraphrasing, making pure keyword search insufficient for comprehensive retrieval

Question 5: Interview question: "What is BM25 and how does it work?"

A) Sequential search
B) The main retrieval strategies involve only using keyword matching without any semantic understanding, which works well for exact term searches
C) Best Matching 25 - a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. Better than TF-IDF for retrieval
D) Random or sequential document selection might seem straightforward, but effective RAG retrieval requires similarity-based ranking using embeddings or keyword scores to find the most relevant documents, not arbitrary selection methods

Question 6: What is reranking in retrieval?

A) Re-scoring initial retrieval results using more accurate models (cross-encoders) to improve relevance ordering before passing to LLM
B) The main retrieval strategies involve only using keyword matching without any semantic understanding, which works well for exact term searches
C) Sequential search
D) Random or sequential document selection might seem straightforward, but effective RAG retrieval requires similarity-based ranking using embeddings or keyword scores to find the most relevant documents, not arbitrary selection methods

Question 7: Interview question: "When would you use dense vs sparse retrieval?"

A) Although semantic search alone provides good results for many queries, combining it with keyword-based methods like BM25 creates hybrid retrieval that captures both semantic meaning and exact term matches, which is more effective than either method alone
B) Dense for semantic queries, synonyms, paraphrasing. Sparse for exact keyword matching, technical terms, names. Hybrid for best of both
C) Only semantic search
D) Retrieval strategies focus on sequential scanning of all documents in order, which guarantees comprehensive coverage of the knowledge base

Question 8: What is a cross-encoder and how is it used in reranking?

A) Although semantic search alone provides good results for many queries, combining it with keyword-based methods like BM25 creates hybrid retrieval that captures both semantic meaning and exact term matches, which is more effective than either method alone
B) Only keyword search
C) A model that processes query and document together (not separately like bi-encoder), providing more accurate relevance scores but slower. Used for reranking top-k results
D) Retrieval in RAG systems relies solely on semantic embeddings without any keyword-based methods, which provides good results for all use cases

Question 9: Interview question: "How do you choose top-k for retrieval?"

A) Retrieval strategies focus on sequential scanning of all documents in order, which guarantees comprehensive coverage of the knowledge base
B) Although semantic search alone provides good results for many queries, combining it with keyword-based methods like BM25 creates hybrid retrieval that captures both semantic meaning and exact term matches, which is more effective than either method alone
C) Only semantic search
D) Balance context quality (more docs = more info) vs cost/latency (fewer = faster, cheaper). Typical: 3-10. Test on your data. Use reranking if retrieving more initially

Question 10: What is the typical value of α in hybrid retrieval?

A) Only semantic search
B) Random or sequential document selection might seem straightforward, but effective RAG retrieval requires similarity-based ranking using embeddings or keyword scores to find the most relevant documents, not arbitrary selection methods
C) The main retrieval strategies involve only using keyword matching without any semantic understanding, which works well for exact term searches
D) 0.3-0.7, often around 0.5. Tune based on your data - queries that depend on meaning and synonyms benefit from higher α (more dense), keyword-heavy queries from lower α (more sparse)

Question 11: Interview question: "How do you evaluate retrieval quality?"

A) Retrieval strategies focus on sequential scanning of all documents in order, which guarantees comprehensive coverage of the knowledge base
B) Use metrics like precision@k, recall@k, MRR (Mean Reciprocal Rank), NDCG. Test on labeled query-document pairs. Measure downstream RAG answer quality
C) While keyword-only search is fast and effective for exact term matching, RAG systems benefit from combining it with semantic search to handle synonyms and paraphrasing, making pure keyword search insufficient for comprehensive retrieval
D) Random selection

Question 12: What is the trade-off between retrieval speed and accuracy?

A) The main retrieval strategies involve only using keyword matching without any semantic understanding, which works well for exact term searches
B) While keyword-only search is fast and effective for exact term matching, RAG systems benefit from combining it with semantic search to handle synonyms and paraphrasing, making pure keyword search insufficient for comprehensive retrieval
C) More accurate methods (reranking, exact search) are slower. Approximate search (ANN) is faster but less accurate. Balance based on latency requirements and quality needs
D) Only keyword search