Chapter 5: Retrieval Strategies
Finding Relevant Information
Learning Objectives
- Understand the fundamentals of dense, sparse, and hybrid retrieval
- Master the mathematical foundations (BM25, IDF, hybrid scoring, reranking)
- Learn practical implementation in Python
- Apply the concepts through worked examples
- Recognize real-world applications and trade-offs
Retrieval Strategies
The Heart of RAG: Finding the Right Documents
Retrieval is arguably the most critical component of a RAG system. If you retrieve irrelevant documents, even the best LLM can't generate a good answer. Retrieval strategies determine how you find the most relevant documents (or chunks) for a user's query from your knowledge base.
The retrieval challenge: Users ask questions in natural language, but your knowledge base contains documents written in different styles, using different terminology, and covering various topics. A query like "How do I train a neural network?" needs to find documents about "neural network training," "model training," "deep learning training," or "backpropagation" - even if those exact phrases don't appear in the documents.
Two Fundamental Approaches
There are two fundamentally different ways to find relevant documents:
- Dense Retrieval (Semantic Search): Uses vector embeddings to find documents with similar meaning, even if they use different words. Handles synonyms, paraphrasing, and conceptual queries well.
- Sparse Retrieval (Keyword Search): Uses keyword matching (BM25, TF-IDF) to find documents containing specific terms. Fast and good for exact term matching, but misses semantic relationships.
Hybrid Retrieval combines both approaches, often achieving better results than either method alone by leveraging the strengths of both semantic understanding and exact keyword matching.
Example: Why Retrieval Strategy Matters
Query: "Python machine learning library"
Dense retrieval finds: Documents about "scikit-learn," "TensorFlow," "PyTorch" (semantic similarity - these are ML libraries for Python)
Sparse retrieval finds: Documents containing exact phrase "Python machine learning library"
Hybrid retrieval finds: Both semantically similar libraries AND documents with exact phrase matches, providing comprehensive coverage
✅ Result: Hybrid retrieval gives you the best of both worlds - you don't miss relevant documents.
Key Concepts You'll Learn
- Dense vs Sparse Retrieval: Understanding when to use semantic search vs keyword search, and the trade-offs between them
- Hybrid Retrieval: Combining dense and sparse methods with weighted scoring for optimal results
- BM25 Algorithm: The most popular sparse retrieval algorithm - how it works and why it's better than simple TF-IDF
- Reranking: Using more sophisticated models (cross-encoders, LLMs) to refine initial retrieval results for better accuracy
- Top-k Selection: Choosing how many documents to retrieve - balancing context richness with computational cost
- Similarity Thresholds: Filtering out low-quality matches to ensure only relevant documents are used
- Retrieval Evaluation: Measuring retrieval quality using precision@k, recall@k, and other metrics
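The evaluation metrics in the last bullet are not implemented later in this chapter, so here is a minimal sketch of precision@k and recall@k; the document IDs and relevance labels are purely illustrative.
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked IDs from a retriever (illustrative)
relevant = {"d1", "d3", "d8"}                # ground-truth relevant IDs (illustrative)
print(precision_at_k(retrieved, relevant, k=5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2/3 ≈ 0.67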
Why this matters: Retrieval quality directly determines RAG system performance. Poor retrieval means the LLM receives irrelevant context, leading to inaccurate or hallucinated answers. Good retrieval ensures the LLM has the right information to generate accurate, grounded responses. Most RAG system improvements come from better retrieval, not just better LLMs.
Key Concepts
Retrieval Strategies: Dense, Sparse, and Hybrid Approaches
Retrieval is the heart of RAG systems—it's the process of finding the most relevant documents for a user's query. There are three main approaches, each with distinct characteristics and use cases.
1. Dense Retrieval (Semantic Search)
What it is: Dense retrieval uses vector embeddings to represent both queries and documents in a high-dimensional space, then finds documents whose embeddings are "close" to the query embedding using similarity metrics like cosine similarity.
How it works:
- Convert the query into a dense vector using an embedding model (e.g., SentenceTransformer, OpenAI embeddings)
- Compare this query vector with all document vectors in the vector database
- Calculate similarity scores (cosine similarity, dot product, or Euclidean distance)
- Return the top-k documents with highest similarity scores
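To make these steps concrete, here is a minimal dense-retrieval sketch using the sentence-transformers library and the all-MiniLM-L6-v2 model that the implementation section uses later; the documents and query are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "Neural networks are trained using backpropagation",
    "Training deep learning models requires adjusting hyperparameters",
    "Bake the cake at 180 degrees for 40 minutes",
]

# Steps 1-2: embed the documents and the query
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode("How do I train a neural network?", convert_to_tensor=True)

# Step 3: cosine similarity between the query and every document
scores = util.cos_sim(query_emb, doc_emb)[0]

# Step 4: return the top-k documents with the highest similarity
for idx in scores.argsort(descending=True)[:2].tolist():
    print(f"{scores[idx].item():.3f}  {docs[idx]}")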
Key Advantages:
- ✅ Semantic understanding: Understands meaning, not just keywords. A query about "automobile" will find documents about "car," "vehicle," or "auto" even if those exact words aren't present.
- ✅ Handles synonyms and paraphrasing: "Machine learning" and "ML" are treated as similar concepts. "How do I train a model?" finds documents about "model training" even if they use different wording.
- ✅ Context-aware: Understands that "Python programming" is different from "python snake" based on context.
- ✅ Language-agnostic: Works across languages if using multilingual embeddings (e.g., multilingual-MiniLM)
- ✅ Most common in modern RAG: This is the default approach for most production RAG systems
Limitations:
- ❌ More computationally expensive: Requires embedding models and vector similarity calculations
- ❌ May miss exact keyword matches: If a document uses very specific terminology not well-represented in the embedding model's training data, it might be missed
- ❌ Requires good embeddings: Quality depends heavily on the embedding model used
- ❌ Storage overhead: Need to store vector embeddings for all documents
Example: Dense Retrieval in Action
Query: "How do I train a neural network?"
Document 1: "Training deep learning models requires adjusting hyperparameters like learning rate..." (No exact match for "neural network")
Document 2: "Neural networks are trained using backpropagation..." (Exact match)
Result: Dense retrieval finds BOTH documents because it understands that "neural network" and "deep learning model" are semantically similar concepts, even though Document 1 doesn't contain the exact phrase.
2. Sparse Retrieval (Keyword Search)
What it is: Sparse retrieval uses keyword matching based on term frequency and document frequency. The most common algorithms are BM25 (Best Matching 25) and TF-IDF (Term Frequency-Inverse Document Frequency).
How it works:
- Extract keywords from the query (tokenization, stemming, stop word removal)
- Count how often each keyword appears in each document (term frequency)
- Weight keywords by their rarity across all documents (inverse document frequency - rare words are more important)
- Calculate a relevance score for each document based on keyword matches
- Return the top-k documents with highest scores (see the sketch below)
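As a sketch of these steps, the rank-bm25 library (also used in the implementation section) builds the index and scores documents; tokenization here is plain lowercased whitespace splitting, so there is no stemming or stop-word removal.
from rank_bm25 import BM25Okapi

docs = [
    "machine learning algorithms learn from data",
    "deep learning uses neural networks",
    "machine learning is a subset of artificial intelligence",
]

# Steps 1-3: tokenize the corpus and build the keyword index (term/document frequencies)
bm25 = BM25Okapi([doc.lower().split() for doc in docs])

# Steps 4-5: score every document for the query and take the top-k
scores = bm25.get_scores("machine learning".lower().split())
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
for i in top_k:
    print(f"{scores[i]:.3f}  {docs[i]}")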
Key Advantages:
- ✅ Fast: Very efficient, especially with inverted indexes (the traditional search-engine data structure used by engines such as Elasticsearch and Lucene)
- ✅ Good for exact matches: Excellent when you need to find documents containing specific terms or phrases
- ✅ Interpretable: You can see exactly which keywords matched and why a document was retrieved
- ✅ No embedding model needed: Works with raw text, no neural networks required
- ✅ Low storage: Only stores keyword indexes, not full vectors
Limitations:
- ❌ No semantic understanding: "automobile" won't find "car" unless both terms are present in the document
- ❌ Requires exact term matches: Misses synonyms, paraphrases, and related concepts
- ❌ Language-specific: Needs language-specific tokenization and stemming
- ❌ Poor for conceptual queries: Struggles with queries like "things that help you learn" (no specific keywords)
Example: Sparse Retrieval Limitation
Query: "automobile safety features"
Document: "Modern cars have advanced safety systems including airbags and ABS brakes."
Result: ❌ Sparse retrieval ranks this document poorly or misses it entirely because it doesn't contain the term "automobile" or the phrase "safety features" (only "safety" matches), even though the document is highly relevant (it mentions "cars" and "safety systems").
3. Hybrid Retrieval: Best of Both Worlds
What it is: Hybrid retrieval combines dense and sparse retrieval, taking advantage of both approaches' strengths. This is often the best approach for production RAG systems.
How it works:
- Run both dense retrieval and sparse retrieval separately
- Get top-k results from each method (e.g., top 50 from each)
- Combine the results (union or intersection of result sets)
- Rerank the combined results using a weighted score:
  final_score = α × dense_score + (1 - α) × sparse_score, where α is typically 0.3-0.7
- Return the top-k documents from the reranked list (a minimal sketch of the scoring step follows this list)
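Here is that sketch of the score-fusion step, assuming the dense and sparse scores for the same candidate documents have already been computed and normalized to [0, 1] (normalization is covered under Mathematical Formulations); the score values are illustrative.
import numpy as np

def fuse_scores(dense: np.ndarray, sparse: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """final_score = alpha * dense_score + (1 - alpha) * sparse_score"""
    return alpha * dense + (1 - alpha) * sparse

# Normalized scores for four candidate documents (illustrative values)
dense = np.array([0.85, 0.80, 0.30, 0.10])
sparse = np.array([0.60, 0.95, 0.20, 0.70])

combined = fuse_scores(dense, sparse, alpha=0.7)
top_k = np.argsort(combined)[::-1][:2]   # indices of the top-2 documents
print(combined.round(3), top_k)          # [0.775 0.845 0.27  0.28 ] [1 0]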
Why Hybrid Works Better:
- ✅ Catches both semantic and exact matches: Dense finds semantically similar docs, sparse finds exact keyword matches
- ✅ More comprehensive: Less likely to miss relevant documents
- ✅ Better for diverse queries: Some queries benefit from semantic search (conceptual questions), others from keyword search (specific term lookups)
- ✅ Production-proven: Widely used in production search and RAG systems (e.g., Elasticsearch supports combining BM25 scoring with vector similarity)
Example: Hybrid Retrieval
Query: "Python machine learning library"
Dense retrieval finds: Documents about "scikit-learn," "TensorFlow," "PyTorch" (semantic similarity - these are ML libraries for Python)
Sparse retrieval finds: Documents containing exact phrase "Python machine learning library"
Hybrid result: Combines both, ensuring you get both semantically similar libraries AND documents that explicitly mention the exact phrase. This gives you comprehensive coverage.
When to use:
- ✅ When you have diverse query types (some need semantic understanding, others need exact matches)
- ✅ When retrieval quality is critical and you can afford the computational cost
- ✅ For production systems where you want maximum coverage and accuracy
Reranking: Refining Initial Retrieval Results
Why Reranking is Necessary
Initial retrieval (whether dense, sparse, or hybrid) is fast but not perfect. It uses relatively simple similarity calculations that may not capture the nuanced relationship between a query and a document. Reranking is a second-stage refinement that improves relevance by using more sophisticated (but slower) models.
The Problem with Initial Retrieval:
- ❌ Embedding similarity is approximate: Two documents might have similar embeddings but different relevance to a specific query. For example, two documents about "Python" might be similar (both about programming), but one might be about "Python for data science" and the other about "Python web development" - only one is relevant to a query about "data science."
- ❌ Keyword matching is surface-level: A document might contain all query keywords but not actually answer the question. For example, a document might mention "machine learning" and "tutorial" but be about a different topic entirely.
- ❌ Context matters: The same words in different contexts have different meanings. "Bank" could mean financial institution or river bank.
- ❌ Query intent: Initial retrieval doesn't deeply understand what the user is really asking. A query "how to install" might retrieve documents about installation in general, but the user might specifically want installation on Windows.
The Solution: Reranking
Reranking takes the top-k candidates from initial retrieval (e.g., top 100) and uses a more sophisticated model to re-score and reorder them, producing a final top-k list (e.g., top 5-10) that better matches query intent.
Reranking Methods
1. Cross-Encoder Reranking
What it is: A neural network model (typically BERT-based) that processes the query and document together in a single forward pass, allowing the model to see the full interaction between query and document through attention mechanisms.
How it works:
- Input: [CLS] query [SEP] document [SEP] (the query and document concatenated)
- The model processes both simultaneously, allowing attention mechanisms to identify which parts of the document are most relevant to the query
- Output: A relevance score (typically 0-1 or a similarity score)
- Documents are reranked by these scores
Advantages:
- ✅ More accurate: Significantly better than embedding-based similarity because it sees the full query-document interaction
- ✅ Context-aware: Understands how query and document relate in context
- ✅ Better for nuanced queries: Handles complex questions that require deep understanding
- ✅ Production-ready models: Pre-trained models available (e.g., ms-marco-MiniLM, cross-encoder models from SentenceTransformers)
Disadvantages:
- ❌ Slower: Must process each query-document pair separately (can't batch efficiently for different queries)
- ❌ More expensive: Requires running a neural network for each candidate
- ❌ Limited to top-k: Too slow to use on entire document collection (typically used on top 50-200 candidates)
When to use: When you have a small candidate set (50-200 documents) from initial retrieval and want maximum accuracy. This is the most common reranking approach in production RAG systems.
2. LLM-Based Reranking
What it is: Uses a large language model (like GPT-4, Claude) to score or rank documents based on their relevance to the query.
How it works:
- For each candidate document, create a prompt asking the LLM to score relevance
- Example prompt: "Rate how relevant this document is to the query 'X' on a scale of 1-10. Document: [document text]"
- Use LLM's output to rerank documents
- Alternatively, ask LLM to directly rank documents in order of relevance
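A hedged sketch of pointwise LLM reranking following these steps; llm_complete is a placeholder for whichever chat/completions client you use (it is not a real library call), and the 1-10 prompt mirrors the example above.
from typing import Callable, List, Tuple

def llm_rerank(query: str, docs: List[str],
               llm_complete: Callable[[str], str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Ask the LLM for a 1-10 relevance score per document, then sort by score."""
    scored = []
    for doc in docs:
        prompt = (
            f"Rate how relevant this document is to the query '{query}' "
            f"on a scale of 1-10. Reply with only the number.\nDocument: {doc}"
        )
        try:
            score = float(llm_complete(prompt).strip())
        except ValueError:
            score = 0.0  # LLM output can be non-numeric/non-deterministic; fail soft
        scored.append((doc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]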
Advantages:
- ✅ Most accurate: LLMs have deep understanding of language and context
- ✅ Handles complex queries: Can understand nuanced questions and multi-part queries
- ✅ Flexible: Can be prompted to consider specific factors (relevance, completeness, accuracy)
Disadvantages:
- ❌ Very slow: LLM inference is much slower than specialized reranking models
- ❌ Expensive: API costs can be high if reranking many documents
- ❌ Inconsistent: LLM outputs can be non-deterministic
- ❌ Limited scalability: Not practical for high-volume systems
When to use: For high-value queries where accuracy is critical and you can afford the latency and cost. Often used in research or low-volume production systems.
Reranking Best Practices
- Use two-stage retrieval: Fast initial retrieval (dense/sparse) → slower reranking on top-k candidates
- Typical pipeline: Retrieve top 100-200 → Rerank to top 5-10
- Balance accuracy vs speed: More accurate reranking is slower; choose based on your latency requirements
- Cache reranking results: For common queries, cache reranked results to avoid recomputation
- Monitor performance: Track reranking accuracy, latency, and cost
Mathematical Formulations
Retrieval Scoring Formulas Overview
Retrieval strategies use mathematical formulas to score and rank documents. Understanding these formulas helps you choose the right retrieval method, tune parameters, and optimize performance. This section covers the key formulas used in dense, sparse, hybrid, and reranking approaches.
1. Hybrid Retrieval Score
What This Formula Combines:
Hybrid retrieval combines dense (semantic) and sparse (keyword) retrieval scores using a weighted average:
\[ \text{score}(q, d) = \alpha \times \text{cosine}(E(q), E(d)) + (1 - \alpha) \times \text{BM25}(q, d) \]
This leverages both semantic understanding and exact keyword matching for better results than either method alone.
Breaking It Down:
- \(\text{cosine}(E(q), E(d))\): Dense retrieval score - cosine similarity between query embedding \(E(q)\) and document embedding \(E(d)\). Range: [-1, 1], typically [0, 1] for normalized embeddings.
- \(\text{BM25}(q, d)\): Sparse retrieval score - BM25 keyword matching score. Range: [0, ∞), but typically normalized to [0, 1] before combination.
- \(\alpha\): Weighting factor controlling the balance between dense and sparse scores. Range: [0, 1]
- \((1 - \alpha)\): Weight for sparse score (ensures weights sum to 1)
Choosing \(\alpha\):
- \(\alpha = 0.7\) (70% dense, 30% sparse): Emphasizes semantic similarity. Good when synonyms and paraphrasing are important.
- \(\alpha = 0.5\) (50/50): Balanced approach, most common in production systems
- \(\alpha = 0.3\) (30% dense, 70% sparse): Emphasizes exact keyword matching. Good when specific terminology is critical.
Why Hybrid Works Better:
Dense retrieval finds semantically similar documents (handles synonyms), while sparse retrieval finds documents with exact keyword matches. Combining both ensures you don't miss relevant documents that use different terminology OR exact phrases.
Example:
Query: "Python machine learning library"
Document 1: "scikit-learn is a Python ML library" (semantic match, no exact phrase)
Dense score: 0.85, BM25 score: 0.60
Hybrid score (α=0.7): \(0.7 \times 0.85 + 0.3 \times 0.60 = 0.775\)
Document 2: "Python machine learning library tutorial" (exact phrase match)
Dense score: 0.80, BM25 score: 0.95
Hybrid score (α=0.7): \(0.7 \times 0.80 + 0.3 \times 0.95 = 0.845\)
✅ Both documents are retrieved, with Document 2 ranked higher due to exact phrase match.
2. BM25 (Best Matching 25) Score
What This Formula Measures:
BM25 is a probabilistic ranking function that scores documents based on how well they match query keywords. It's an improvement over TF-IDF that handles document length normalization better and is widely used in search engines (including Elasticsearch, Solr).
\[ \text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)} \]
Breaking It Down:
- \(\sum_{t \in q}\): Sum over all terms \(t\) in the query \(q\)
- \(\text{IDF}(t)\): Inverse Document Frequency - measures how rare/important a term is. Formula: \(\text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}\) where \(N\) is total documents and \(n(t)\) is documents containing term \(t\)
- \(f(t, d)\): Term frequency - how many times term \(t\) appears in document \(d\)
- \(\frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{\text{avgdl}})}\): Term frequency saturation function - prevents very frequent terms from dominating
- \(k_1\): Term frequency saturation parameter (typically 1.2-2.0). Controls how quickly term frequency saturates
- \(b\): Length normalization parameter (typically 0.75). Controls how much document length affects the score
- \(|d|\): Document length (number of words/tokens)
- \(\text{avgdl}\): Average document length in the collection
Key Components Explained:
IDF Component: Rare terms (low document frequency) get higher IDF scores. This means documents containing rare, specific terms score higher than documents with common terms.
Term Frequency Saturation: The fraction \(\frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \ldots}\) ensures that:
- First occurrence of a term: high contribution
- Additional occurrences: diminishing returns (saturation)
- Prevents documents with excessive term repetition from scoring too high
Length Normalization: The term \(1 - b + b \times \frac{|d|}{\text{avgdl}}\) penalizes very long documents. Longer documents naturally have more term matches, so this normalization prevents them from always ranking highest.
Typical Parameter Values:
- \(k_1 = 1.2\): Standard value, good for most use cases
- \(b = 0.75\): Standard value, moderate length normalization
- \(b = 0\): No length normalization (rarely used)
- \(b = 1\): Full length normalization (strong penalty for long docs)
Example:
Query: "machine learning"
Document 1: "Machine learning is a subset of AI. Machine learning uses algorithms." (2 occurrences, short doc)
Document 2: "Machine learning machine learning machine learning..." (10 occurrences, very long doc)
BM25 gives Document 1 a higher score because:
- Length normalization penalizes Document 2's excessive length
- Term frequency saturation means the extra occurrences in Document 2 don't help much
- Document 1 has a better balance of term frequency and document length
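This behaviour is easy to check numerically. The sketch below computes only the term-frequency part of BM25 (the non-IDF factor) with the standard k1=1.2 and b=0.75; the term frequencies, document lengths, and average length are assumptions chosen to mirror the example above.
def bm25_tf_weight(tf: int, doc_len: int, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency saturation with length normalization (the non-IDF part of BM25)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# Document 1: tf=2 in a 12-word document; Document 2: tf=10 in a 300-word document (avgdl=30)
print(bm25_tf_weight(tf=2, doc_len=12, avgdl=30))    # ≈ 1.65
print(bm25_tf_weight(tf=10, doc_len=300, avgdl=30))  # ≈ 1.14 (5x the occurrences, lower weight)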
Why BM25 is Better Than TF-IDF:
- ✅ Better length normalization: Handles document length more intelligently
- ✅ Term frequency saturation: Prevents excessive term repetition from dominating scores
- ✅ Proven in practice: Used by major search engines and information retrieval systems
3. Inverse Document Frequency (IDF)
What This Measures:
IDF measures how "rare" or "important" a term is across the entire document collection. Rare terms (appearing in few documents) get higher IDF scores, making them more important for ranking.
\[ \text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5} \]
Breaking It Down:
- \(N\): Total number of documents in the collection
- \(n(t)\): Number of documents containing term \(t\)
- \(N - n(t)\): Number of documents NOT containing term \(t\)
- \(+ 0.5\): Smoothing factor to avoid division by zero and handle edge cases
- \(\log\): Logarithm (typically natural log or base 2) - compresses the range
Intuition:
If a term appears in many documents (common word like "the", "is"), it's not very informative - low IDF. If a term appears in few documents (rare word like "quantum", "neural"), it's very informative - high IDF.
Example:
Collection: 10,000 documents
Term "the": appears in 9,500 documents
IDF: \(\log \frac{10000 - 9500 + 0.5}{9500 + 0.5} = \log \frac{500.5}{9500.5} \approx -2.94\) (negative: a very common, uninformative word; many BM25 implementations, including the one later in this chapter, clamp negative IDF to 0)
Term "quantum": appears in 50 documents
IDF: \(\log \frac{10000 - 50 + 0.5}{50 + 0.5} = \log \frac{9950.5}{50.5} \approx 5.3\) (very high, rare and informative)
When query contains "quantum computing", documents with "quantum" score much higher because it's a rare, informative term.
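The two IDF values above can be reproduced directly (using the natural logarithm):
import math

def idf(N: int, n_t: int) -> float:
    """BM25 IDF: log((N - n(t) + 0.5) / (n(t) + 0.5))"""
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

print(idf(N=10_000, n_t=9_500))  # ≈ -2.94 for a common word like "the"
print(idf(N=10_000, n_t=50))     # ≈ 5.28 for a rare word like "quantum"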
4. Cross-Encoder Reranking Score
What This Represents:
A cross-encoder is a neural network model (typically BERT-based) that processes the query and document together in a single forward pass. The notation \([q; d]\) means the query and document are concatenated: [CLS] query [SEP] document [SEP].
\[ \text{score}(q, d) = \text{CrossEncoder}([q; d]) \]
Breaking It Down:
- \(q\): Query text
- \(d\): Document text
- \([q; d]\): Concatenated input: query and document together
- \(\text{CrossEncoder}\): Neural network model that processes the concatenated input
- Output: Relevance score (typically 0-1 or similarity score)
Why Cross-Encoder is More Accurate:
Bi-encoder (embedding-based): Processes query and document separately, then compares embeddings. Fast but loses interaction information.
Cross-encoder: Processes query and document together, allowing attention mechanisms to identify which parts of the document are most relevant to the query. More accurate but slower.
Attention Mechanism:
The cross-encoder's attention mechanism allows the model to "look" at specific parts of the document when processing the query. For example, when the query is "capital of France", the model can focus attention on the part of the document that mentions "Paris" or "capital".
Performance Trade-off:
- Accuracy: Cross-encoder typically achieves 10-30% better accuracy than bi-encoder
- Speed: Cross-encoder is 10-100x slower because it must process each (query, document) pair separately
- Use case: Rerank top-50 to top-200 candidates from initial retrieval (too slow for full collection)
Example:
Query: "How to train a neural network?"
Document: "Training deep learning models requires adjusting hyperparameters..."
Bi-encoder: Embed query and document separately, compare embeddings → score: 0.75
Cross-encoder: Process "[CLS] How to train a neural network? [SEP] Training deep learning models requires adjusting hyperparameters... [SEP]" → score: 0.92
✅ Cross-encoder sees the full interaction and understands that "neural network" and "deep learning models" are related in this context.
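A sketch of both scoring paths with sentence-transformers, using the bi-encoder and cross-encoder models that appear in this chapter; the 0.75 / 0.92 scores above are illustrative, so the numbers this code prints depend on the models and will differ.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How to train a neural network?"
doc = "Training deep learning models requires adjusting hyperparameters..."

# Bi-encoder: embed query and document separately, then compare embeddings
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
emb_query, emb_doc = bi_encoder.encode([query, doc], convert_to_tensor=True)
bi_score = util.cos_sim(emb_query, emb_doc).item()

# Cross-encoder: score the concatenated (query, document) pair in one forward pass
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
cross_score = cross_encoder.predict([(query, doc)])[0]

print(f"Bi-encoder cosine: {bi_score:.3f}  |  Cross-encoder relevance: {cross_score:.3f}")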
5. Normalized Hybrid Score
What This Does:
Before combining dense and sparse scores, they should be normalized to the same scale (typically [0, 1]). This ensures that one method doesn't dominate simply because its scores are larger.
\[ \text{cosine}_{\text{norm}} = \frac{\text{cosine} - \min_{\text{cosine}}}{\max_{\text{cosine}} - \min_{\text{cosine}}}, \qquad \text{BM25}_{\text{norm}} = \frac{\text{BM25} - \min_{\text{BM25}}}{\max_{\text{BM25}} - \min_{\text{BM25}}} \]
\[ \text{score}(q, d) = \alpha \times \text{cosine}_{\text{norm}} + (1 - \alpha) \times \text{BM25}_{\text{norm}} \]
Breaking It Down:
- Min-max normalization: \(\frac{x - \min}{\max - \min}\) scales values to [0, 1] range
- \(\min_{\text{cosine}}, \max_{\text{cosine}}\): Minimum and maximum cosine similarity scores across all documents for this query
- \(\min_{\text{BM25}}, \max_{\text{BM25}}\): Minimum and maximum BM25 scores across all documents for this query
- After normalization, both scores are in [0, 1] range, making them comparable
Why Normalization is Important:
Without normalization, if BM25 scores range from 0-100 and cosine scores range from 0-1, the BM25 component would dominate the hybrid score regardless of the \(\alpha\) value. Normalization ensures both components contribute proportionally.
Example:
For a query, you get:
- Cosine scores: [0.65, 0.72, 0.58, 0.80] → normalized: [0.32, 0.64, 0.0, 1.0]
- BM25 scores: [12.5, 45.3, 8.2, 67.8] → normalized: [0.07, 0.62, 0.0, 1.0]
Now both are on the same [0, 1] scale and can be meaningfully combined with \(\alpha = 0.7\).
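The normalization in this example can be reproduced with a few lines of NumPy, mirroring the _normalize helper in the implementation section:
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]; return all-ones if every score is (nearly) identical."""
    score_range = scores.max() - scores.min()
    if score_range < 1e-8:
        return np.ones_like(scores)
    return (scores - scores.min()) / score_range

print(min_max(np.array([0.65, 0.72, 0.58, 0.80])).round(2))  # [0.32 0.64 0.   1.  ]
print(min_max(np.array([12.5, 45.3, 8.2, 67.8])).round(2))   # [0.07 0.62 0.   1.  ]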
Detailed Examples
Step-by-Step Examples
Example: Hybrid Search
Query: "Python machine learning library"
Vector search: Finds documents semantically similar (e.g., "scikit-learn", "TensorFlow")
Keyword search: Finds documents with exact terms ("Python", "machine learning", "library")
Hybrid: Combines both, retrieves documents that are both semantically relevant AND contain keywords
Result: Better retrieval quality than either method alone.
Example: Reranking
Initial retrieval: Top-100 documents from vector search
Reranking: Use cross-encoder to score each (query, document) pair
Result: Top-5 most relevant documents after reranking
Benefit: Cross-encoder sees full query and document, more accurate than embedding similarity alone.
Implementation
Implementation Overview
This section provides practical Python code examples for implementing retrieval strategies in RAG systems. The examples demonstrate dense retrieval, sparse retrieval (BM25), hybrid retrieval, and reranking with cross-encoders. These implementations form the core of production RAG systems.
1. Complete Hybrid Retrieval System
What this does: Implements a production-ready hybrid retrieval system that combines dense (semantic) and sparse (keyword) retrieval with proper score normalization and weighting.
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Dict, Optional
class HybridRetriever:
"""
Production-ready hybrid retrieval system.
Combines dense (semantic) and sparse (keyword) retrieval
for optimal results in RAG systems.
"""
def __init__(self, documents: List[str], alpha: float = 0.7):
"""
Initialize hybrid retriever.
Args:
documents: List of document strings to index
            alpha: Weight for dense (semantic) retrieval; (1 - alpha) weights sparse.
                   alpha=0.7 means 70% dense, 30% sparse
"""
if not documents:
raise ValueError("Documents list cannot be empty")
self.documents = documents
self.alpha = alpha
# Initialize dense retriever (embedding-based)
print("Initializing embedding model...")
self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.embedder.encode(documents, show_progress_bar=True, normalize_embeddings=True)  # unit-length vectors, so dot product = cosine similarity
print(f"Generated {len(self.embeddings)} embeddings (dimension: {self.embeddings.shape[1]})")
# Initialize sparse retriever (BM25)
print("Building BM25 index...")
tokenized_docs = [self._tokenize(doc) for doc in documents]
self.bm25 = BM25Okapi(tokenized_docs)
print("BM25 index built")
def _tokenize(self, text: str) -> List[str]:
"""Tokenize text for BM25 (simple word splitting)."""
# In production, use proper tokenization (NLTK, spaCy)
return text.lower().split()
def retrieve(self, query: str, top_k: int = 5, normalize: bool = True) -> List[Dict]:
"""
Retrieve documents using hybrid approach.
Args:
query: User query string
top_k: Number of documents to retrieve
normalize: Whether to normalize scores before combining
Returns:
List of dicts with 'document', 'dense_score', 'sparse_score', 'hybrid_score'
"""
        # Step 1: Dense retrieval (dot product of normalized embeddings = cosine similarity)
        query_embedding = self.embedder.encode([query], normalize_embeddings=True)
        dense_scores = np.dot(self.embeddings, query_embedding.T).flatten()
# Step 2: Sparse retrieval (BM25 keyword search)
tokenized_query = self._tokenize(query)
sparse_scores = self.bm25.get_scores(tokenized_query)
# Step 3: Normalize scores to [0, 1] range for fair combination
if normalize:
dense_scores = self._normalize(dense_scores)
sparse_scores = self._normalize(sparse_scores)
        # Step 4: Combine scores: alpha * dense + (1 - alpha) * sparse
        hybrid_scores = self.alpha * dense_scores + (1 - self.alpha) * sparse_scores
# Step 5: Get top-k
top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]
# Format results
results = []
for idx in top_indices:
results.append({
'document': self.documents[idx],
'dense_score': float(dense_scores[idx]),
'sparse_score': float(sparse_scores[idx]),
'hybrid_score': float(hybrid_scores[idx]),
'rank': len(results) + 1
})
return results
def _normalize(self, scores: np.ndarray) -> np.ndarray:
"""Normalize scores to [0, 1] range using min-max normalization."""
min_score = scores.min()
max_score = scores.max()
if max_score - min_score < 1e-8: # Avoid division by zero
return np.ones_like(scores)
return (scores - min_score) / (max_score - min_score)
# Example usage
documents = [
"Python machine learning library scikit-learn provides tools for data analysis",
"TensorFlow is a deep learning framework for neural networks",
"Machine learning algorithms learn patterns from data automatically",
"Natural language processing helps computers understand human language",
"Python programming language is popular for data science"
]
retriever = HybridRetriever(documents, alpha=0.6)  # 60% dense, 40% sparse
# Query
query = "Python machine learning"
results = retriever.retrieve(query, top_k=3)
print(f"\nQuery: '{query}'")
print(f"\nTop-3 Results:")
for result in results:
print(f"\nRank {result['rank']}:")
print(f" Document: {result['document']}")
print(f" Dense score: {result['dense_score']:.3f}")
print(f" Sparse score: {result['sparse_score']:.3f}")
print(f" Hybrid score: {result['hybrid_score']:.3f}")
Key Points:
- Score normalization: Critical for combining dense and sparse scores fairly
- Alpha parameter: Tune alpha (0.3-0.7) based on your use case
- Performance: Hybrid is slightly slower than either method alone, but provides better results
- Production use: This is the standard approach in most production RAG systems
2. BM25 Implementation from Scratch
What this does: Implements BM25 algorithm from scratch to understand how it works. For production, use libraries like `rank-bm25`, but this shows the math.
import math
from collections import Counter
from typing import List, Dict
class BM25:
"""
BM25 (Best Matching 25) ranking function implementation.
BM25 is a probabilistic ranking function that scores documents
based on term frequency, inverse document frequency, and document length.
"""
def __init__(self, documents: List[str], k1: float = 1.2, b: float = 0.75):
"""
Initialize BM25 index.
Args:
documents: List of document strings
k1: Term frequency saturation parameter (typically 1.2)
b: Length normalization parameter (typically 0.75)
"""
self.k1 = k1
self.b = b
self.documents = documents
# Tokenize all documents
self.tokenized_docs = [self._tokenize(doc) for doc in documents]
# Calculate document frequencies and average document length
self.doc_freqs = self._calculate_document_frequencies()
self.avg_doc_length = sum(len(doc) for doc in self.tokenized_docs) / len(documents)
self.total_docs = len(documents)
def _tokenize(self, text: str) -> List[str]:
"""Simple tokenization (in production, use proper tokenizer)."""
return text.lower().split()
def _calculate_document_frequencies(self) -> Dict[str, int]:
"""Calculate how many documents contain each term."""
doc_freqs = Counter()
for doc_tokens in self.tokenized_docs:
unique_terms = set(doc_tokens)
for term in unique_terms:
doc_freqs[term] += 1
return doc_freqs
def _idf(self, term: str) -> float:
"""
Calculate Inverse Document Frequency (IDF) for a term.
Formula: log((N - n(t) + 0.5) / (n(t) + 0.5))
where N = total docs, n(t) = docs containing term
"""
if term not in self.doc_freqs:
return 0.0
n_t = self.doc_freqs[term] # Number of documents containing term
# BM25 IDF formula
idf = math.log((self.total_docs - n_t + 0.5) / (n_t + 0.5))
return max(0.0, idf) # Ensure non-negative
def score(self, query: str, doc_index: int) -> float:
"""
Calculate BM25 score for a query-document pair.
Args:
query: Query string
doc_index: Index of document in documents list
Returns:
BM25 score
"""
query_terms = self._tokenize(query)
doc_tokens = self.tokenized_docs[doc_index]
doc_length = len(doc_tokens)
score = 0.0
for term in query_terms:
# Term frequency in document
tf = doc_tokens.count(term)
if tf == 0:
continue # Term not in document, skip
# IDF component
idf = self._idf(term)
# Term frequency saturation component
numerator = tf * (self.k1 + 1)
denominator = tf + self.k1 * (1 - self.b + self.b * (doc_length / self.avg_doc_length))
# BM25 score for this term
term_score = idf * (numerator / denominator)
score += term_score
return score
def get_scores(self, query: str) -> List[float]:
"""Get BM25 scores for query against all documents."""
return [self.score(query, i) for i in range(len(self.documents))]
def get_top_k(self, query: str, k: int = 5) -> List[tuple]:
"""
Get top-k documents for query.
Returns:
List of (document_index, score) tuples, sorted by score descending
"""
scores = self.get_scores(query)
# Get indices sorted by score (descending)
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
return [(i, scores[i]) for i in top_indices]
# Example usage
documents = [
"machine learning algorithms learn from data",
"deep learning uses neural networks",
"machine learning is a subset of artificial intelligence",
"neural networks are used in deep learning"
]
bm25 = BM25(documents)
query = "machine learning"
scores = bm25.get_scores(query)
print(f"BM25 scores for query '{query}':")
for i, (doc, score) in enumerate(zip(documents, scores)):
print(f" Doc {i+1}: {score:.3f} - {doc}")
top_k = bm25.get_top_k(query, k=2)
print(f"\nTop-2 documents:")
for doc_idx, score in top_k:
print(f" Score: {score:.3f}")
print(f" Document: {documents[doc_idx]}\n")
Key Points:
- IDF calculation: Rare terms get higher IDF (more informative)
- Term frequency saturation: Prevents excessive term repetition from dominating
- Length normalization: Prevents long documents from always ranking highest
- Parameters: k1=1.2, b=0.75 are standard values, but can be tuned
3. Cross-Encoder Reranking Implementation
What this does: Implements reranking using cross-encoder models to refine initial retrieval results for better accuracy.
from sentence_transformers import CrossEncoder
import numpy as np
from typing import List, Dict
class Reranker:
"""
Cross-encoder based reranker for improving retrieval accuracy.
Takes top-k candidates from initial retrieval and reranks them
using a more sophisticated model that sees query-document interaction.
"""
def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
"""
Initialize reranker with cross-encoder model.
Args:
model_name: Pre-trained cross-encoder model name
"""
print(f"Loading cross-encoder model: {model_name}...")
self.model = CrossEncoder(model_name)
print("Model loaded")
def rerank(self, query: str, documents: List[str], top_k: int = 3) -> List[Dict]:
"""
Rerank documents for a query.
Args:
query: User query
documents: List of candidate documents to rerank
top_k: Number of top documents to return
Returns:
List of dicts with 'document', 'score', 'rank'
"""
if not documents:
return []
# Create query-document pairs
# Format: [query, document] for each document
pairs = [[query, doc] for doc in documents]
# Get relevance scores from cross-encoder
# Higher score = more relevant
scores = self.model.predict(pairs)
# Sort by score (descending) and get top-k
ranked_indices = np.argsort(scores)[::-1][:top_k]
# Format results
results = []
for rank, idx in enumerate(ranked_indices, 1):
results.append({
'document': documents[idx],
'score': float(scores[idx]),
'rank': rank
})
return results
def rerank_batch(self, queries: List[str], documents: List[str], top_k: int = 3) -> List[List[Dict]]:
"""
Rerank documents for multiple queries (batch processing).
Args:
queries: List of queries
documents: List of candidate documents (same for all queries)
top_k: Number of top documents per query
Returns:
List of results (one list per query)
"""
all_results = []
for query in queries:
results = self.rerank(query, documents, top_k)
all_results.append(results)
return all_results
# Example: Two-stage retrieval pipeline
class TwoStageRetriever:
"""
Complete two-stage retrieval: fast initial retrieval + accurate reranking.
This is the standard production approach:
1. Fast initial retrieval (dense/sparse/hybrid) gets top-100 candidates
2. Reranker refines to top-5 most relevant
"""
def __init__(self, initial_retriever, reranker):
self.initial_retriever = initial_retriever # HybridRetriever or similar
self.reranker = reranker
def retrieve(self, query: str, initial_k: int = 100, final_k: int = 5):
"""
Two-stage retrieval pipeline.
Args:
query: User query
initial_k: Number of candidates from initial retrieval
final_k: Number of final results after reranking
"""
# Stage 1: Fast initial retrieval
print(f"Stage 1: Initial retrieval (top-{initial_k})...")
initial_results = self.initial_retriever.retrieve(query, top_k=initial_k)
initial_docs = [r['document'] for r in initial_results]
# Stage 2: Rerank top candidates
print(f"Stage 2: Reranking top-{initial_k} to get top-{final_k}...")
reranked = self.reranker.rerank(query, initial_docs, top_k=final_k)
return reranked
# Example usage
# Initialize components (reuses the `documents` list and the HybridRetriever class defined above)
initial_retriever = HybridRetriever(documents, alpha=0.6)
reranker = Reranker()
# Two-stage retrieval
two_stage = TwoStageRetriever(initial_retriever, reranker)
query = "Python machine learning"
final_results = two_stage.retrieve(query, initial_k=10, final_k=3)
print(f"\nFinal top-3 after reranking:")
for result in final_results:
print(f"Rank {result['rank']}: {result['document']} (score: {result['score']:.3f})")
Key Points:
- Two-stage approach: Fast initial retrieval (100-200 docs) → accurate reranking (top 5-10)
- Accuracy improvement: Reranking typically improves precision@5 by 10-30%
- Latency trade-off: Adds 50-200ms but significantly improves quality
- Production standard: Most production RAG systems use this approach
Installation Requirements
Install the required packages:
pip install sentence-transformers rank-bm25 numpy
Note: For production, consider using vector databases (Pinecone, Chroma) for initial retrieval instead of in-memory implementations shown here.
Real-World Applications
Where This Is Used
Retrieval Strategies in Practice
When to use vector search: Semantic similarity is important, synonyms matter, multilingual content
When to use keyword search: Exact term matching needed, domain-specific terminology
When to use hybrid: Best of both worlds, production systems often use this
When to use reranking: Need highest accuracy, can afford extra latency
Performance Trade-offs
Vector search: strong semantic understanding, but computationally heavier and may miss exact keyword matches
Keyword search: Very fast, exact matches, but misses synonyms
Hybrid: Balanced, combines strengths, slightly slower
Reranking: Most accurate, but slowest (processes each candidate)