Chapter 6: Advanced RAG Techniques
Improving Performance
Learning Objectives
- Understand the fundamentals of advanced RAG techniques
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Advanced RAG Techniques
Beyond Basic RAG: Optimizing for Real-World Performance
Basic RAG (retrieve top-k documents, pass to LLM, generate answer) works well for simple queries, but real-world applications face complex challenges: ambiguous queries, context window limits, multi-part questions, and the need for higher accuracy. Advanced RAG techniques address these challenges to significantly improve system performance.
Common problems with basic RAG:
- ❌ Query ambiguity: Short queries like "Python ML" are unclear - does the user want Python machine learning libraries, Python ML algorithms, or something else?
- ❌ Context overflow: Retrieved chunks might exceed LLM context window limits, forcing truncation and loss of information
- ❌ Incomplete answers: Complex questions require information from multiple documents, but basic RAG retrieves once and generates once
- ❌ Precision vs context trade-off: Small chunks are precise but lack context; large chunks have context but are less precise
- ❌ Single retrieval limitation: One retrieval pass might miss relevant documents that use different terminology
Advanced Techniques That Solve These Problems
- Query Expansion: Automatically expand or rewrite queries to include synonyms, related terms, and alternative phrasings before retrieval
- Multi-Query Retrieval: Generate multiple query variations, retrieve for each, and combine results for comprehensive coverage
- Parent-Child Chunking: Store small chunks for precise retrieval, but include parent document context when generating answers
- Context Compression: Summarize or extract only relevant parts of retrieved chunks to fit within context windows
- Iterative Retrieval: If initial answer is incomplete, generate follow-up queries and retrieve more context (multi-hop retrieval)
- Self-RAG: Autonomous system where the model decides when to retrieve, when to generate, and when to stop
Example: Multi-Query Retrieval in Action
Original query: "How to train a neural network?"
Generated variations:
- "neural network training process"
- "how to train deep learning models"
- "backpropagation and gradient descent for neural networks"
Result: Each variation retrieves slightly different documents. Combined, you get comprehensive coverage of the topic - you don't miss relevant information that uses different terminology.
Key Concepts You'll Learn
- Query Expansion: Techniques to improve retrieval by expanding queries with synonyms and related terms
- Multi-Query Retrieval: Generating multiple query variations and combining results for better coverage
- Parent-Child Chunking: Hierarchical chunking strategy that balances retrieval precision with generation context
- Context Compression: Methods to reduce retrieved context size (summarization, sentence extraction) while preserving important information
- Iterative Retrieval: Multi-hop retrieval where follow-up queries retrieve additional context if initial answer is incomplete
- Self-RAG: Advanced autonomous RAG where the model controls retrieval and generation decisions
- Metadata Filtering: Combining vector search with traditional filters to reduce search space and improve precision
Why this matters: These advanced techniques can substantially improve RAG system accuracy compared to basic RAG, with reported gains often in the 20-40% range depending on the dataset and baseline. They're essential for production systems where accuracy, completeness, and user experience are critical. Understanding when and how to apply these techniques is what separates basic RAG implementations from production-grade systems.
Key Concepts
Advanced RAG Techniques: Beyond Basic Retrieval
Basic RAG (retrieve top-k documents, pass to LLM) works well for simple queries, but real-world applications often need more sophisticated techniques to handle complex queries, improve accuracy, and optimize performance.
1. Query Expansion: Improving Retrieval Coverage
What it is: Query expansion rewrites or augments the user's query to include synonyms, related terms, or alternative phrasings before retrieval. This helps find documents that use different terminology than the original query.
Why it's needed: Users often write short, ambiguous queries. A query like "Python ML" could mean many things, and documents might use different terminology ("machine learning," "deep learning," "scikit-learn," etc.).
How it works:
- Synonym expansion: Add synonyms for key terms (e.g., "ML" → "machine learning," "deep learning")
- Related term expansion: Add related concepts (e.g., "Python ML" → "scikit-learn," "TensorFlow," "PyTorch")
- LLM-based expansion: Use an LLM to generate query variations or rewrite the query more clearly
- Retrieve with expanded query: Use the expanded query for retrieval (or retrieve for each variation and combine)
Example: Query Expansion
Original query: "Python ML"
Expanded query: "Python machine learning libraries scikit-learn TensorFlow PyTorch data science"
Result: ✅ Retrieves documents that mention "scikit-learn" or "TensorFlow" even if they don't contain "Python ML" as a phrase.
When to use: When queries are short, ambiguous, or use abbreviations. Particularly useful for technical domains with lots of synonyms and jargon.
2. Multi-Query Retrieval: Generating Multiple Query Variations
What it is: Instead of retrieving with a single query, generate multiple query variations, retrieve documents for each variation, then combine and deduplicate the results.
Why it works: Different phrasings of the same question might retrieve different relevant documents. By generating multiple variations, you increase the chance of finding all relevant information.
How it works:
- Generate query variations: Use an LLM to generate 3-5 different phrasings of the original query
- Retrieve for each: Run retrieval for each query variation
- Combine results: Merge all retrieved documents, removing duplicates
- Rerank: Rerank the combined results to get the most relevant documents
Example: Multi-Query Retrieval
Original query: "How to train a neural network?"
Generated variations:
- "neural network training process"
- "how to train deep learning models"
- "backpropagation and gradient descent for neural networks"
Result: ✅ Each variation might retrieve slightly different documents, giving you comprehensive coverage of the topic.
When to use: For complex queries where a single phrasing might miss relevant documents. Particularly effective when combined with query expansion.
3. Parent-Child Chunking: Balancing Precision and Context
What it is: A hierarchical chunking strategy where you store small, precise chunks for retrieval (children), but also store larger parent chunks that contain surrounding context.
The problem it solves: Small chunks are better for precise retrieval (more likely to be fully relevant), but they lack context. Large chunks have more context but are less precise (may contain irrelevant information).
How it works:
- Create parent chunks: Split documents into larger chunks (e.g., 1000 tokens) that preserve context
- Create child chunks: Split each parent into smaller chunks (e.g., 200 tokens) for precise retrieval
- Store both: Embed and store both parent and child chunks, with metadata linking children to parents
- Retrieve children: Use small child chunks for retrieval (precise matching)
- Include parent context: When a child is retrieved, also include its parent chunk in the context sent to the LLM
Example: Parent-Child Chunking
Parent chunk (1000 tokens): "Machine learning models require careful tuning. Hyperparameters like learning rate significantly impact performance. Regularization techniques help prevent overfitting..."
Child chunk 1 (200 tokens): "Hyperparameters like learning rate significantly impact performance."
Query: "What is learning rate?"
Result: ✅ Child chunk 1 is retrieved (precise match), but parent chunk is also included, providing context about hyperparameters and regularization.
When to use: When you need both precise retrieval (small chunks) and rich context (large chunks). Common in production RAG systems.
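To make the parent-child linkage concrete, here is a minimal sketch. It uses word counts instead of real token counts, and the `build_parent_child_chunks` / `expand_to_parent` helpers are illustrative names rather than a library API; a production system would embed the child chunks and store the `parent_id` as metadata in a vector store.

```python
def build_parent_child_chunks(text, parent_size=200, child_size=50):
    """Split text into parent chunks, then split each parent into child chunks."""
    words = text.split()
    parents, children = [], []
    for p_idx, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        parents.append({"parent_id": p_idx, "text": " ".join(parent_words)})
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append({"parent_id": p_idx, "text": " ".join(child_words)})
    return parents, children

def expand_to_parent(child, parents):
    """Given a retrieved child chunk, return its parent's text for generation."""
    return parents[child["parent_id"]]["text"]

# Usage: run retrieval over `children` (precise matches), then call
# expand_to_parent on each hit so the LLM sees the surrounding context.
```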
4. Metadata Filtering: Reducing Search Space
What it is: Apply filters based on document metadata (date, author, category, source) before performing vector similarity search. This reduces the number of documents to search.
Why it's important: Searching 1 million documents is slow. If you can filter to 50,000 relevant documents first (e.g., "only documents from 2023"), search becomes much faster and more accurate.
How it works:
- Extract metadata: When indexing, extract and store metadata (date, category, author, etc.)
- Apply filters: Before vector search, filter documents by metadata criteria
- Search filtered set: Perform vector similarity search only on filtered documents
- Return results: Return top-k documents from the filtered search
When to use: When documents have meaningful metadata and queries can be scoped (e.g., "recent articles," "technical documentation," "from specific author").
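A minimal filter-then-search sketch, assuming sentence-transformers for embeddings. The document list and metadata fields (`year`, `category`) are made up for illustration; a real system would push the filter down into the vector database instead of filtering in Python.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    {"text": "Transformer models for NLP", "year": 2023, "category": "technical"},
    {"text": "A history of punch cards", "year": 1998, "category": "history"},
    {"text": "Fine-tuning LLMs in practice", "year": 2023, "category": "technical"},
]

def filtered_search(query, docs, top_k=2, **filters):
    # Step 1: apply metadata filters (e.g., year=2023, category="technical")
    candidates = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    if not candidates:
        return []
    # Step 2: vector similarity search only over the filtered candidates
    doc_emb = embedder.encode([d["text"] for d in candidates], convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

for doc, score in filtered_search("LLM training", docs, top_k=2,
                                  year=2023, category="technical"):
    print(f"{score:.3f}  {doc['text']}")
```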
Context Compression: Fitting Retrieved Information into LLM Windows
The Problem
After retrieving top-k documents, you might have 10,000+ tokens of context. But LLMs have context window limits (e.g., GPT-4: 8K-128K tokens, Claude: 100K-200K tokens). If your query + context + answer exceeds the limit, you need to compress the context.
Context Compression Techniques
1. Summarization
What it is: Use an LLM to summarize each retrieved chunk, reducing token count while preserving key information.
How it works:
- For each retrieved chunk, create a summary prompt: "Summarize this text in 2-3 sentences: [chunk text]"
- Generate summaries (much shorter than originals)
- Use summaries as context instead of full chunks
Trade-offs: ✅ Reduces tokens significantly. ❌ May lose important details. ⚠️ Adds latency (need to summarize each chunk).
2. Sentence Extraction
What it is: Extract only the most relevant sentences from retrieved chunks, discarding the rest.
How it works:
- Score each sentence in retrieved chunks for relevance to the query (using embeddings or LLM)
- Select top-N most relevant sentences
- Use only these sentences as context
Trade-offs: ✅ Very fast. ✅ Preserves exact information (no summarization loss). ❌ May lose context between sentences.
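A small sketch of sentence extraction using embedding similarity (sentence-transformers assumed); the period-based sentence splitter is deliberately naive and would be replaced by a proper sentence tokenizer in practice.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def extract_relevant_sentences(query, chunks, top_n=3):
    """Keep only the sentences most similar to the query, in original order."""
    sentences = [s.strip() for chunk in chunks for s in chunk.split('.') if s.strip()]
    sent_emb = embedder.encode(sentences, convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]
    top_idx = scores.argsort(descending=True)[:top_n]
    # Re-sort the selected indices so the extracted sentences stay readable
    return ". ".join(sentences[i] for i in sorted(top_idx.tolist())) + "."
```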
3. LLM-Based Compression
What it is: Use an LLM to compress the entire retrieved context into a shorter version that preserves information relevant to the query.
How it works:
- Prompt: "Compress this context to [target tokens] while preserving all information relevant to: [query]"
- LLM generates compressed version
- Use compressed context for final answer generation
Trade-offs: ✅ Most intelligent compression. ✅ Preserves query-relevant information. ❌ Expensive and slow. ❌ May introduce hallucinations.
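A sketch of the prompt-based approach, assuming the OpenAI Python client (v1+); the model name and token target are placeholders, and any chat-capable LLM could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compress_with_llm(query, context, target_tokens=500):
    # Ask the LLM to compress the context while keeping query-relevant facts
    prompt = (
        f"Compress the following context to roughly {target_tokens} tokens while "
        f"preserving all information relevant to this question: {query}\n\n"
        f"Context:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```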
4. Relevance-Based Prioritization
What it is: Instead of compressing, prioritize chunks by relevance score and include only the most relevant ones until you hit the token limit.
How it works:
- Sort retrieved chunks by similarity score (most relevant first)
- Add chunks to context one by one until you reach token limit
- Discard remaining chunks
Trade-offs: ✅ Fast and simple. ✅ No information loss (just truncation). ❌ May miss important information in lower-ranked chunks.
Best Practices
- Combine techniques: Use summarization for less relevant chunks, full text for most relevant
- Reserve tokens for answer: Leave 20-30% of context window for the LLM's answer
- Monitor compression quality: Track answer quality with and without compression
Iterative Retrieval: Multi-Hop and Self-RAG
1. Multi-Hop Retrieval
What it is: If the initial answer is incomplete or the model needs more information, generate follow-up queries, retrieve additional context, and repeat until the answer is complete.
Why it's needed: Complex questions often require information from multiple documents. A single retrieval might not find all necessary information.
How it works:
- Initial retrieval: Retrieve top-k documents for the original query
- Generate answer: Attempt to generate answer from retrieved context
- Check completeness: Determine if answer is complete (using LLM or heuristics)
- Generate follow-up query: If incomplete, generate a new query to find missing information
- Retrieve again: Retrieve documents for the follow-up query
- Combine context: Add new context to existing context
- Generate final answer: Generate answer from combined context
- Repeat if needed: Continue until answer is complete or max iterations reached
Example: Multi-Hop Retrieval
Query: "What is the capital of France and what is its population?"
Hop 1: Query "capital of France" → Retrieves document about Paris being the capital
Hop 2: Query "Paris population" → Retrieves document with population statistics
Result: ✅ Combines information from both hops to answer the complete question.
When to use: For complex, multi-part questions that require information from multiple sources. Common in question-answering systems.
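A sketch of the multi-hop loop described above. Here `retriever` and `llm` are hypothetical stand-ins (a retriever with a `retrieve(query, top_k)` method returning dicts with a `document` key, and an `llm(prompt)` callable returning a string); asking the LLM to judge completeness is one common option, with cheaper heuristics as an alternative.

```python
def multi_hop_answer(question, retriever, llm, max_hops=3, top_k=5):
    context_chunks = []
    query = question
    answer = ""
    for hop in range(max_hops):
        # Retrieve for the current query and accumulate context across hops
        context_chunks += retriever.retrieve(query, top_k=top_k)
        context = "\n\n".join(c["document"] for c in context_chunks)
        answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        # Completeness check: the LLM either declares COMPLETE or proposes
        # a follow-up query for the next hop
        verdict = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer complete? Reply COMPLETE or write one follow-up search query."
        )
        if verdict.strip().upper().startswith("COMPLETE"):
            return answer
        query = verdict  # use the follow-up query for the next hop
    return answer  # best effort after max_hops
```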
2. Self-RAG: Autonomous Retrieval and Generation
What it is: An advanced technique where the model itself decides when to retrieve, what to retrieve, when to generate, and when to stop—making the RAG system more autonomous.
How it works:
- Retrieve decision: Model decides if retrieval is needed (some queries can be answered from its training data)
- Query generation: If retrieval needed, model generates the retrieval query
- Retrieval: Retrieve documents using generated query
- Relevance check: Model evaluates if retrieved documents are relevant
- Generate or retrieve again: If relevant, generate answer; if not, retrieve again with different query
- Self-critique: Model evaluates its own answer for completeness and accuracy
- Iterate or stop: If answer is good, stop; if not, continue retrieving and generating
Advantages:
- ✅ More autonomous: Doesn't always retrieve (saves cost when not needed)
- ✅ Adaptive: Adjusts retrieval strategy based on query complexity
- ✅ Self-correcting: Can identify when answers are incomplete and retrieve more
Challenges:
- ❌ Complex to implement: Requires fine-tuning or prompt engineering
- ❌ Higher latency: Multiple decision points add time
- ❌ Cost: More LLM calls (retrieve decisions, query generation, self-critique)
When to use: For high-value applications where accuracy is critical and you can afford the complexity and cost. Still experimental but promising.
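A heavily simplified control-loop sketch of these decision points. Real Self-RAG fine-tunes the model to emit special reflection tokens; this version approximates each decision with a plain prompt, and `retriever` and `llm` are the same hypothetical stand-ins as in the multi-hop sketch.

```python
def self_rag(question, retriever, llm, max_rounds=3):
    answer = ""
    for _ in range(max_rounds):
        # 1. Decide whether retrieval is needed at all
        need = llm(f"Question: {question}\nDo you need external documents? YES or NO.")
        context = ""
        if need.strip().upper().startswith("YES"):
            # 2. Generate a retrieval query and fetch documents
            search_query = llm(f"Write a search query to answer: {question}")
            docs = retriever.retrieve(search_query, top_k=5)
            context = "\n\n".join(d["document"] for d in docs)
        # 3. Generate, then self-critique the answer
        answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer complete and grounded in the context? YES or NO."
        )
        if critique.strip().upper().startswith("YES"):
            return answer
    return answer  # best effort after max_rounds
```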
Mathematical Formulations
Advanced RAG Mathematical Models
Advanced RAG techniques involve mathematical models for generation, context management, and quality evaluation. Understanding these formulas helps you optimize context usage, implement iterative retrieval, and measure answer quality.
1. RAG Generation Probability
\[
P(\text{answer} | \text{query}, \text{context}) = \prod_{i=1}^{n} P(\text{token}_i | \text{query}, \text{context}, \text{tokens}_{<i})
\]
What This Formula Represents:
This is the fundamental probability model for RAG generation. The LLM generates the answer token by token, where each token's probability depends on the query, the retrieved context, and all previously generated tokens.
Breaking It Down:
- \(P(\text{answer} | \text{query}, \text{context})\): Probability of generating the entire answer given the query and retrieved context
- \(\prod_{i=1}^{n}\): Product over all \(n\) tokens in the answer (multiplication of probabilities)
- \(P(\text{token}_i | \text{query}, \text{context}, \text{tokens}_{<i})\): Probability of generating token \(i\) given:
- The query (user's question)
- The context (retrieved documents)
- All previous tokens (\(\text{tokens}_{<i}\))
Why It's a Product:
To generate the full answer, the model must generate token 1, THEN token 2 given token 1, THEN token 3 given tokens 1 and 2, and so on. The probability of the entire sequence is the product of these conditional probabilities (chain rule of probability).
Key Insight:
The formula shows that the answer depends on BOTH the query AND the context. This is what makes RAG different from standard LLM generation - the context from retrieved documents directly influences each token's probability.
Example:
Query: "What is the capital of France?"
Context: "France is a country in Europe. Its capital city is Paris."
The model generates:
- Token 1 ("The"): \(P(\text{"The"} | \text{query}, \text{context})\) - high probability because context mentions "capital"
- Token 2 ("capital"): \(P(\text{"capital"} | \text{query}, \text{context}, \text{"The"})\) - high probability, matches context
- Token 3 ("of"): \(P(\text{"of"} | \text{query}, \text{context}, \text{"The capital"})\) - high probability, grammatical continuation
- Token 4 ("France"): \(P(\text{"France"} | \text{query}, \text{context}, \text{"The capital of"})\) - high probability, matches query
- Token 5 ("is"): \(P(\text{"is"} | \text{query}, \text{context}, \text{"The capital of France"})\) - high probability
- Token 6 ("Paris"): \(P(\text{"Paris"} | \text{query}, \text{context}, \text{"The capital of France is"})\) - very high probability, directly from context!
The final answer "The capital of France is Paris" has high probability because each token is likely given the context.
2. Context Window Constraint
\[
|\text{query}| + |\text{context}| + |\text{answer}| \leq \text{max\_context\_length}
\]
What This Constraint Enforces:
LLMs have fixed context window limits. The total number of tokens (query + retrieved context + generated answer) must fit within this limit. This constraint determines how much context you can include and how long answers can be.
Breaking It Down:
- \(|\text{query}|\): Number of tokens in the user's query (typically 10-50 tokens)
- \(|\text{context}|\): Number of tokens in retrieved documents (can be 500-4000+ tokens depending on top-k and chunk size)
- \(|\text{answer}|\): Number of tokens in the generated answer (variable, depends on query complexity)
- \(\text{max\_context\_length}\): Maximum tokens the LLM can process (e.g., GPT-4: 8K-128K, Claude: 100K-200K, GPT-3.5: 4K-16K)
Practical Implications:
If your context window is 4,000 tokens and your query is 50 tokens, you have ~3,950 tokens available for context + answer. If you retrieve 3,000 tokens of context, you only have ~950 tokens left for the answer. This is why context compression is important!
Example:
Scenario: GPT-3.5-turbo with 4,096 token context window
- Query: 30 tokens
- Retrieved context: 3,500 tokens (5 chunks × 700 tokens each)
- Available for answer: 4,096 - 30 - 3,500 = 566 tokens
- ✅ Fits, but close to limit
Problem scenario:
- Query: 30 tokens
- Retrieved context: 4,200 tokens (too much!)
- Available for answer: 4,096 - 30 - 4,200 = -134 tokens ❌
- Solution: Compress context to 3,000 tokens, leaving 1,066 tokens for answer
Strategies to Handle This:
- Context compression: Summarize or extract relevant parts to reduce \(|\text{context}|\)
- Prioritize chunks: Include only top-k most relevant chunks
- Reserve tokens: Leave 20-30% of context window for answer generation
- Use larger context windows: GPT-4 (128K) or Claude (200K) for very long contexts
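As a quick way to enforce this constraint, a budget check like the one below can run before prompt construction. It assumes the tiktoken tokenizer (`pip install tiktoken`); the 4,096-token window and 30% answer reserve are example values.

```python
import tiktoken

def fits_in_window(query, context_chunks, max_context_length=4096, answer_reserve=0.3):
    """Check that query + context fit in the window, leaving room for the answer."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = len(enc.encode(query)) + sum(len(enc.encode(c)) for c in context_chunks)
    budget = int(max_context_length * (1 - answer_reserve))  # reserve tokens for answer
    return used <= budget, used, budget

ok, used, budget = fits_in_window("What is RAG?", ["chunk one ...", "chunk two ..."])
print(f"used={used} tokens, budget={budget} tokens, fits={ok}")
```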
3. Answer Quality Score
\[
\text{quality} = \alpha \cdot \text{relevance} + \beta \cdot \text{faithfulness} + \gamma \cdot \text{completeness}, \qquad \alpha + \beta + \gamma = 1
\]
What This Formula Measures:
Answer quality in RAG systems is multi-dimensional. This formula combines three critical dimensions into a single quality score, allowing you to evaluate and optimize RAG system performance.
Breaking It Down:
- \(\text{relevance}\): How well the answer addresses the query. Measured by semantic similarity between query and answer embeddings, or by human/LM evaluation. Range: [0, 1]
- \(\text{faithfulness}\): How well the answer is grounded in the retrieved context (not hallucinated). Measured as fraction of answer claims supported by context. Range: [0, 1]
- \(\text{completeness}\): How complete the answer is - does it fully address all parts of the query? Measured by coverage of query aspects. Range: [0, 1]
- \(\alpha, \beta, \gamma\): Weighting factors that sum to 1.0, controlling the relative importance of each dimension
Typical Weightings:
- Balanced: \(\alpha = 0.4, \beta = 0.4, \gamma = 0.2\) - Equal emphasis on relevance and faithfulness, less on completeness
- Faithfulness-focused: \(\alpha = 0.3, \beta = 0.5, \gamma = 0.2\) - Emphasizes avoiding hallucinations (critical for factual domains)
- Completeness-focused: \(\alpha = 0.3, \beta = 0.3, \gamma = 0.4\) - Emphasizes comprehensive answers (good for complex queries)
Example:
Query: "What are the main advantages of RAG?"
Answer 1: "RAG is good."
Relevance: 0.6 (somewhat relevant but vague)
Faithfulness: 1.0 (supported by context)
Completeness: 0.2 (very incomplete)
Quality (α=0.4, β=0.4, γ=0.2): \(0.4 \times 0.6 + 0.4 \times 1.0 + 0.2 \times 0.2 = 0.68\)
Answer 2: "RAG has several advantages: no training required, easy to update knowledge, can cite sources, works with any LLM."
Relevance: 0.95 (highly relevant)
Faithfulness: 0.9 (mostly supported, minor elaboration)
Completeness: 0.9 (covers main advantages)
Quality (α=0.4, β=0.4, γ=0.2): \(0.4 \times 0.95 + 0.4 \times 0.9 + 0.2 \times 0.9 = 0.92\)
✅ Answer 2 scores much higher due to better relevance and completeness.
Using This for Optimization:
Track quality scores over time. If faithfulness drops, you might need better retrieval or context. If relevance drops, you might need better query understanding. If completeness drops, you might need to retrieve more documents or use iterative retrieval.
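The weighted score is straightforward to compute once you have the component scores. In this sketch the component values are the illustrative numbers from the example above; in practice they come from an evaluator such as a judge LLM or an evaluation framework.

```python
def answer_quality(relevance, faithfulness, completeness,
                   alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted combination of the three quality dimensions (weights sum to 1)."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights must sum to 1"
    return alpha * relevance + beta * faithfulness + gamma * completeness

print(answer_quality(0.60, 1.0, 0.2))   # ≈ 0.68 (the vague answer)
print(answer_quality(0.95, 0.9, 0.9))   # ≈ 0.92 (the complete answer)
```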
4. Multi-Query Retrieval Coverage
\[
\text{coverage} = \frac{\left|\left(\bigcup_{i=1}^{m} \text{Retrieve}(q_i, D)\right) \cap D_{\text{relevant}}\right|}{|D_{\text{relevant}}|}
\]
What This Measures:
Multi-query retrieval generates \(m\) query variations and retrieves documents for each. This formula measures what fraction of all relevant documents were found by at least one query variation.
Breaking It Down:
- \(q_1, q_2, \ldots, q_m\): \(m\) different query variations (e.g., 3-5 variations)
- \(\text{Retrieve}(q_i, D)\): Documents retrieved for query variation \(q_i\)
- \(\bigcup_{i=1}^{m}\): Union of all retrieved document sets (removes duplicates)
- \(|D_{\text{relevant}}|\): Total number of relevant documents in the knowledge base
- Coverage: Fraction of relevant documents found by at least one query variation
Why Multi-Query Improves Coverage:
Different query phrasings might retrieve different relevant documents. By generating multiple variations and taking the union, you increase the chance of finding all relevant documents.
Example:
Original query: "How to train a neural network?"
Query variations:
- \(q_1\): "neural network training process" → retrieves docs: {A, B, C}
- \(q_2\): "how to train deep learning models" → retrieves docs: {B, D, E}
- \(q_3\): "backpropagation and gradient descent" → retrieves docs: {C, F, G}
Union: {A, B, C, D, E, F, G} (7 unique documents)
If single query \(q_1\) only retrieved {A, B, C}, multi-query found 4 additional relevant documents (D, E, F, G).
Coverage improvement: From 3/7 = 43% to 7/7 = 100% of relevant documents found.
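The same calculation in code, using plain Python sets and the document IDs from the example:

```python
# Coverage computation for the example above
retrieved_per_query = [{"A", "B", "C"}, {"B", "D", "E"}, {"C", "F", "G"}]
relevant = {"A", "B", "C", "D", "E", "F", "G"}

union = set().union(*retrieved_per_query)
coverage = len(union & relevant) / len(relevant)
single_coverage = len(retrieved_per_query[0] & relevant) / len(relevant)
print(f"multi-query coverage: {coverage:.0%}")    # 100%
print(f"single-query coverage: {single_coverage:.0%}")  # 43%
```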
5. Context Compression Ratio
\[
\text{compression ratio} = \frac{|\text{original\_context}|}{|\text{compressed\_context}|}
\]
What This Measures:
When retrieved context exceeds the LLM's context window, you need to compress it. This formula measures the compression achieved (how much smaller the compressed context is compared to the original).
Breaking It Down:
- \(|\text{original\_context}|\): Token count of original retrieved context
- \(|\text{compressed\_context}|\): Token count after compression (summarization, extraction, etc.)
- Compression ratio: How many times smaller the compressed version is
Example:
Original context: 5,000 tokens (5 retrieved chunks × 1,000 tokens each)
Compressed context: 2,000 tokens (summarized chunks)
Compression ratio: \(\frac{5000}{2000} = 2.5\)
This means the compressed context is 2.5 times smaller, allowing you to fit more chunks or leave more room for the answer.
Trade-off:
Higher compression ratio = more space saved but risk of losing important details. Lower compression ratio = preserves more information but less space saved. Typical compression ratios: 2-5x for summarization, 3-10x for sentence extraction.
Detailed Examples
Step-by-Step Examples
Example: RAG Generation Process
Query: "What is the capital of France?"
Step 1: Retrieval
- Retrieved context: "France is a country in Europe. Its capital city is Paris."
Step 2: Prompt Construction
- Prompt: "Context: France is a country in Europe. Its capital city is Paris. Question: What is the capital of France? Answer:"
Step 3: Generation
- LLM generates: "The capital of France is Paris."
- Answer is grounded in retrieved context
Example: Context Truncation
Scenario: Retrieved 5 documents, each 500 tokens. Context window: 4000 tokens. Query: 50 tokens.
Problem: 5 × 500 + 50 = 2,550 tokens of input. This fits within the 4,000-token window but leaves only about 1,450 tokens for the answer, and retrieving more or longer chunks would overflow.
Solution: Prioritize the most relevant documents and truncate or drop the rest if needed
Strategy: Include the top-3 documents (1,500 tokens) in full, adding lower-ranked documents only while the token budget allows
Implementation
Implementation Overview
This section provides practical Python code examples for implementing advanced RAG techniques, including query expansion, multi-query retrieval, and context compression, and then combines them into a complete pipeline. These techniques significantly improve RAG system performance beyond basic retrieval and generation.
1. Query Expansion Implementation
What this does: Expands user queries with synonyms and related terms to improve retrieval coverage. Helps find documents that use different terminology than the original query.
from typing import List

from sentence_transformers import SentenceTransformer


class QueryExpander:
    """
    Query expansion system that adds synonyms and related terms.
    Improves retrieval by finding documents that use different
    terminology than the original query.
    """

    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # In production, use a proper synonym dictionary or LLM-based expansion
        self.synonym_dict = {
            "machine learning": ["ML", "artificial intelligence", "AI", "deep learning"],
            "neural network": ["neural net", "NN", "deep learning model"],
            "python": ["Python programming", "Python language"],
            "tutorial": ["guide", "how-to", "instructions", "walkthrough"]
        }

    def expand_with_synonyms(self, query: str) -> str:
        """
        Expand query by appending synonyms for key phrases it contains.

        Args:
            query: Original query

        Returns:
            Expanded query with synonyms
        """
        expanded_terms = query.lower().split()
        # Append synonyms for each dictionary phrase found in the query
        for key, synonyms in self.synonym_dict.items():
            if key in query.lower():
                expanded_terms.extend(synonyms)
        # Remove duplicates (preserving order) and join
        expanded_query = " ".join(dict.fromkeys(expanded_terms))
        return expanded_query

    def expand_with_llm(self, query: str, num_expansions: int = 3) -> List[str]:
        """
        Use an LLM to generate query variations.

        Args:
            query: Original query
            num_expansions: Number of variations to generate

        Returns:
            List of expanded query variations
        """
        # In production, send this prompt to an LLM API (OpenAI or similar);
        # this is a simplified example.
        prompt = f"""Generate {num_expansions} different ways to ask this question:
"{query}"
Return only the variations, one per line:"""
        # For demonstration, return manual variations instead of calling an LLM
        variations = [
            query,  # Original
            f"Explain {query}",
            f"What is {query}?",
            f"Tell me about {query}"
        ]
        return variations[:num_expansions]


# Example usage
expander = QueryExpander()
query = "Python machine learning tutorial"
expanded = expander.expand_with_synonyms(query)
print(f"Original: {query}")
print(f"Expanded: {expanded}")
# Use expanded query for retrieval
# retriever.retrieve(expanded)
Key Points:
- Synonym expansion: Adds related terms to improve coverage
- LLM-based expansion: Can generate query variations automatically
- Use case: Particularly useful for short, ambiguous queries
2. Multi-Query Retrieval Implementation
What this does: Generates multiple query variations, retrieves documents for each, then combines and deduplicates results for better coverage.
from typing import List

from sentence_transformers import SentenceTransformer


class MultiQueryRetriever:
    """
    Multi-query retrieval that generates multiple query variations
    and combines results for better coverage.
    """

    def __init__(self, base_retriever):
        """
        Initialize multi-query retriever.

        Args:
            base_retriever: Underlying retriever (HybridRetriever, etc.)
        """
        self.base_retriever = base_retriever
        # Embedder is available for embedding-based reranking of the combined
        # results; the simplified example below sorts by retrieval score instead.
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def generate_query_variations(self, query: str, num_variations: int = 3) -> List[str]:
        """
        Generate query variations using simple heuristics.
        In production, use an LLM to generate the variations.

        Args:
            query: Original query
            num_variations: Number of variations to generate

        Returns:
            List of query variations
        """
        variations = [query]  # Include original
        # Add variations based on query structure
        if "?" in query:
            variations.append(query.replace("?", ""))
            variations.append(f"Explain {query.lower()}")
        else:
            variations.append(f"{query} explanation")
            variations.append(f"What is {query}?")
        return variations[:num_variations]

    def retrieve_multi_query(self, query: str, top_k_per_query: int = 10, final_k: int = 5) -> List[dict]:
        """
        Retrieve using multiple query variations and combine results.

        Args:
            query: Original query
            top_k_per_query: Number of results per query variation
            final_k: Final number of results after combining

        Returns:
            Combined and deduplicated results
        """
        # Step 1: Generate query variations
        query_variations = self.generate_query_variations(query, num_variations=3)
        print(f"Generated {len(query_variations)} query variations")

        # Step 2: Retrieve for each variation
        all_results = []
        seen_documents = set()
        for variation in query_variations:
            results = self.base_retriever.retrieve(variation, top_k=top_k_per_query)
            # Deduplicate and add to combined results
            for result in results:
                doc_text = result['document']
                if doc_text not in seen_documents:
                    all_results.append(result)
                    seen_documents.add(doc_text)

        # Step 3: Rerank combined results (optional but recommended)
        # For simplicity, sort by hybrid_score
        all_results.sort(key=lambda x: x.get('hybrid_score', 0), reverse=True)

        # Step 4: Return top-k
        return all_results[:final_k]


# Example usage
# multi_retriever = MultiQueryRetriever(hybrid_retriever)
# results = multi_retriever.retrieve_multi_query("machine learning", final_k=5)
3. Context Compression Implementation
What this does: Compresses retrieved context to fit within LLM context windows while preserving the most relevant information.
from typing import List


class ContextCompressor:
    """
    Compresses retrieved context to fit within context window limits
    while preserving the most relevant information.
    """

    def __init__(self, max_tokens: int = 2000):
        """
        Initialize context compressor.

        Args:
            max_tokens: Maximum tokens allowed in compressed context
        """
        self.max_tokens = max_tokens

    def compress_by_relevance(self, query: str, documents: List[str],
                              relevance_scores: List[float]) -> str:
        """
        Compress context by keeping only the most relevant parts.

        Args:
            query: User query
            documents: List of retrieved documents
            relevance_scores: Relevance scores for each document

        Returns:
            Compressed context string
        """
        # Sort documents by relevance
        doc_scores = list(zip(documents, relevance_scores))
        doc_scores.sort(key=lambda x: x[1], reverse=True)

        # Add documents until we hit token limit
        compressed_parts = []
        current_tokens = 0
        for doc, score in doc_scores:
            doc_tokens = len(doc.split())  # Approximate token count
            if current_tokens + doc_tokens <= self.max_tokens:
                compressed_parts.append(doc)
                current_tokens += doc_tokens
            else:
                # Truncate last document if needed
                remaining_tokens = self.max_tokens - current_tokens
                if remaining_tokens > 50:  # Only if meaningful space left
                    words = doc.split()[:remaining_tokens]
                    compressed_parts.append(" ".join(words) + "...")
                break

        return "\n\n".join(compressed_parts)

    def compress_by_summarization(self, documents: List[str]) -> str:
        """
        Compress by summarizing documents (simplified example).
        In production, use LLM-based summarization.
        """
        # Simplified: extract first sentence of each document
        summaries = []
        for doc in documents:
            sentences = doc.split('.')
            if sentences:
                summaries.append(sentences[0] + '.')
        return " ".join(summaries)


# Example usage
compressor = ContextCompressor(max_tokens=1000)
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep learning uses neural networks...",
    "Python is a programming language..."
]
scores = [0.95, 0.85, 0.60]
compressed = compressor.compress_by_relevance(
    "What is machine learning?",
    documents,
    scores
)
print(f"Compressed context ({len(compressed.split())} tokens):")
print(compressed)
4. Complete Advanced RAG Pipeline
What this does: Combines all advanced techniques into a complete RAG system with query expansion, multi-query retrieval, context compression, and generation.
class AdvancedRAG:
    """
    Complete advanced RAG system with query expansion, multi-query,
    context compression, and generation.
    """

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.query_expander = QueryExpander()
        self.multi_query = MultiQueryRetriever(retriever)
        self.compressor = ContextCompressor(max_tokens=2000)

    def query(self, user_query: str, use_expansion: bool = True,
              use_multi_query: bool = True, use_compression: bool = True):
        """
        Complete advanced RAG query pipeline.

        Args:
            user_query: User's question
            use_expansion: Whether to expand query
            use_multi_query: Whether to use multi-query retrieval
            use_compression: Whether to compress context
        """
        # Step 1: Query expansion (optional)
        if use_expansion:
            expanded_query = self.query_expander.expand_with_synonyms(user_query)
        else:
            expanded_query = user_query

        # Step 2: Multi-query retrieval (optional)
        if use_multi_query:
            retrieved = self.multi_query.retrieve_multi_query(
                expanded_query,
                top_k_per_query=10,
                final_k=10
            )
        else:
            retrieved = self.retriever.retrieve(expanded_query, top_k=10)
        documents = [r['document'] for r in retrieved]
        scores = [r.get('hybrid_score', 0.5) for r in retrieved]

        # Step 3: Context compression (optional)
        if use_compression:
            context = self.compressor.compress_by_relevance(
                user_query,
                documents,
                scores
            )
        else:
            context = "\n\n".join(documents)

        # Step 4: Generate answer
        prompt = f"""Context:
{context}
Question: {user_query}
Answer based on the context above:"""
        # In production, use actual LLM API
        answer = f"Generated answer for: {user_query}"  # Placeholder
        return answer


# Example usage
# advanced_rag = AdvancedRAG(hybrid_retriever, llm)
# answer = advanced_rag.query("What is machine learning?")
Installation Requirements
Install required packages:
pip install sentence-transformers langchain openai
Note: For production, implement proper LLM-based query expansion and summarization using OpenAI API or similar services.
Real-World Applications
Where This Is Used
Generation Strategies
Stuff: Put all context in single prompt. Simple but limited by context window.
Map-reduce: Generate answer for each document, then combine. Handles large contexts.
Refine: Iteratively refine answer with each document. Most accurate but slowest.
Map-rerank: Generate and score answers, return best. Good balance.
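As an illustration of how map-reduce differs from stuffing, here is a minimal sketch; `llm` is a hypothetical callable that takes a prompt string and returns a completion string.

```python
def map_reduce_answer(question, documents, llm):
    # Map: answer the question against each document independently
    partial_answers = [
        llm(f"Document:\n{doc}\n\nQuestion: {question}\nAnswer using only this document:")
        for doc in documents
    ]
    # Reduce: combine the per-document answers into one final answer
    combined = "\n".join(f"- {a}" for a in partial_answers)
    return llm(
        f"Partial answers:\n{combined}\n\nQuestion: {question}\n"
        "Write one consolidated answer:"
    )
```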
Answer Quality
Relevance: Answer addresses the query
Faithfulness: Answer is grounded in retrieved context (not hallucinated)
Completeness: Answer is comprehensive
Citation: Can cite source documents