Chapter 6: Advanced RAG Techniques
Improving Performance
Learning Objectives
- Understand the fundamentals of advanced RAG techniques
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Advanced RAG Techniques
Beyond Basic RAG: Optimizing for Real-World Performance
Basic RAG (retrieve top-k documents, pass to LLM, generate answer) works well for simple queries, but real-world applications face complex challenges: ambiguous queries, context window limits, multi-part questions, and the need for higher accuracy. Advanced RAG techniques address these challenges to significantly improve system performance.
Common problems with basic RAG:
- ❌ Query ambiguity: Short queries like "Python ML" are unclear - does the user want Python machine learning libraries, Python ML algorithms, or something else?
- ❌ Context overflow: Retrieved chunks might exceed LLM context window limits, forcing truncation and loss of information
- ❌ Incomplete answers: Complex questions require information from multiple documents, but basic RAG retrieves once and generates once
- ❌ Precision vs context trade-off: Small chunks are precise but lack context; large chunks have context but are less precise
- ❌ Single retrieval limitation: One retrieval pass might miss relevant documents that use different terminology
Advanced Techniques That Solve These Problems
- Query Expansion: Automatically expand or rewrite queries to include synonyms, related terms, and alternative phrasings before retrieval
- Multi-Query Retrieval: Generate multiple query variations, retrieve for each, and combine results for comprehensive coverage
- Parent-Child Chunking: Store small chunks for precise retrieval, but include parent document context when generating answers
- Context Compression: Summarize or extract only relevant parts of retrieved chunks to fit within context windows
- Iterative Retrieval: If initial answer is incomplete, generate follow-up queries and retrieve more context (multi-hop retrieval)
- Self-RAG: Autonomous system where the model decides when to retrieve, when to generate, and when to stop
Example: Multi-Query Retrieval in Action
Original query: "How to train a neural network?"
Generated variations:
- "neural network training process"
- "how to train deep learning models"
- "backpropagation and gradient descent for neural networks"
Result: Each variation retrieves slightly different documents. Combined, you get comprehensive coverage of the topic - you don't miss relevant information that uses different terminology.
Key Concepts You'll Learn
- Query Expansion: Techniques to improve retrieval by expanding queries with synonyms and related terms
- Multi-Query Retrieval: Generating multiple query variations and combining results for better coverage
- Parent-Child Chunking: Hierarchical chunking strategy that balances retrieval precision with generation context
- Context Compression: Methods to reduce retrieved context size (summarization, sentence extraction) while preserving important information
- Iterative Retrieval: Multi-hop retrieval where follow-up queries retrieve additional context if initial answer is incomplete
- Self-RAG: Advanced autonomous RAG where the model controls retrieval and generation decisions
- Metadata Filtering: Combining vector search with traditional filters to reduce search space and improve precision
Why this matters: These advanced techniques can substantially improve RAG system accuracy compared to basic RAG, with reported gains often in the 20-40% range depending on the dataset and baseline. They're essential for production systems where accuracy, completeness, and user experience are critical. Understanding when and how to apply these techniques is what separates basic RAG implementations from production-grade systems.
Key Concepts
Advanced RAG Techniques: Beyond Basic Retrieval
Basic RAG (retrieve top-k documents, pass to LLM) works well for simple queries, but real-world applications often need more sophisticated techniques to handle complex queries, improve accuracy, and optimize performance.
1. Query Expansion: Improving Retrieval Coverage
What it is: Query expansion rewrites or augments the user's query to include synonyms, related terms, or alternative phrasings before retrieval. This helps find documents that use different terminology than the original query.
Why it's needed: Users often write short, ambiguous queries. A query like "Python ML" could mean many things, and documents might use different terminology ("machine learning," "deep learning," "scikit-learn," etc.).
How it works:
- Synonym expansion: Add synonyms for key terms (e.g., "ML" → "machine learning," "deep learning")
- Related term expansion: Add related concepts (e.g., "Python ML" → "scikit-learn," "TensorFlow," "PyTorch")
- LLM-based expansion: Use an LLM to generate query variations or rewrite the query more clearly
- Retrieve with expanded query: Use the expanded query for retrieval (or retrieve for each variation and combine)
Example: Query Expansion
Original query: "Python ML"
Expanded query: "Python machine learning libraries scikit-learn TensorFlow PyTorch data science"
Result: ✅ Retrieves documents that mention "scikit-learn" or "TensorFlow" even if they don't contain "Python ML" as a phrase.
When to use: When queries are short, ambiguous, or use abbreviations. Particularly useful for technical domains with lots of synonyms and jargon.
2. Multi-Query Retrieval: Generating Multiple Query Variations
What it is: Instead of retrieving with a single query, generate multiple query variations, retrieve documents for each variation, then combine and deduplicate the results.
Why it works: Different phrasings of the same question might retrieve different relevant documents. By generating multiple variations, you increase the chance of finding all relevant information.
How it works:
- Generate query variations: Use an LLM to generate 3-5 different phrasings of the original query
- Retrieve for each: Run retrieval for each query variation
- Combine results: Merge all retrieved documents, removing duplicates
- Rerank: Rerank the combined results to get the most relevant documents
Example: Multi-Query Retrieval
Original query: "How to train a neural network?"
Generated variations:
- "neural network training process"
- "how to train deep learning models"
- "backpropagation and gradient descent for neural networks"
Result: ✅ Each variation might retrieve slightly different documents, giving you comprehensive coverage of the topic.
When to use: For complex queries where a single phrasing might miss relevant documents. Particularly effective when combined with query expansion.
3. Parent-Child Chunking: Balancing Precision and Context
What it is: A hierarchical chunking strategy where you store small, precise chunks for retrieval (children), but also store larger parent chunks that contain surrounding context.
The problem it solves: Small chunks are better for precise retrieval (more likely to be fully relevant), but they lack context. Large chunks have more context but are less precise (may contain irrelevant information).
How it works:
- Create parent chunks: Split documents into larger chunks (e.g., 1000 tokens) that preserve context
- Create child chunks: Split each parent into smaller chunks (e.g., 200 tokens) for precise retrieval
- Store both: Embed and store both parent and child chunks, with metadata linking children to parents
- Retrieve children: Use small child chunks for retrieval (precise matching)
- Include parent context: When a child is retrieved, also include its parent chunk in the context sent to the LLM
Example: Parent-Child Chunking
Parent chunk (1000 tokens): "Machine learning models require careful tuning. Hyperparameters like learning rate significantly impact performance. Regularization techniques help prevent overfitting..."
Child chunk 1 (200 tokens): "Hyperparameters like learning rate significantly impact performance."
Query: "What is learning rate?"
Result: ✅ Child chunk 1 is retrieved (precise match), but parent chunk is also included, providing context about hyperparameters and regularization.
When to use: When you need both precise retrieval (small chunks) and rich context (large chunks). Common in production RAG systems.
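To make the parent-child linkage concrete, here is a minimal sketch. It uses word counts instead of real token counts, and the `build_parent_child_chunks` / `expand_to_parent` helpers are illustrative names rather than a library API; a production system would embed the child chunks and store the `parent_id` as metadata in a vector store.

```python
def build_parent_child_chunks(text, parent_size=200, child_size=50):
    """Split text into parent chunks, then split each parent into child chunks."""
    words = text.split()
    parents, children = [], []
    for p_idx, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        parents.append({"parent_id": p_idx, "text": " ".join(parent_words)})
        for c_start in range(0, len(parent_words), child_size):
            child_words = parent_words[c_start:c_start + child_size]
            children.append({"parent_id": p_idx, "text": " ".join(child_words)})
    return parents, children

def expand_to_parent(child, parents):
    """Given a retrieved child chunk, return its parent's text for generation."""
    return parents[child["parent_id"]]["text"]

# Usage: run retrieval over `children` (precise matches), then call
# expand_to_parent on each hit so the LLM sees the surrounding context.
```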
4. Metadata Filtering: Reducing Search Space
What it is: Apply filters based on document metadata (date, author, category, source) before performing vector similarity search. This reduces the number of documents to search.
Why it's important: Searching 1 million documents is slow. If you can filter to 50,000 relevant documents first (e.g., "only documents from 2023"), search becomes much faster and more accurate.
How it works:
- Extract metadata: When indexing, extract and store metadata (date, category, author, etc.)
- Apply filters: Before vector search, filter documents by metadata criteria
- Search filtered set: Perform vector similarity search only on filtered documents
- Return results: Return top-k documents from the filtered search
When to use: When documents have meaningful metadata and queries can be scoped (e.g., "recent articles," "technical documentation," "from specific author").
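A minimal filter-then-search sketch, assuming sentence-transformers for embeddings. The document list and metadata fields (`year`, `category`) are made up for illustration; a real system would push the filter down into the vector database instead of filtering in Python.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    {"text": "Transformer models for NLP", "year": 2023, "category": "technical"},
    {"text": "A history of punch cards", "year": 1998, "category": "history"},
    {"text": "Fine-tuning LLMs in practice", "year": 2023, "category": "technical"},
]

def filtered_search(query, docs, top_k=2, **filters):
    # Step 1: apply metadata filters (e.g., year=2023, category="technical")
    candidates = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    if not candidates:
        return []
    # Step 2: vector similarity search only over the filtered candidates
    doc_emb = embedder.encode([d["text"] for d in candidates], convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

for doc, score in filtered_search("LLM training", docs, top_k=2,
                                  year=2023, category="technical"):
    print(f"{score:.3f}  {doc['text']}")
```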
Context Compression: Fitting Retrieved Information into LLM Windows
The Problem
After retrieving top-k documents, you might have 10,000+ tokens of context. But LLMs have context window limits (e.g., GPT-4: 8K-128K tokens, Claude: 100K-200K tokens). If your query + context + answer exceeds the limit, you need to compress the context.
Context Compression Techniques
1. Summarization
What it is: Use an LLM to summarize each retrieved chunk, reducing token count while preserving key information.
How it works:
- For each retrieved chunk, create a summary prompt: "Summarize this text in 2-3 sentences: [chunk text]"
- Generate summaries (much shorter than originals)
- Use summaries as context instead of full chunks
Trade-offs: ✅ Reduces tokens significantly. ❌ May lose important details. ⚠️ Adds latency (need to summarize each chunk).
2. Sentence Extraction
What it is: Extract only the most relevant sentences from retrieved chunks, discarding the rest.
How it works:
- Score each sentence in retrieved chunks for relevance to the query (using embeddings or LLM)
- Select top-N most relevant sentences
- Use only these sentences as context
Trade-offs: ✅ Very fast. ✅ Preserves exact information (no summarization loss). ❌ May lose context between sentences.
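A small sketch of sentence extraction using embedding similarity (sentence-transformers assumed); the period-based sentence splitter is deliberately naive and would be replaced by a proper sentence tokenizer in practice.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def extract_relevant_sentences(query, chunks, top_n=3):
    """Keep only the sentences most similar to the query, in original order."""
    sentences = [s.strip() for chunk in chunks for s in chunk.split('.') if s.strip()]
    sent_emb = embedder.encode(sentences, convert_to_tensor=True)
    query_emb = embedder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, sent_emb)[0]
    top_idx = scores.argsort(descending=True)[:top_n]
    # Re-sort the selected indices so the extracted sentences stay readable
    return ". ".join(sentences[i] for i in sorted(top_idx.tolist())) + "."
```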
3. LLM-Based Compression
What it is: Use an LLM to compress the entire retrieved context into a shorter version that preserves information relevant to the query.
How it works:
- Prompt: "Compress this context to [target tokens] while preserving all information relevant to: [query]"
- LLM generates compressed version
- Use compressed context for final answer generation
Trade-offs: ✅ Most intelligent compression. ✅ Preserves query-relevant information. ❌ Expensive and slow. ❌ May introduce hallucinations.
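A sketch of the prompt-based approach, assuming the OpenAI Python client (v1+); the model name and token target are placeholders, and any chat-capable LLM could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compress_with_llm(query, context, target_tokens=500):
    # Ask the LLM to compress the context while keeping query-relevant facts
    prompt = (
        f"Compress the following context to roughly {target_tokens} tokens while "
        f"preserving all information relevant to this question: {query}\n\n"
        f"Context:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```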
4. Relevance-Based Prioritization
What it is: Instead of compressing, prioritize chunks by relevance score and include only the most relevant ones until you hit the token limit.
How it works:
- Sort retrieved chunks by similarity score (most relevant first)
- Add chunks to context one by one until you reach token limit
- Discard remaining chunks
Trade-offs: ✅ Fast and simple. ✅ No information loss (just truncation). ❌ May miss important information in lower-ranked chunks.
Best Practices
- Combine techniques: Use summarization for less relevant chunks, full text for most relevant
- Reserve tokens for answer: Leave 20-30% of context window for the LLM's answer
- Monitor compression quality: Track answer quality with and without compression
Iterative Retrieval: Multi-Hop and Self-RAG
1. Multi-Hop Retrieval
What it is: If the initial answer is incomplete or the model needs more information, generate follow-up queries, retrieve additional context, and repeat until the answer is complete.
Why it's needed: Complex questions often require information from multiple documents. A single retrieval might not find all necessary information.
How it works:
- Initial retrieval: Retrieve top-k documents for the original query
- Generate answer: Attempt to generate answer from retrieved context
- Check completeness: Determine if answer is complete (using LLM or heuristics)
- Generate follow-up query: If incomplete, generate a new query to find missing information
- Retrieve again: Retrieve documents for the follow-up query
- Combine context: Add new context to existing context
- Generate final answer: Generate answer from combined context
- Repeat if needed: Continue until answer is complete or max iterations reached
Example: Multi-Hop Retrieval
Query: "What is the capital of France and what is its population?"
Hop 1: Query "capital of France" → Retrieves document about Paris being the capital
Hop 2: Query "Paris population" → Retrieves document with population statistics
Result: ✅ Combines information from both hops to answer the complete question.
When to use: For complex, multi-part questions that require information from multiple sources. Common in question-answering systems.
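A sketch of the multi-hop loop described above. Here `retriever` and `llm` are hypothetical stand-ins (a retriever with a `retrieve(query, top_k)` method returning dicts with a `document` key, and an `llm(prompt)` callable returning a string); asking the LLM to judge completeness is one common option, with cheaper heuristics as an alternative.

```python
def multi_hop_answer(question, retriever, llm, max_hops=3, top_k=5):
    context_chunks = []
    query = question
    answer = ""
    for hop in range(max_hops):
        # Retrieve for the current query and accumulate context across hops
        context_chunks += retriever.retrieve(query, top_k=top_k)
        context = "\n\n".join(c["document"] for c in context_chunks)
        answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        # Completeness check: the LLM either declares COMPLETE or proposes
        # a follow-up query for the next hop
        verdict = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer complete? Reply COMPLETE or write one follow-up search query."
        )
        if verdict.strip().upper().startswith("COMPLETE"):
            return answer
        query = verdict  # use the follow-up query for the next hop
    return answer  # best effort after max_hops
```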
2. Self-RAG: Autonomous Retrieval and Generation
What it is: An advanced technique where the model itself decides when to retrieve, what to retrieve, when to generate, and when to stop—making the RAG system more autonomous.
How it works:
- Retrieve decision: Model decides if retrieval is needed (some queries can be answered from its training data)
- Query generation: If retrieval needed, model generates the retrieval query
- Retrieval: Retrieve documents using generated query
- Relevance check: Model evaluates if retrieved documents are relevant
- Generate or retrieve again: If relevant, generate answer; if not, retrieve again with different query
- Self-critique: Model evaluates its own answer for completeness and accuracy
- Iterate or stop: If answer is good, stop; if not, continue retrieving and generating
Advantages:
- ✅ More autonomous: Doesn't always retrieve (saves cost when not needed)
- ✅ Adaptive: Adjusts retrieval strategy based on query complexity
- ✅ Self-correcting: Can identify when answers are incomplete and retrieve more
Challenges:
- ❌ Complex to implement: Requires fine-tuning or prompt engineering
- ❌ Higher latency: Multiple decision points add time
- ❌ Cost: More LLM calls (retrieve decisions, query generation, self-critique)
When to use: For high-value applications where accuracy is critical and you can afford the complexity and cost. Still experimental but promising.
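A heavily simplified control-loop sketch of these decision points. Real Self-RAG fine-tunes the model to emit special reflection tokens; this version approximates each decision with a plain prompt, and `retriever` and `llm` are the same hypothetical stand-ins as in the multi-hop sketch.

```python
def self_rag(question, retriever, llm, max_rounds=3):
    answer = ""
    for _ in range(max_rounds):
        # 1. Decide whether retrieval is needed at all
        need = llm(f"Question: {question}\nDo you need external documents? YES or NO.")
        context = ""
        if need.strip().upper().startswith("YES"):
            # 2. Generate a retrieval query and fetch documents
            search_query = llm(f"Write a search query to answer: {question}")
            docs = retriever.retrieve(search_query, top_k=5)
            context = "\n\n".join(d["document"] for d in docs)
        # 3. Generate, then self-critique the answer
        answer = llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer complete and grounded in the context? YES or NO."
        )
        if critique.strip().upper().startswith("YES"):
            return answer
    return answer  # best effort after max_rounds
```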
Mathematical Formulations
Advanced RAG Mathematical Models
Advanced RAG techniques involve mathematical models for generation, context management, and quality evaluation. Understanding these formulas helps you optimize context usage, implement iterative retrieval, and measure answer quality.
1. RAG Generation Probability
\[
P(\text{answer} | \text{query}, \text{context}) = \prod_{i=1}^{n} P(\text{token}_i | \text{query}, \text{context}, \text{tokens}_{<i})
\]
What This Formula Represents:
This is the fundamental probability model for RAG generation. The LLM generates the answer token by token, where each token's probability depends on the query, the retrieved context, and all previously generated tokens.
Breaking It Down:
- \(P(\text{answer} | \text{query}, \text{context})\): Probability of generating the entire answer given the query and retrieved context
- \(\prod_{i=1}^{n}\): Product over all \(n\) tokens in the answer (multiplication of probabilities)
- \(P(\text{token}_i | \text{query}, \text{context}, \text{tokens}_{<i})\): Probability of generating token \(i\) given:
- The query (user's question)
- The context (retrieved documents)
- All previous tokens (\(\text{tokens}_{<i}\))
Why It's a Product:
To generate the full answer, the model must generate token 1, THEN token 2 given token 1, THEN token 3 given tokens 1 and 2, and so on. The probability of the entire sequence is the product of these conditional probabilities (chain rule of probability).
Key Insight:
The formula shows that the answer depends on BOTH the query AND the context. This is what makes RAG different from standard LLM generation - the context from retrieved documents directly influences each token's probability.
Example:
Query: "What is the capital of France?"
Context: "France is a country in Europe. Its capital city is Paris."
The model generates:
- Token 1 ("The"): \(P(\text{"The"} | \text{query}, \text{context})\) - high probability because context mentions "capital"
- Token 2 ("capital"): \(P(\text{"capital"} | \text{query}, \text{context}, \text{"The"})\) - high probability, matches context
- Token 3 ("of"): \(P(\text{"of"} | \text{query}, \text{context}, \text{"The capital"})\) - high probability, grammatical continuation
- Token 4 ("France"): \(P(\text{"France"} | \text{query}, \text{context}, \text{"The capital of"})\) - high probability, matches query
- Token 5 ("is"): \(P(\text{"is"} | \text{query}, \text{context}, \text{"The capital of France"})\) - high probability
- Token 6 ("Paris"): \(P(\text{"Paris"} | \text{query}, \text{context}, \text{"The capital of France is"})\) - very high probability, directly from context!
The final answer "The capital of France is Paris" has high probability because each token is likely given the context.
2. Context Window Constraint
\[
|\text{query}| + |\text{context}| + |\text{answer}| \leq \text{max\_context\_length}
\]
What This Constraint Enforces:
LLMs have fixed context window limits. The total number of tokens (query + retrieved context + generated answer) must fit within this limit. This constraint determines how much context you can include and how long answers can be.
Breaking It Down:
- \(|\text{query}|\): Number of tokens in the user's query (typically 10-50 tokens)
- \(|\text{context}|\): Number of tokens in retrieved documents (can be 500-4000+ tokens depending on top-k and chunk size)
- \(|\text{answer}|\): Number of tokens in the generated answer (variable, depends on query complexity)
- \(\text{max\_context\_length}\): Maximum tokens the LLM can process (e.g., GPT-4: 8K-128K, Claude: 100K-200K, GPT-3.5: 4K-16K)
Practical Implications:
If your context window is 4,000 tokens and your query is 50 tokens, you have ~3,950 tokens available for context + answer. If you retrieve 3,000 tokens of context, you only have ~950 tokens left for the answer. This is why context compression is important!
Example:
Scenario: GPT-3.5-turbo with 4,096 token context window
- Query: 30 tokens
- Retrieved context: 3,500 tokens (5 chunks × 700 tokens each)
- Available for answer: 4,096 - 30 - 3,500 = 566 tokens
- ✅ Fits, but close to limit
Problem scenario:
- Query: 30 tokens
- Retrieved context: 4,200 tokens (too much!)
- Available for answer: 4,096 - 30 - 4,200 = -134 tokens ❌
- Solution: Compress context to 3,000 tokens, leaving 1,066 tokens for answer
Strategies to Handle This:
- Context compression: Summarize or extract relevant parts to reduce \(|\text{context}|\)
- Prioritize chunks: Include only top-k most relevant chunks
- Reserve tokens: Leave 20-30% of context window for answer generation
- Use larger context windows: GPT-4 (128K) or Claude (200K) for very long contexts
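As a quick way to enforce this constraint, a budget check like the one below can run before prompt construction. It assumes the tiktoken tokenizer (`pip install tiktoken`); the 4,096-token window and 30% answer reserve are example values.

```python
import tiktoken

def fits_in_window(query, context_chunks, max_context_length=4096, answer_reserve=0.3):
    """Check that query + context fit in the window, leaving room for the answer."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = len(enc.encode(query)) + sum(len(enc.encode(c)) for c in context_chunks)
    budget = int(max_context_length * (1 - answer_reserve))  # reserve tokens for answer
    return used <= budget, used, budget

ok, used, budget = fits_in_window("What is RAG?", ["chunk one ...", "chunk two ..."])
print(f"used={used} tokens, budget={budget} tokens, fits={ok}")
```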
3. Answer Quality Score
\[
\text{quality} = \alpha \cdot \text{relevance} + \beta \cdot \text{faithfulness} + \gamma \cdot \text{completeness}, \qquad \alpha + \beta + \gamma = 1
\]
What This Formula Measures:
Answer quality in RAG systems is multi-dimensional. This formula combines three critical dimensions into a single quality score, allowing you to evaluate and optimize RAG system performance.
Breaking It Down:
- \(\text{relevance}\): How well the answer addresses the query. Measured by semantic similarity between query and answer embeddings, or by human/LM evaluation. Range: [0, 1]
- \(\text{faithfulness}\): How well the answer is grounded in the retrieved context (not hallucinated). Measured as fraction of answer claims supported by context. Range: [0, 1]
- \(\text{completeness}\): How complete the answer is - does it fully address all parts of the query? Measured by coverage of query aspects. Range: [0, 1]
- \(\alpha, \beta, \gamma\): Weighting factors that sum to 1.0, controlling the relative importance of each dimension
Typical Weightings:
- Balanced: \(\alpha = 0.4, \beta = 0.4, \gamma = 0.2\) - Equal emphasis on relevance and faithfulness, less on completeness
- Faithfulness-focused: \(\alpha = 0.3, \beta = 0.5, \gamma = 0.2\) - Emphasizes avoiding hallucinations (critical for factual domains)
- Completeness-focused: \(\alpha = 0.3, \beta = 0.3, \gamma = 0.4\) - Emphasizes comprehensive answers (good for complex queries)
Example:
Query: "What are the main advantages of RAG?"
Answer 1: "RAG is good."
Relevance: 0.6 (somewhat relevant but vague)
Faithfulness: 1.0 (supported by context)
Completeness: 0.2 (very incomplete)
Quality (α=0.4, β=0.4, γ=0.2): \(0.4 \times 0.6 + 0.4 \times 1.0 + 0.2 \times 0.2 = 0.68\)
Answer 2: "RAG has several advantages: no training required, easy to update knowledge, can cite sources, works with any LLM."
Relevance: 0.95 (highly relevant)
Faithfulness: 0.9 (mostly supported, minor elaboration)
Completeness: 0.9 (covers main advantages)
Quality (α=0.4, β=0.4, γ=0.2): \(0.4 \times 0.95 + 0.4 \times 0.9 + 0.2 \times 0.9 = 0.92\)
✅ Answer 2 scores much higher due to better relevance and completeness.
Using This for Optimization:
Track quality scores over time. If faithfulness drops, you might need better retrieval or context. If relevance drops, you might need better query understanding. If completeness drops, you might need to retrieve more documents or use iterative retrieval.
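The weighted score is straightforward to compute once you have the component scores. In this sketch the component values are the illustrative numbers from the example above; in practice they come from an evaluator such as a judge LLM or an evaluation framework.

```python
def answer_quality(relevance, faithfulness, completeness,
                   alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted combination of the three quality dimensions (weights sum to 1)."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9, "weights must sum to 1"
    return alpha * relevance + beta * faithfulness + gamma * completeness

print(answer_quality(0.60, 1.0, 0.2))   # ≈ 0.68 (the vague answer)
print(answer_quality(0.95, 0.9, 0.9))   # ≈ 0.92 (the complete answer)
```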
4. Multi-Query Retrieval Coverage
\[
\text{coverage} = \frac{\left|\left(\bigcup_{i=1}^{m} \text{Retrieve}(q_i, D)\right) \cap D_{\text{relevant}}\right|}{|D_{\text{relevant}}|}
\]
What This Measures:
Multi-query retrieval generates \(m\) query variations and retrieves documents for each. This formula measures what fraction of all relevant documents were found by at least one query variation.
Breaking It Down:
- \(q_1, q_2, \ldots, q_m\): \(m\) different query variations (e.g., 3-5 variations)
- \(\text{Retrieve}(q_i, D)\): Documents retrieved for query variation \(q_i\)
- \(\bigcup_{i=1}^{m}\): Union of all retrieved document sets (removes duplicates)
- \(|D_{\text{relevant}}|\): Total number of relevant documents in the knowledge base
- Coverage: Fraction of relevant documents found by at least one query variation
Why Multi-Query Improves Coverage:
Different query phrasings might retrieve different relevant documents. By generating multiple variations and taking the union, you increase the chance of finding all relevant documents.
Example:
Original query: "How to train a neural network?"
Query variations:
- \(q_1\): "neural network training process" → retrieves docs: {A, B, C}
- \(q_2\): "how to train deep learning models" → retrieves docs: {B, D, E}
- \(q_3\): "backpropagation and gradient descent" → retrieves docs: {C, F, G}
Union: {A, B, C, D, E, F, G} (7 unique documents)
If single query \(q_1\) only retrieved {A, B, C}, multi-query found 4 additional relevant documents (D, E, F, G).
Coverage improvement: From 3/7 = 43% to 7/7 = 100% of relevant documents found.
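The same calculation in code, using plain Python sets and the document IDs from the example:

```python
# Coverage computation for the example above
retrieved_per_query = [{"A", "B", "C"}, {"B", "D", "E"}, {"C", "F", "G"}]
relevant = {"A", "B", "C", "D", "E", "F", "G"}

union = set().union(*retrieved_per_query)
coverage = len(union & relevant) / len(relevant)
single_coverage = len(retrieved_per_query[0] & relevant) / len(relevant)
print(f"multi-query coverage: {coverage:.0%}")    # 100%
print(f"single-query coverage: {single_coverage:.0%}")  # 43%
```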
5. Context Compression Ratio
\[
\text{compression ratio} = \frac{|\text{original\_context}|}{|\text{compressed\_context}|}
\]
What This Measures:
When retrieved context exceeds the LLM's context window, you need to compress it. This formula measures the compression achieved (how much smaller the compressed context is compared to the original).
Breaking It Down:
- \(|\text{original\_context}|\): Token count of original retrieved context
- \(|\text{compressed\_context}|\): Token count after compression (summarization, extraction, etc.)
- Compression ratio: How many times smaller the compressed version is
Example:
Original context: 5,000 tokens (5 retrieved chunks × 1,000 tokens each)
Compressed context: 2,000 tokens (summarized chunks)
Compression ratio: \(\frac{5000}{2000} = 2.5\)
This means the compressed context is 2.5 times smaller, allowing you to fit more chunks or leave more room for the answer.
Trade-off:
Higher compression ratio = more space saved but risk of losing important details. Lower compression ratio = preserves more information but less space saved. Typical compression ratios: 2-5x for summarization, 3-10x for sentence extraction.
Detailed Examples
Step-by-Step Examples
Example: RAG Generation Process
Query: "What is the capital of France?"
Step 1: Retrieval
- Retrieved context: "France is a country in Europe. Its capital city is Paris."
Step 2: Prompt Construction
- Prompt: "Context: France is a country in Europe. Its capital city is Paris. Question: What is the capital of France? Answer:"
Step 3: Generation
- LLM generates: "The capital of France is Paris."
- Answer is grounded in retrieved context
Example: Context Truncation
Scenario: Retrieved 5 documents, each 500 tokens. Context window: 4000 tokens. Query: 50 tokens.
Problem: 5 × 500 + 50 = 2,550 tokens of input. This fits within the 4,000-token window but leaves only about 1,450 tokens for the answer, and retrieving more or longer chunks would overflow.
Solution: Prioritize the most relevant documents and truncate or drop the rest if needed
Strategy: Include the top-3 documents (1,500 tokens) in full, adding lower-ranked documents only while the token budget allows
Implementation
Implementation Overview
This section provides practical Python code examples for implementing advanced RAG techniques, including query expansion, multi-query retrieval, and context compression, and then combines them into a complete pipeline. These techniques significantly improve RAG system performance beyond basic retrieval and generation.
1. Query Expansion Implementation
What this does: Expands user queries with synonyms and related terms to improve retrieval coverage. Helps find documents that use different terminology than the original query.
from typing import List

from sentence_transformers import SentenceTransformer


class QueryExpander:
    """
    Query expansion system that adds synonyms and related terms.
    Improves retrieval by finding documents that use different
    terminology than the original query.
    """

    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # In production, use a proper synonym dictionary or LLM-based expansion
        self.synonym_dict = {
            "machine learning": ["ML", "artificial intelligence", "AI", "deep learning"],
            "neural network": ["neural net", "NN", "deep learning model"],
            "python": ["Python programming", "Python language"],
            "tutorial": ["guide", "how-to", "instructions", "walkthrough"]
        }

    def expand_with_synonyms(self, query: str) -> str:
        """
        Expand query by appending synonyms for key phrases it contains.

        Args:
            query: Original query

        Returns:
            Expanded query with synonyms
        """
        expanded_terms = query.lower().split()
        # Append synonyms for each dictionary phrase found in the query
        for key, synonyms in self.synonym_dict.items():
            if key in query.lower():
                expanded_terms.extend(synonyms)
        # Remove duplicates (preserving order) and join
        expanded_query = " ".join(dict.fromkeys(expanded_terms))
        return expanded_query

    def expand_with_llm(self, query: str, num_expansions: int = 3) -> List[str]:
        """
        Use an LLM to generate query variations.

        Args:
            query: Original query
            num_expansions: Number of variations to generate

        Returns:
            List of expanded query variations
        """
        # In production, send this prompt to an LLM API (OpenAI or similar);
        # this is a simplified example.
        prompt = f"""Generate {num_expansions} different ways to ask this question:
"{query}"
Return only the variations, one per line:"""
        # For demonstration, return manual variations instead of calling an LLM
        variations = [
            query,  # Original
            f"Explain {query}",
            f"What is {query}?",
            f"Tell me about {query}"
        ]
        return variations[:num_expansions]


# Example usage
expander = QueryExpander()
query = "Python machine learning tutorial"
expanded = expander.expand_with_synonyms(query)
print(f"Original: {query}")
print(f"Expanded: {expanded}")
# Use expanded query for retrieval
# retriever.retrieve(expanded)
Key Points:
- Synonym expansion: Adds related terms to improve coverage
- LLM-based expansion: Can generate query variations automatically
- Use case: Particularly useful for short, ambiguous queries
2. Multi-Query Retrieval Implementation
What this does: Generates multiple query variations, retrieves documents for each, then combines and deduplicates results for better coverage.
from typing import List

from sentence_transformers import SentenceTransformer


class MultiQueryRetriever:
    """
    Multi-query retrieval that generates multiple query variations
    and combines results for better coverage.
    """

    def __init__(self, base_retriever):
        """
        Initialize multi-query retriever.

        Args:
            base_retriever: Underlying retriever (HybridRetriever, etc.)
        """
        self.base_retriever = base_retriever
        # Embedder is available for embedding-based reranking of the combined
        # results; the simplified example below sorts by retrieval score instead.
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def generate_query_variations(self, query: str, num_variations: int = 3) -> List[str]:
        """
        Generate query variations using simple heuristics.
        In production, use an LLM to generate the variations.

        Args:
            query: Original query
            num_variations: Number of variations to generate

        Returns:
            List of query variations
        """
        variations = [query]  # Include original
        # Add variations based on query structure
        if "?" in query:
            variations.append(query.replace("?", ""))
            variations.append(f"Explain {query.lower()}")
        else:
            variations.append(f"{query} explanation")
            variations.append(f"What is {query}?")
        return variations[:num_variations]

    def retrieve_multi_query(self, query: str, top_k_per_query: int = 10, final_k: int = 5) -> List[dict]:
        """
        Retrieve using multiple query variations and combine results.

        Args:
            query: Original query
            top_k_per_query: Number of results per query variation
            final_k: Final number of results after combining

        Returns:
            Combined and deduplicated results
        """
        # Step 1: Generate query variations
        query_variations = self.generate_query_variations(query, num_variations=3)
        print(f"Generated {len(query_variations)} query variations")

        # Step 2: Retrieve for each variation
        all_results = []
        seen_documents = set()
        for variation in query_variations:
            results = self.base_retriever.retrieve(variation, top_k=top_k_per_query)
            # Deduplicate and add to combined results
            for result in results:
                doc_text = result['document']
                if doc_text not in seen_documents:
                    all_results.append(result)
                    seen_documents.add(doc_text)

        # Step 3: Rerank combined results (optional but recommended)
        # For simplicity, sort by hybrid_score
        all_results.sort(key=lambda x: x.get('hybrid_score', 0), reverse=True)

        # Step 4: Return top-k
        return all_results[:final_k]


# Example usage
# multi_retriever = MultiQueryRetriever(hybrid_retriever)
# results = multi_retriever.retrieve_multi_query("machine learning", final_k=5)
3. Context Compression Implementation
What this does: Compresses retrieved context to fit within LLM context windows while preserving the most relevant information.
from typing import List


class ContextCompressor:
    """
    Compresses retrieved context to fit within context window limits
    while preserving the most relevant information.
    """

    def __init__(self, max_tokens: int = 2000):
        """
        Initialize context compressor.

        Args:
            max_tokens: Maximum tokens allowed in compressed context
        """
        self.max_tokens = max_tokens

    def compress_by_relevance(self, query: str, documents: List[str],
                              relevance_scores: List[float]) -> str:
        """
        Compress context by keeping only the most relevant parts.

        Args:
            query: User query
            documents: List of retrieved documents
            relevance_scores: Relevance scores for each document

        Returns:
            Compressed context string
        """
        # Sort documents by relevance
        doc_scores = list(zip(documents, relevance_scores))
        doc_scores.sort(key=lambda x: x[1], reverse=True)

        # Add documents until we hit token limit
        compressed_parts = []
        current_tokens = 0
        for doc, score in doc_scores:
            doc_tokens = len(doc.split())  # Approximate token count
            if current_tokens + doc_tokens <= self.max_tokens:
                compressed_parts.append(doc)
                current_tokens += doc_tokens
            else:
                # Truncate last document if needed
                remaining_tokens = self.max_tokens - current_tokens
                if remaining_tokens > 50:  # Only if meaningful space left
                    words = doc.split()[:remaining_tokens]
                    compressed_parts.append(" ".join(words) + "...")
                break

        return "\n\n".join(compressed_parts)

    def compress_by_summarization(self, documents: List[str]) -> str:
        """
        Compress by summarizing documents (simplified example).
        In production, use LLM-based summarization.
        """
        # Simplified: extract first sentence of each document
        summaries = []
        for doc in documents:
            sentences = doc.split('.')
            if sentences:
                summaries.append(sentences[0] + '.')
        return " ".join(summaries)


# Example usage
compressor = ContextCompressor(max_tokens=1000)
documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep learning uses neural networks...",
    "Python is a programming language..."
]
scores = [0.95, 0.85, 0.60]
compressed = compressor.compress_by_relevance(
    "What is machine learning?",
    documents,
    scores
)
print(f"Compressed context ({len(compressed.split())} tokens):")
print(compressed)
4. Complete Advanced RAG Pipeline
What this does: Combines all advanced techniques into a complete RAG system with query expansion, multi-query retrieval, context compression, and generation.
class AdvancedRAG:
    """
    Complete advanced RAG system with query expansion, multi-query,
    context compression, and generation.
    """

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.query_expander = QueryExpander()
        self.multi_query = MultiQueryRetriever(retriever)
        self.compressor = ContextCompressor(max_tokens=2000)

    def query(self, user_query: str, use_expansion: bool = True,
              use_multi_query: bool = True, use_compression: bool = True):
        """
        Complete advanced RAG query pipeline.

        Args:
            user_query: User's question
            use_expansion: Whether to expand query
            use_multi_query: Whether to use multi-query retrieval
            use_compression: Whether to compress context
        """
        # Step 1: Query expansion (optional)
        if use_expansion:
            expanded_query = self.query_expander.expand_with_synonyms(user_query)
        else:
            expanded_query = user_query

        # Step 2: Multi-query retrieval (optional)
        if use_multi_query:
            retrieved = self.multi_query.retrieve_multi_query(
                expanded_query,
                top_k_per_query=10,
                final_k=10
            )
        else:
            retrieved = self.retriever.retrieve(expanded_query, top_k=10)
        documents = [r['document'] for r in retrieved]
        scores = [r.get('hybrid_score', 0.5) for r in retrieved]

        # Step 3: Context compression (optional)
        if use_compression:
            context = self.compressor.compress_by_relevance(
                user_query,
                documents,
                scores
            )
        else:
            context = "\n\n".join(documents)

        # Step 4: Generate answer
        prompt = f"""Context:
{context}
Question: {user_query}
Answer based on the context above:"""
        # In production, use actual LLM API
        answer = f"Generated answer for: {user_query}"  # Placeholder
        return answer


# Example usage
# advanced_rag = AdvancedRAG(hybrid_retriever, llm)
# answer = advanced_rag.query("What is machine learning?")
Installation Requirements
Install required packages:
pip install sentence-transformers langchain openai
Note: For production, implement proper LLM-based query expansion and summarization using OpenAI API or similar services.
Real-World Applications
Where This Is Used
Generation Strategies
Stuff: Put all context in single prompt. Simple but limited by context window.
Map-reduce: Generate answer for each document, then combine. Handles large contexts.
Refine: Iteratively refine answer with each document. Most accurate but slowest.
Map-rerank: Generate and score answers, return best. Good balance.
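As an illustration of how map-reduce differs from stuffing, here is a minimal sketch; `llm` is a hypothetical callable that takes a prompt string and returns a completion string.

```python
def map_reduce_answer(question, documents, llm):
    # Map: answer the question against each document independently
    partial_answers = [
        llm(f"Document:\n{doc}\n\nQuestion: {question}\nAnswer using only this document:")
        for doc in documents
    ]
    # Reduce: combine the per-document answers into one final answer
    combined = "\n".join(f"- {a}" for a in partial_answers)
    return llm(
        f"Partial answers:\n{combined}\n\nQuestion: {question}\n"
        "Write one consolidated answer:"
    )
```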
Answer Quality
Relevance: Answer addresses the query
Faithfulness: Answer is grounded in retrieved context (not hallucinated)
Completeness: Answer is comprehensive
Citation: Can cite source documents