Chapter 7: Production RAG Systems
Deployment & Monitoring
Learning Objectives
- Understand production RAG system fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Production RAG Systems
From Prototype to Production: Building RAG Systems That Scale
Building a working RAG prototype is one thing; deploying it to production where it handles millions of documents, thousands of queries per second, and requires 99.9%+ uptime is entirely different. Production RAG systems require careful attention to scalability, reliability, monitoring, performance optimization, and error handling.
The production challenge: A prototype that works with 1,000 documents might completely fail with 10 million. A system that responds in 2 seconds for 10 users might take minutes under load. A system that works perfectly in testing might fail in production due to edge cases, network issues, or resource constraints.
Critical Production Considerations
- Scalability: Handle millions of documents and high query throughput without performance degradation
- Latency: Sub-second response times for good user experience (retrieval + generation must be fast)
- Reliability: 99.9%+ uptime with graceful error handling and fallback mechanisms
- Monitoring: Track retrieval quality, answer quality, latency, errors, and costs in real-time
- Cost Management: Optimize API costs (embedding APIs, LLM APIs) while maintaining quality
- Error Handling: Graceful degradation when retrieval fails, no documents found, or LLM errors occur
- Performance Optimization: Caching, batch processing, async operations, and model selection for speed
Production vs Prototype Differences:
Prototype:
- Works with 1,000 documents
- 2-5 second response times are acceptable
- Manual testing, no monitoring
- Crashes are tolerable; just restart
- Little or no error handling
Production:
- Must handle 10+ million documents
- Needs sub-second latency for good UX
- Requires real-time monitoring and alerting
- Requires 99.9%+ uptime with graceful degradation
- Requires comprehensive error handling and fallbacks
Key Concepts You'll Learn
- Scalability Strategies: Distributed vector databases, sharding, horizontal scaling, and efficient indexing for billion-scale systems
- Monitoring & Evaluation: Tracking retrieval metrics (precision@k, recall@k), answer quality (faithfulness, relevance), and operational metrics (latency, throughput, errors)
- Error Handling: Graceful fallbacks, retry mechanisms, validation, and circuit breakers for resilient systems
- Performance Optimization: Caching strategies, batch processing, async operations, and model selection for speed vs quality trade-offs
- Cost Optimization: Reducing API costs through caching, efficient batching, and smart model selection
- Quality Metrics: Measuring and improving retrieval quality, answer faithfulness, relevance, and completeness
- Production Best Practices: Deployment strategies, versioning, A/B testing, and continuous improvement
Why this matters: A RAG system that works in a prototype but fails in production is useless. Production deployment requires solving real-world challenges: handling scale, ensuring reliability, monitoring quality, optimizing performance, and managing costs. These considerations determine whether your RAG system succeeds or fails in real-world use.
Key Concepts
Production RAG Considerations: Building Systems That Scale
Moving from a prototype RAG system to a production system requires addressing scalability, reliability, monitoring, and performance. This section covers the critical considerations for production deployment.
1. Scalability: Handling Growth
Document Scale
The challenge: Production RAG systems often need to handle millions or billions of documents. A system that works with 10,000 documents might completely fail at 10 million.
Solutions:
- Efficient indexing: Use vector databases with scalable indexing (HNSW, IVF-PQ) that can handle billions of vectors
- Distributed storage: Partition documents across multiple nodes/servers
- Incremental updates: Support adding/updating documents without rebuilding entire index
- Metadata partitioning: Use metadata to partition documents (e.g., by date, category) for faster search
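As an illustration of the metadata partitioning idea above, the following sketch keeps one FAISS index per partition (e.g., per category) so a query searches only the partition its filter allows. This is a minimal, self-contained example assuming FAISS and 384-dimensional embeddings; in practice most vector databases provide the same effect through built-in metadata filters.

import faiss
import numpy as np

DIM = 384
partitions = {}  # category -> (FAISS index, list of document IDs)

def add_documents(category: str, doc_ids: list, embeddings: np.ndarray):
    """Add documents to the index for their metadata partition."""
    if category not in partitions:
        partitions[category] = (faiss.IndexFlatIP(DIM), [])
    index, ids = partitions[category]
    index.add(embeddings.astype(np.float32))
    ids.extend(doc_ids)

def search(category: str, query_emb: np.ndarray, k: int = 5):
    """Search only the requested partition instead of the full collection."""
    index, ids = partitions[category]
    scores, idxs = index.search(query_emb.astype(np.float32).reshape(1, -1), k)
    return [(ids[i], float(s)) for i, s in zip(idxs[0], scores[0]) if i != -1]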
Query Throughput
The challenge: Production systems need to handle hundreds or thousands of queries per second with consistent low latency.
Solutions:
- Horizontal scaling: Run multiple instances of your RAG service behind a load balancer
- Caching: Cache common queries and their results (see Performance Optimization below)
- Async processing: Use asynchronous operations to handle multiple queries concurrently
- Connection pooling: Reuse database connections instead of creating new ones for each query
Fast Retrieval (Sub-Second Latency)
Target: End-to-end latency (query → retrieval → generation → response) should be under 1-2 seconds for good user experience.
How to achieve:
- Optimized indexes: Use HNSW or similar fast indexing algorithms
- Limit retrieval scope: Use metadata filtering to reduce search space
- Efficient reranking: Rerank only top-k candidates (50-200), not entire collection
- Fast embedding models: Use smaller, faster embedding models when possible (trade-off with quality)
- CDN for static assets: Serve embedding models and other static files from a CDN to reduce load times
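To make the "optimized indexes" point concrete, here is a minimal HNSW example using FAISS (one possible library choice; any ANN library with an HNSW index works similarly). efSearch trades a small amount of recall for large latency gains at query time.

import faiss
import numpy as np

dim = 384
index = faiss.IndexHNSWFlat(dim, 32)      # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200           # build-time accuracy/speed trade-off
index.hnsw.efSearch = 64                  # query-time accuracy/speed trade-off

vectors = np.random.rand(100_000, dim).astype(np.float32)
index.add(vectors)

query = np.random.rand(1, dim).astype(np.float32)
distances, ids = index.search(query, 5)   # approximate top-5 without a full scan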
Efficient Embedding Storage
The challenge: Storing embeddings for millions of documents requires significant storage. A 384-dimensional float32 embedding is ~1.5KB, so 1M documents need ~1.5GB just for embeddings.
Solutions:
- Compression: Use product quantization (PQ) to compress embeddings (10-100x reduction)
- Deduplication: Store unique embeddings once, reference from multiple documents
- Tiered storage: Hot data (frequently accessed) in fast storage, cold data in cheaper storage
- Vector database optimization: Use databases that support efficient compression (FAISS, Qdrant)
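A rough sketch of product quantization using FAISS's IVF-PQ index (one assumed tool; other vector databases expose similar options). Here each 384-dimensional float32 vector (~1.5KB) is compressed to 48 bytes, roughly a 32x reduction; the quality impact should be measured on your own data.

import faiss
import numpy as np

dim = 384
nlist, m, nbits = 256, 48, 8              # 48 sub-quantizers x 8 bits = 48 bytes/vector
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)

vectors = np.random.rand(50_000, dim).astype(np.float32)
index.train(vectors)                      # PQ codebooks and IVF centroids need training
index.add(vectors)

index.nprobe = 16                         # inverted lists scanned per query (recall/speed)
distances, ids = index.search(vectors[:1], 5)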
2. Monitoring and Evaluation: Ensuring Quality
Retrieval Quality Metrics
Precision@k: Fraction of retrieved documents that are actually relevant. High precision = fewer irrelevant documents retrieved.
Recall@k: Fraction of relevant documents that were retrieved. High recall = fewer missed relevant documents.
How to measure:
- Manually label a test set (queries with known relevant documents)
- Run retrieval on test queries
- Calculate precision and recall for each query
- Track these metrics over time to detect degradation
Answer Quality Metrics
Faithfulness: Fraction of answer claims that are supported by the retrieved context. Measures whether the answer is grounded in the documents (not hallucinated).
Relevance: How well the answer addresses the query. Measured by semantic similarity between query and answer, or by human evaluation.
Completeness: Whether the answer fully addresses all parts of the query. Particularly important for multi-part questions.
How to measure:
- Automated: Use LLMs to evaluate faithfulness, relevance, completeness (LLM-as-judge)
- Human evaluation: Have humans rate answers on these dimensions (gold standard but expensive)
- Hybrid: Use automated evaluation for monitoring, human evaluation for critical cases
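A minimal LLM-as-judge sketch for faithfulness, assuming a hypothetical llm_client.generate() wrapper around whatever LLM API you use; the prompt wording and output parsing are illustrative, not a standard.

FAITHFULNESS_PROMPT = """You are evaluating a RAG answer.

Context:
{context}

Answer:
{answer}

List each factual claim in the answer and mark it SUPPORTED or UNSUPPORTED
by the context. On the last line, output only the fraction of supported
claims as a number between 0 and 1."""

def judge_faithfulness(llm_client, context: str, answer: str) -> float:
    """Ask an LLM to grade faithfulness; fall back to 0.0 if parsing fails."""
    response = llm_client.generate(
        FAITHFULNESS_PROMPT.format(context=context, answer=answer)
    )
    try:
        return float(response.strip().split()[-1])
    except (ValueError, IndexError):
        return 0.0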
Operational Metrics
What to monitor:
- Latency: P50, P95, P99 latencies for retrieval and generation
- Throughput: Queries per second, successful vs failed requests
- Error rates: Percentage of queries that fail or timeout
- Cost: API costs (embedding API, LLM API), infrastructure costs
- Resource usage: CPU, memory, storage usage
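Percentile latencies are straightforward to compute from raw request timings; a small sketch with NumPy and illustrative sample values:

import numpy as np

# Example latency samples (milliseconds) collected over a monitoring window
latencies_ms = np.array([120, 340, 95, 210, 1800, 150, 400, 88, 260, 310])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")

total_requests, failed_requests = 10_000, 37
print(f"Error rate: {failed_requests / total_requests:.2%}")  # 0.37%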
Logging and Alerting
What to log:
- All queries and their responses (for debugging and improvement)
- Retrieved documents and their similarity scores
- Error messages and stack traces
- Performance metrics (latency, token counts, costs)
What to alert on:
- Quality degradation (precision/recall drops below threshold)
- High error rates (>1% failures)
- Latency spikes (P95 > 2 seconds)
- Cost anomalies (unexpected API cost increases)
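A minimal sketch of threshold-based alerting on these conditions, assuming an alert() callable that forwards to your paging or chat integration and a metrics dict with these (hypothetical) keys:

THRESHOLDS = {"error_rate": 0.01, "p95_latency_s": 2.0, "precision_at_5": 0.7}

def check_alerts(metrics: dict, alert) -> None:
    """Compare current metrics against thresholds and fire alerts on violations."""
    if metrics["error_rate"] > THRESHOLDS["error_rate"]:
        alert(f"Error rate {metrics['error_rate']:.2%} exceeds 1%")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        alert(f"P95 latency {metrics['p95_latency_s']:.2f}s exceeds 2s")
    if metrics["precision_at_5"] < THRESHOLDS["precision_at_5"]:
        alert(f"Precision@5 dropped to {metrics['precision_at_5']:.2f}")

# Example: check_alerts({"error_rate": 0.02, "p95_latency_s": 1.1, "precision_at_5": 0.82}, print)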
3. Error Handling: Building Resilient Systems
Retrieval Failures
What can fail: Vector database connection, embedding API, timeout, index corruption
How to handle:
- Retry with exponential backoff: Transient failures often resolve on retry
- Fallback to cached results: If retrieval fails, use cached results for similar queries
- Graceful degradation: Return partial results or a helpful error message instead of crashing
- Circuit breakers: Stop calling failing services temporarily to prevent cascade failures
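A minimal circuit-breaker sketch (illustrative, not tied to any specific library): after repeated failures it stops calling the dependency for a cooldown period, preventing cascade failures.

import time

class CircuitBreaker:
    """Skip calls to a failing dependency for cooldown_s after max_failures errors."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None and time.time() - self.opened_at < self.cooldown_s:
            raise RuntimeError("circuit open: dependency is failing, skipping call")
        try:
            result = fn(*args, **kwargs)
            self.failures, self.opened_at = 0, None   # success resets the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()           # open the circuit
            raise

# Example (illustrative client): breaker = CircuitBreaker(); breaker.call(vector_db.search, query_vec, top_k=5)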
No Relevant Documents Found
The problem: Sometimes retrieval returns documents with very low similarity scores, or no documents at all.
How to handle:
- Similarity threshold: Only use documents above a minimum similarity score (e.g., 0.7)
- Fallback response: If no good documents found, return: "I couldn't find relevant information. Please rephrase your question."
- Query rewriting: Try query expansion or rewriting to find more documents
- Log for improvement: Track queries with no results to identify knowledge gaps
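A small sketch of the similarity threshold plus fallback response described above; the 0.7 threshold and the result format ({'text', 'score'} dicts) are illustrative and should be tuned to your embedding model.

from typing import Dict, List, Optional, Tuple

MIN_SIMILARITY = 0.7  # tune per embedding model and domain

def filter_or_fallback(results: List[Dict]) -> Tuple[List[Dict], Optional[str]]:
    """Keep only results above the threshold; return a fallback message if none remain."""
    good = [r for r in results if r.get("score", 0.0) >= MIN_SIMILARITY]
    if not good:
        return [], "I couldn't find relevant information. Please rephrase your question."
    return good, None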
Context Quality Validation
What to validate:
- Similarity scores: Ensure retrieved documents have reasonable similarity (not all very low)
- Diversity: Check that retrieved documents aren't all duplicates or very similar
- Relevance: Use a quick relevance check (e.g., keyword matching) before sending to LLM
- Token limits: Ensure context fits within LLM's context window
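A lightweight context-validation sketch covering the checks above; the character budget is a crude stand-in for a real token count (e.g., from your tokenizer), and the prefix-based duplicate check is deliberately simple.

from typing import List

def validate_context(chunks: List[str], scores: List[float],
                     min_score: float = 0.5, max_chars: int = 12_000) -> List[str]:
    """Drop low-similarity and near-duplicate chunks, then trim to a length budget."""
    kept, seen = [], set()
    for chunk, score in sorted(zip(chunks, scores), key=lambda pair: -pair[1]):
        fingerprint = chunk[:200]                 # cheap near-duplicate check
        if score < min_score or fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(chunk)

    trimmed, total = [], 0
    for chunk in kept:
        if total + len(chunk) > max_chars:        # stay within the LLM's context window
            break
        trimmed.append(chunk)
        total += len(chunk)
    return trimmed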
Retry Mechanisms
When to retry: Transient failures (network timeouts, rate limits, temporary service unavailability)
Retry strategy:
- Exponential backoff: Wait 1s, then 2s, then 4s between successive retries (see the sketch after this list)
- Max retries: Limit to 3-5 retries to avoid long delays
- Idempotency: Ensure retries don't cause duplicate operations
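A minimal retry helper implementing this strategy (illustrative; libraries such as tenacity provide the same behavior with more options):

import random
import time

def retry_with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on exception with exponential backoff (1s, 2s, 4s) plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise                                          # give up after the last retry
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))

# Example (illustrative client): results = retry_with_backoff(lambda: vector_db.search(query_emb, top_k=5))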
Performance Optimization: Making RAG Fast and Efficient
1. Caching: Reducing Redundant Computation
Query Result Caching
What to cache: Cache the final answers for common queries. If the same query is asked multiple times, return the cached answer instead of re-running retrieval and generation.
Cache key: Use the exact query text (or normalized version) as the cache key
Cache TTL: Set appropriate time-to-live based on how often your documents change. Static documents: long TTL (hours/days). Frequently updated: short TTL (minutes).
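A tiny in-process query cache with TTL as a sketch of this idea; production deployments usually put the cache in Redis or memcached so all instances share it. The normalization below (lowercasing and collapsing whitespace) is one simple choice of cache key.

import time
from typing import Any, Dict, Optional, Tuple

class TTLCache:
    """In-memory query-result cache with a time-to-live per entry."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.store: Dict[str, Tuple[float, Any]] = {}

    def _key(self, query: str) -> str:
        return " ".join(query.lower().split())      # normalized query text as cache key

    def get(self, query: str) -> Optional[Any]:
        entry = self.store.get(self._key(query))
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:      # expired: treat as a miss
            del self.store[self._key(query)]
            return None
        return value

    def set(self, query: str, value: Any) -> None:
        self.store[self._key(query)] = (time.time(), value)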
Embedding Caching
What to cache: Cache document embeddings so you don't re-embed the same documents. Also cache query embeddings for common queries.
Benefits: Embedding generation is expensive (API costs, computation time). Caching can save 50-90% of embedding costs for repeated content.
Retrieval Result Caching
What to cache: Cache the top-k retrieved documents for common queries. Even if you regenerate the answer, you can reuse the same retrieved documents.
Benefits: Avoids expensive vector database queries for repeated queries.
2. Batch Processing: Processing Multiple Items Together
Embedding batching: Instead of embedding documents one-by-one, batch them together (e.g., 32-128 documents at a time). Most embedding APIs support batching and it's much more efficient.
Query batching: If you have multiple queries to process, batch them and process in parallel. This improves throughput significantly.
Index updates: When adding many documents, batch the index updates rather than updating one-by-one.
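As an example of embedding batching, sentence-transformers can embed a whole corpus in one call; batch_size controls how many texts go through the model per forward pass.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [f"Document {i} text..." for i in range(10_000)]

# Batched encoding is far faster than calling encode() one document at a time.
embeddings = model.encode(documents, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (10000, 384)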
3. Async Operations: Parallelizing Independent Work
Parallel retrieval and generation: If you're using multiple retrieval strategies (dense + sparse), run them in parallel instead of sequentially.
Async LLM calls: If generating answers for multiple queries, use async/await to process them concurrently.
Pipeline parallelism: While one query is being processed by the LLM, start processing the next query's retrieval.
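A minimal asyncio sketch of concurrent query processing; the retrieve_async and generate_async stubs stand in for real async clients (e.g., async HTTP calls to your vector database and LLM API).

import asyncio

async def retrieve_async(query: str) -> str:
    await asyncio.sleep(0.1)                        # stand-in for an async vector DB call
    return f"context for: {query}"

async def generate_async(query: str, context: str) -> str:
    await asyncio.sleep(0.3)                        # stand-in for an async LLM call
    return f"answer to '{query}' using [{context}]"

async def answer_query(query: str) -> str:
    context = await retrieve_async(query)
    return await generate_async(query, context)

async def answer_many(queries: list) -> list:
    # All queries run concurrently; total time is roughly one query's latency, not the sum.
    return await asyncio.gather(*(answer_query(q) for q in queries))

results = asyncio.run(answer_many(["What is RAG?", "What is HNSW?"]))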
4. Model Selection: Balancing Quality and Latency
Embedding models: Smaller models (e.g., all-MiniLM-L6-v2, 384 dims) are faster but may have lower quality. Larger models (e.g., all-mpnet-base-v2, 768 dims) are slower but higher quality. Choose based on your latency requirements.
LLM selection: For generation, smaller/faster models (GPT-3.5-turbo) are faster and cheaper but may have lower quality. Larger models (GPT-4) are slower and more expensive but higher quality. Consider using smaller models for simple queries, larger for complex ones.
Reranking models: Cross-encoder reranking is slower but more accurate. Consider skipping reranking for simple queries, using it only for complex ones.
5. Additional Optimizations
- Connection pooling: Reuse database connections instead of creating new ones
- Pre-warming: Load models and indexes into memory at startup to avoid cold starts
- CDN for static assets: Serve embedding models and other static files from CDN
- Database query optimization: Use appropriate indexes, limit result sets, use pagination
- Monitoring and profiling: Identify bottlenecks through profiling and optimize the slowest parts
Mathematical Formulations
Production RAG Evaluation Metrics
Measuring RAG system performance requires quantitative metrics for both retrieval quality and answer quality. These formulas provide standardized ways to evaluate, monitor, and improve production RAG systems. Understanding these metrics is essential for ensuring your system meets quality standards.
1. Retrieval Precision@k
What This Measures:
Precision@k measures the quality of retrieval - what fraction of the top-k retrieved documents are actually relevant to the query. High precision means you're retrieving mostly relevant documents (few false positives).
\[ \text{Precision@}k = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved top-}k\}|}{k} \]
Breaking It Down:
- \(\{\text{relevant docs}\}\): Set of all documents in the knowledge base that are actually relevant to the query (ground truth, typically labeled by humans)
- \(\{\text{retrieved top-k}\}\): Set of the top-k documents returned by the retrieval system
- \(\cap\): Set intersection - documents that are both relevant AND retrieved
- \(|\ldots|\): Cardinality (size) of the set
- \(k\): Number of documents retrieved (e.g., k=5 means top-5 documents)
Interpretation:
- Precision@k = 1.0: All retrieved documents are relevant (perfect precision, no false positives)
- Precision@k = 0.8: 80% of retrieved documents are relevant (good precision)
- Precision@k = 0.5: Only 50% of retrieved documents are relevant (poor precision, many false positives)
- Precision@k = 0.0: None of the retrieved documents are relevant (worst case)
Example:
Query: "What is Python?"
Relevant documents (ground truth): {doc1, doc2, doc3, doc4, doc5}
Retrieved top-5: {doc1, doc6, doc2, doc7, doc3}
Intersection: {doc1, doc2, doc3} (3 documents are both relevant and retrieved)
Precision@5: \(\frac{3}{5} = 0.6\) (60% of retrieved docs are relevant)
Why Precision Matters:
High precision means the LLM receives mostly relevant context, leading to better answers. Low precision means the LLM gets irrelevant context, which can cause confusion or hallucination.
Typical Values:
- Production systems: Precision@5 = 0.7-0.9 (70-90% of top-5 are relevant)
- Good systems: Precision@5 = 0.8-0.95
- Excellent systems: Precision@5 > 0.9
2. Retrieval Recall@k
What This Measures:
Recall@k measures the coverage of retrieval - what fraction of all relevant documents were actually retrieved. High recall means you're finding most relevant documents (few false negatives).
\[ \text{Recall@}k = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved top-}k\}|}{|\{\text{relevant docs}\}|} \]
Breaking It Down:
- \(\{\text{relevant docs}\}\): Set of all documents that are actually relevant (ground truth)
- \(\{\text{retrieved top-k}\}\): Set of top-k documents retrieved
- \(\{\text{relevant docs}\} \cap \{\text{retrieved top-k}\}\): Relevant documents that were successfully retrieved
- \(|\{\text{relevant docs}\}|\): Total number of relevant documents (denominator)
Interpretation:
- Recall@k = 1.0: All relevant documents were retrieved (perfect recall, no false negatives)
- Recall@k = 0.8: 80% of relevant documents were retrieved (good recall)
- Recall@k = 0.5: Only 50% of relevant documents were retrieved (poor recall, many missed)
- Recall@k = 0.0: No relevant documents were retrieved (worst case)
Example:
Query: "What is Python?"
Relevant documents: {doc1, doc2, doc3, doc4, doc5} (5 total relevant)
Retrieved top-5: {doc1, doc6, doc2, doc7, doc3}
Intersection: {doc1, doc2, doc3} (3 relevant docs retrieved)
Recall@5: \(\frac{3}{5} = 0.6\) (60% of relevant docs were found)
❌ Problem: doc4 and doc5 are relevant but weren't retrieved (missed 40% of relevant docs)
Precision vs Recall Trade-off:
- High precision, low recall: Retrieved docs are very relevant, but you miss many relevant docs
- Low precision, high recall: You find most relevant docs, but also retrieve many irrelevant ones
- Ideal: High precision AND high recall (retrieve mostly relevant docs AND find most relevant docs)
Why Recall Matters:
Low recall means you're missing relevant information. If the answer requires information from doc4 and doc5, but they weren't retrieved, the LLM can't generate a complete answer.
Typical Values:
- Production systems: Recall@10 = 0.6-0.8 (60-80% of relevant docs found in top-10)
- Good systems: Recall@10 = 0.7-0.9
- Note: Recall typically increases with k (Recall@10 > Recall@5)
3. F1 Score (Harmonic Mean of Precision and Recall)
What This Measures:
F1 score is the harmonic mean of precision and recall, providing a single metric that balances both. It's useful when you need one number to summarize retrieval quality.
\[ F1@k = 2 \times \frac{\text{Precision@}k \times \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k} \]
Breaking It Down:
- Harmonic mean: More conservative than arithmetic mean - penalizes systems that are good at one metric but poor at the other
- Range: [0, 1], where 1.0 is perfect (both precision and recall are 1.0)
- F1 is high only when BOTH precision and recall are high
Why Harmonic Mean?
Arithmetic mean can be misleading. A system with Precision=0.9, Recall=0.1 has arithmetic mean 0.5, but F1=0.18 (correctly identifies it as poor). Harmonic mean penalizes imbalance.
Example:
System A: Precision@5=0.9, Recall@5=0.5
F1@5: \(2 \times \frac{0.9 \times 0.5}{0.9 + 0.5} = 2 \times \frac{0.45}{1.4} = 0.64\)
System B: Precision@5=0.7, Recall@5=0.7
F1@5: \(2 \times \frac{0.7 \times 0.7}{0.7 + 0.7} = 2 \times \frac{0.49}{1.4} = 0.70\)
✅ System B has higher F1 despite lower precision, because it's more balanced.
4. Answer Faithfulness
What This Measures:
Faithfulness measures how well the answer is grounded in the retrieved context. It's the fraction of claims/facts in the answer that are supported by the context. High faithfulness means the answer is based on the documents (not hallucinated).
\[ \text{Faithfulness} = \frac{|\{\text{claims in answer}\} \cap \{\text{claims in context}\}|}{|\{\text{claims in answer}\}|} \]
Breaking It Down:
- \(\{\text{claims in answer}\}\): Set of factual claims made in the generated answer (e.g., "Paris is the capital", "France is in Europe")
- \(\{\text{claims in context}\}\): Set of factual claims present in the retrieved context
- \(\{\text{claims in answer}\} \cap \{\text{claims in context}\}\): Claims that appear in both answer and context (supported claims)
- \(|\{\text{claims in answer}\}|\): Total number of claims in the answer (denominator)
Interpretation:
- Faithfulness = 1.0: All answer claims are supported by context (perfect grounding, no hallucinations)
- Faithfulness = 0.8: 80% of claims are supported (good, minor hallucinations)
- Faithfulness = 0.5: Only 50% of claims are supported (poor, significant hallucinations)
- Faithfulness = 0.0: No claims are supported (worst case, answer is completely hallucinated)
Example:
Query: "What is the capital of France?"
Context: "France is a country in Europe. Its capital city is Paris."
Answer: "The capital of France is Paris. France is located in Europe."
Claims in answer: {"capital of France is Paris", "France is in Europe"}
Claims in context: {"France is a country", "France is in Europe", "capital is Paris"}
Supported claims: {"France is in Europe", "capital is Paris"} (2 out of 2)
Faithfulness: \(\frac{2}{2} = 1.0\) (perfect - all claims supported)
Bad example:
Answer: "The capital of France is Paris. France has a population of 70 million."
Claims: {"capital is Paris", "population is 70 million"}
Supported: {"capital is Paris"} (1 out of 2)
Faithfulness: \(\frac{1}{2} = 0.5\) (50% - population claim is hallucinated, not in context)
Why Faithfulness is Critical:
Low faithfulness means the LLM is making up information not in the retrieved documents. This defeats the purpose of RAG (grounding answers in documents). High faithfulness ensures answers are factual and verifiable.
Typical Values:
- Production systems: Faithfulness = 0.85-0.95 (85-95% of claims supported)
- Good systems: Faithfulness > 0.9
- Critical systems: Faithfulness > 0.95 (medical, legal, financial domains)
5. Answer Relevance
What This Measures:
Relevance measures how well the answer addresses the query using semantic similarity. It's calculated as the cosine similarity between the query embedding and answer embedding. High relevance means the answer is semantically similar to what was asked.
\[ \text{Relevance} = \text{cosine}(E(\text{query}), E(\text{answer})) \]
Breaking It Down:
- \(E(\text{query})\): Embedding vector of the user's query
- \(E(\text{answer})\): Embedding vector of the generated answer
- \(\text{cosine}(\ldots)\): Cosine similarity between the two embeddings (range: [-1, 1]; in practice, text embeddings of related content usually score in [0, 1])
Intuition:
If the answer is relevant to the query, their embeddings should point in similar directions in vector space, resulting in high cosine similarity. If the answer is off-topic, embeddings point in different directions, resulting in low similarity.
Example:
Query: "What is machine learning?"
Relevant answer: "Machine learning is a subset of artificial intelligence that enables computers to learn from data without explicit programming."
Cosine similarity: 0.92 (very high - answer directly addresses the query)
Irrelevant answer: "The weather today is sunny with a high of 75 degrees."
Cosine similarity: 0.15 (very low - answer is completely off-topic)
Partially relevant answer: "Artificial intelligence includes various techniques."
Cosine similarity: 0.65 (moderate - somewhat related but not directly answering)
Why This Matters:
Even if an answer is faithful (grounded in context), it might not be relevant to the query. For example, if someone asks "What is Python?" and the system answers "Python is a snake species," the answer is faithful to some context but not relevant to the programming question.
Typical Values:
- Highly relevant: Relevance > 0.85
- Moderately relevant: Relevance = 0.7-0.85
- Low relevance: Relevance < 0.7
6. Answer Completeness
What This Measures:
Completeness measures whether the answer addresses all parts of a multi-part or complex query. A query might have multiple aspects, and completeness checks if all aspects were covered.
\[ \text{Completeness} = \frac{|\{\text{query aspects addressed}\}|}{|\{\text{all query aspects}\}|} \]
Breaking It Down:
- \(\{\text{all query aspects}\}\): Set of all distinct questions or topics in the query
- \(\{\text{query aspects addressed}\}\): Set of aspects that the answer actually covers
- Completeness: Fraction of query aspects that were addressed
Example:
Query: "What is the capital of France and what is its population?"
Query aspects: {"capital of France", "population of France"}
Complete answer: "The capital of France is Paris. France has a population of approximately 67 million people."
Aspects addressed: {"capital of France", "population of France"} (2 out of 2)
Completeness: \(\frac{2}{2} = 1.0\) (100% complete)
Incomplete answer: "The capital of France is Paris."
Aspects addressed: {"capital of France"} (1 out of 2)
Completeness: \(\frac{1}{2} = 0.5\) (50% complete - missing population)
Why Completeness Matters:
Users ask questions expecting complete answers. If a query has multiple parts and only some are answered, the user experience is poor. Completeness is especially important for complex, multi-hop questions.
Detailed Examples
Step-by-Step Examples
Example: Evaluating Retrieval
Query: "What is Python?"
Relevant documents: ["Python is a programming language", "Python tutorial"]
Retrieved top-3: ["Python is a programming language", "Java tutorial", "Python tutorial"]
Precision@3: 2/3 = 0.67 (2 relevant out of 3 retrieved)
Recall@3: 2/2 = 1.0 (all relevant docs retrieved)
Example: Evaluating Generation
Query: "What is the capital of France?"
Context: "France is a country. Its capital is Paris."
Generated answer: "The capital of France is Paris."
Faithfulness: 1.0 (answer fully supported by context)
Relevance: 0.95 (high semantic similarity to query)
Completeness: 1.0 (answer is complete)
Implementation
Implementation Overview
This section provides practical Python code examples for evaluating RAG systems and building production-ready implementations with error handling, monitoring, caching, and scalability features. These implementations are essential for deploying RAG systems in real-world environments.
1. Complete RAG Evaluation System
What this does: Implements comprehensive evaluation metrics for both retrieval and generation quality, including precision, recall, faithfulness, relevance, and completeness.
import re
from typing import List, Set, Dict
from sentence_transformers import SentenceTransformer
import numpy as np

class RAGEvaluator:
    """
    Comprehensive RAG system evaluator.

    Evaluates both retrieval quality (precision, recall) and
    generation quality (faithfulness, relevance, completeness).
    """

    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def evaluate_retrieval(self, relevant_doc_ids: Set[str],
                           retrieved_doc_ids: List[str], k: int = 10) -> Dict:
        """
        Evaluate retrieval quality.

        Args:
            relevant_doc_ids: Set of document IDs that are actually relevant
            retrieved_doc_ids: List of retrieved document IDs (in order)
            k: Number of top-k to evaluate

        Returns:
            Dictionary with precision@k, recall@k, f1@k
        """
        # Get top-k retrieved
        top_k_retrieved = set(retrieved_doc_ids[:k])

        # Calculate metrics
        relevant_retrieved = relevant_doc_ids & top_k_retrieved
        precision = len(relevant_retrieved) / k if k > 0 else 0
        recall = len(relevant_retrieved) / len(relevant_doc_ids) if relevant_doc_ids else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        return {
            'precision@k': precision,
            'recall@k': recall,
            'f1@k': f1,
            'relevant_retrieved': len(relevant_retrieved),
            'total_relevant': len(relevant_doc_ids),
            'k': k
        }

    def evaluate_faithfulness(self, answer: str, context: str) -> float:
        """
        Evaluate if answer claims are supported by context.

        Args:
            answer: Generated answer
            context: Retrieved context

        Returns:
            Faithfulness score [0, 1]
        """
        # Extract key claims from answer (simplified - use NER in production)
        answer_claims = self._extract_claims(answer)
        context_claims = self._extract_claims(context)

        # Check how many answer claims appear in context
        supported = len(set(answer_claims) & set(context_claims))
        faithfulness = supported / len(answer_claims) if answer_claims else 1.0
        return faithfulness

    def evaluate_relevance(self, query: str, answer: str) -> float:
        """
        Evaluate semantic relevance between query and answer.

        Args:
            query: User query
            answer: Generated answer

        Returns:
            Relevance score [0, 1] (cosine similarity)
        """
        query_emb = self.embedder.encode([query])
        answer_emb = self.embedder.encode([answer])

        # Cosine similarity
        similarity = np.dot(query_emb[0], answer_emb[0]) / (
            np.linalg.norm(query_emb[0]) * np.linalg.norm(answer_emb[0])
        )

        # Map cosine similarity from [-1, 1] to [0, 1]
        relevance = (similarity + 1) / 2
        return float(relevance)

    def evaluate_completeness(self, query: str, answer: str) -> float:
        """
        Evaluate if answer addresses all aspects of query.

        Args:
            query: User query (may have multiple parts)
            answer: Generated answer

        Returns:
            Completeness score [0, 1]
        """
        # Extract query aspects (simplified)
        query_aspects = self._extract_query_aspects(query)
        answered_aspects = self._check_aspects_answered(query_aspects, answer)
        completeness = answered_aspects / len(query_aspects) if query_aspects else 1.0
        return completeness

    def _extract_claims(self, text: str) -> List[str]:
        """Extract factual claims from text (simplified)."""
        # In production, use NER, dependency parsing, or LLM-based extraction
        sentences = text.split('.')
        return [s.strip() for s in sentences if len(s.strip()) > 10]

    def _extract_query_aspects(self, query: str) -> List[str]:
        """Extract different aspects/questions from query."""
        # Simple: split by "and", "or", etc.
        aspects = re.split(r'\s+and\s+|\s+or\s+', query.lower())
        return [a.strip() for a in aspects if a.strip()]

    def _check_aspects_answered(self, aspects: List[str], answer: str) -> int:
        """Check how many aspects are addressed in answer."""
        answer_lower = answer.lower()
        answered = sum(1 for aspect in aspects if any(word in answer_lower for word in aspect.split()))
        return answered

    def evaluate_complete(self, query: str, answer: str, context: str,
                          relevant_doc_ids: Set[str], retrieved_doc_ids: List[str]) -> Dict:
        """
        Complete evaluation of RAG system.

        Returns:
            Dictionary with all evaluation metrics
        """
        retrieval_metrics = self.evaluate_retrieval(relevant_doc_ids, retrieved_doc_ids)
        faithfulness = self.evaluate_faithfulness(answer, context)
        relevance = self.evaluate_relevance(query, answer)
        completeness = self.evaluate_completeness(query, answer)

        # Overall quality score
        quality = 0.4 * faithfulness + 0.4 * relevance + 0.2 * completeness

        return {
            **retrieval_metrics,
            'faithfulness': faithfulness,
            'relevance': relevance,
            'completeness': completeness,
            'overall_quality': quality
        }
# Example usage
evaluator = RAGEvaluator()
# Example evaluation
relevant = {'doc1', 'doc2', 'doc3'}
retrieved = ['doc1', 'doc4', 'doc2', 'doc5', 'doc6']
metrics = evaluator.evaluate_retrieval(relevant, retrieved, k=5)
print("Retrieval Metrics:")
print(f"Precision@5: {metrics['precision@k']:.3f}")
print(f"Recall@5: {metrics['recall@k']:.3f}")
print(f"F1@5: {metrics['f1@k']:.3f}")
query = "What is machine learning?"
answer = "Machine learning is a subset of AI that learns from data."
context = "Machine learning is a subset of artificial intelligence. It enables computers to learn from data."
faithfulness = evaluator.evaluate_faithfulness(answer, context)
relevance = evaluator.evaluate_relevance(query, answer)
print(f"\nGeneration Metrics:")
print(f"Faithfulness: {faithfulness:.3f}")
print(f"Relevance: {relevance:.3f}")
Key Points:
- Retrieval metrics: Precision, recall, F1 measure retrieval quality
- Generation metrics: Faithfulness, relevance, completeness measure answer quality
- Production use: Run evaluation on test sets to monitor system performance
2. Production RAG System with Monitoring and Error Handling
What this does: Implements a production-ready RAG system with comprehensive error handling, caching, logging, monitoring, and fallback mechanisms.
import time
import logging
from typing import Dict, List

class ProductionRAG:
    """
    Production-ready RAG system with error handling, caching,
    monitoring, and scalability features.
    """

    def __init__(self, embedder, vector_db, llm, cache_size: int = 1000):
        """
        Initialize production RAG system.

        Args:
            embedder: Embedding model
            vector_db: Vector database client
            llm: Language model client
            cache_size: Maximum cache size
        """
        self.embedder = embedder
        self.vector_db = vector_db
        self.llm = llm
        self.cache = {}
        self.cache_size = cache_size

        # Setup logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

        # Metrics tracking
        self.metrics = {
            'total_queries': 0,
            'cache_hits': 0,
            'errors': 0,
            'avg_latency': 0.0
        }

    def query(self, question: str, top_k: int = 5,
              use_cache: bool = True, timeout: float = 10.0) -> Dict:
        """
        Query RAG system with error handling and monitoring.

        Args:
            question: User question
            top_k: Number of documents to retrieve
            use_cache: Whether to use caching
            timeout: Maximum time allowed for query

        Returns:
            Dictionary with 'answer', 'sources', 'latency', 'cached'
        """
        start_time = time.time()
        self.metrics['total_queries'] += 1

        try:
            # Check cache
            if use_cache and question in self.cache:
                self.metrics['cache_hits'] += 1
                cached_result = self.cache[question]
                cached_result['cached'] = True
                cached_result['latency'] = time.time() - start_time
                self.logger.info(f"Cache hit for query: {question[:50]}...")
                return cached_result

            # Retrieve with timeout
            contexts = self._retrieve_with_timeout(question, top_k, timeout / 2)

            if not contexts:
                return {
                    'answer': "I couldn't find relevant information to answer your question.",
                    'sources': [],
                    'latency': time.time() - start_time,
                    'cached': False,
                    'error': None
                }

            # Generate with timeout
            answer = self._generate_with_timeout(question, contexts, timeout / 2)

            # Format result
            result = {
                'answer': answer,
                'sources': contexts,
                'latency': time.time() - start_time,
                'cached': False,
                'error': None
            }

            # Cache result
            if use_cache:
                self._add_to_cache(question, result)

            # Update metrics
            self._update_metrics(result['latency'])

            return result

        except Exception as e:
            self.metrics['errors'] += 1
            self.logger.error(f"Error processing query: {e}", exc_info=True)
            return {
                'answer': "I encountered an error processing your question. Please try again.",
                'sources': [],
                'latency': time.time() - start_time,
                'cached': False,
                'error': str(e)
            }

    def _retrieve_with_timeout(self, question: str, top_k: int, timeout: float) -> List[str]:
        """Retrieve with timeout protection."""
        try:
            query_embedding = self.embedder.encode([question])
            results = self.vector_db.search(query_embedding, top_k=top_k, timeout=timeout)
            return [r['text'] for r in results if r.get('score', 0) > 0.7]
        except Exception as e:
            self.logger.warning(f"Retrieval error: {e}")
            return []

    def _generate_with_timeout(self, question: str, contexts: List[str], timeout: float) -> str:
        """Generate answer with timeout protection."""
        try:
            context = "\n\n".join(contexts)
            prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
            answer = self.llm.generate(prompt, timeout=timeout)
            return answer
        except Exception as e:
            self.logger.warning(f"Generation error: {e}")
            return "I couldn't generate an answer. Please try rephrasing your question."

    def _add_to_cache(self, question: str, result: Dict):
        """Add result to cache with size limit."""
        if len(self.cache) >= self.cache_size:
            # Remove oldest entry (FIFO)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]

        # Store result without latency (will be recalculated)
        cache_entry = {k: v for k, v in result.items() if k != 'latency'}
        self.cache[question] = cache_entry

    def _update_metrics(self, latency: float):
        """Update average latency metric."""
        total = self.metrics['total_queries']
        current_avg = self.metrics['avg_latency']
        self.metrics['avg_latency'] = (current_avg * (total - 1) + latency) / total

    def get_metrics(self) -> Dict:
        """Get current system metrics."""
        cache_hit_rate = (self.metrics['cache_hits'] / self.metrics['total_queries']
                          if self.metrics['total_queries'] > 0 else 0)
        error_rate = (self.metrics['errors'] / self.metrics['total_queries']
                      if self.metrics['total_queries'] > 0 else 0)

        return {
            **self.metrics,
            'cache_hit_rate': cache_hit_rate,
            'error_rate': error_rate
        }

    def health_check(self) -> Dict:
        """Perform health check on system components."""
        health = {
            'status': 'healthy',
            'components': {}
        }

        # Check embedder
        try:
            test_emb = self.embedder.encode(["test"])
            health['components']['embedder'] = 'healthy'
        except Exception as e:
            health['components']['embedder'] = f'unhealthy: {e}'
            health['status'] = 'degraded'

        # Check vector DB
        try:
            # Try a simple query
            test_results = self.vector_db.search([[0.0] * 384], top_k=1)
            health['components']['vector_db'] = 'healthy'
        except Exception as e:
            health['components']['vector_db'] = f'unhealthy: {e}'
            health['status'] = 'degraded'

        # Check LLM
        try:
            # In production, use actual health check endpoint
            health['components']['llm'] = 'healthy'
        except Exception as e:
            health['components']['llm'] = f'unhealthy: {e}'
            health['status'] = 'unhealthy'

        return health
# Example usage
# rag = ProductionRAG(embedder, vector_db, llm)
# result = rag.query("What is machine learning?", top_k=5)
# print(f"Answer: {result['answer']}")
# print(f"Latency: {result['latency']:.2f}s")
# print(f"Cached: {result['cached']}")
#
# # Monitor system
# metrics = rag.get_metrics()
# print(f"Cache hit rate: {metrics['cache_hit_rate']:.2%}")
# print(f"Average latency: {metrics['avg_latency']:.2f}s")
#
# # Health check
# health = rag.health_check()
# print(f"System status: {health['status']}")
Key Points:
- Error handling: Graceful degradation on failures
- Caching: Reduces latency and costs for repeated queries
- Monitoring: Tracks metrics for performance analysis
- Health checks: Ensures system components are operational
- Timeouts: Prevents hanging on slow operations
Installation Requirements
Install required packages:
pip install sentence-transformers numpy
Note: For production, add proper logging infrastructure (e.g., ELK stack), monitoring (e.g., Prometheus), and distributed caching (e.g., Redis).
Real-World Applications
Production RAG Systems
Enterprise knowledge bases:
- Internal documentation search (Confluence, Notion)
- Company policy Q&A systems
- Technical support knowledge bases
Customer-facing applications:
- E-commerce product Q&A
- FAQ chatbots
- Help center assistants
Research and analysis:
- Legal document analysis systems
- Medical literature Q&A
- Academic paper search and summarization
Best Practices
Deployment: Use managed vector databases, implement caching, monitor performance
Quality: Regular evaluation, A/B testing, continuous improvement
Reliability: Error handling, fallbacks, retries, graceful degradation
Security: Access control, data privacy, input validation