Chapter 3: Document Processing & Chunking
Preparing Documents for Retrieval
Learning Objectives
- Understand why documents must be processed and chunked before retrieval
- Compare fixed-size, sentence-based, semantic, and recursive chunking strategies
- Apply the formulas for chunk count, overlap percentage, and storage overhead
- Implement chunking pipelines in Python with LangChain and NLTK
- Recognize how chunking choices play out in real-world RAG systems
Document Processing & Chunking
Why Document Processing is Critical in RAG
Before you can retrieve relevant information, you need to prepare your documents. Real-world documents come in many formats (PDFs, Word docs, HTML pages, markdown files, databases) and sizes (from short paragraphs to entire books with thousands of pages). Document processing and chunking is the crucial first step that transforms raw documents into a format that RAG systems can effectively search and retrieve.
The fundamental challenge: Most documents are too large to process as a single unit. LLMs have context window limits (typically 4K-128K tokens), and embedding models work best with text segments of 128-512 tokens. A single research paper might be 10,000+ words, and a technical manual could be hundreds of pages. You can't embed or process these as single units.
What Document Processing & Chunking Involves
- Document Loading: Extract text from various formats (PDF, HTML, Word, etc.) while preserving structure and metadata
- Text Cleaning: Remove formatting artifacts, handle special characters, normalize whitespace
- Chunking: Split large documents into smaller, semantically meaningful pieces (chunks) that fit within context windows
- Metadata Extraction: Capture document properties (title, author, date, section, category) for filtering and organization
- Context Preservation: Maintain relationships between chunks (overlap, hierarchy) so retrieved chunks have sufficient context
Real-World Example:
Imagine you have a 200-page technical manual about machine learning. Without proper processing:
- ❌ The entire document is too large to embed meaningfully (loses semantic precision)
- ❌ Retrieval returns the whole manual even for specific questions (wastes tokens, poor accuracy)
- ❌ LLM struggles to find relevant information in 200 pages of text
With proper chunking:
- ✅ Manual is split into 500 focused chunks (one per section/topic)
- ✅ Each chunk is embedded separately and stored in vector database
- ✅ Query about "gradient descent" retrieves only the 2-3 most relevant chunks
- ✅ LLM receives focused, relevant context and generates accurate answers
Key Concepts You'll Learn
- Chunking Strategies: Fixed-size, sentence-based, semantic, and recursive chunking - when to use each and why
- Chunk Overlap: Why overlapping chunks preserve context at boundaries and how to choose the right overlap percentage
- Optimal Chunk Size: Balancing retrieval precision (smaller chunks) with context completeness (larger chunks)
- Document Type Handling: Processing PDFs, HTML, markdown, and structured documents with appropriate parsers
- Metadata Management: Extracting and storing document properties for filtering and organization
- Context Preservation: Techniques to maintain semantic relationships between chunks
Why this matters: Poor chunking leads to poor retrieval. If chunks are too large, retrieval is imprecise. If chunks are too small, context is lost. If chunks break at the wrong boundaries, semantic meaning is destroyed. Getting chunking right is foundational to RAG system performance.
Key Concepts
Why Chunking is Critical for RAG Systems
The Core Problem
Raw documents in real-world RAG systems are often extremely long—think research papers (10,000+ words), legal documents (hundreds of pages), technical documentation (thousands of sections), or entire books. These documents present several fundamental challenges:
- Context Window Limits: Most LLMs have fixed context windows (e.g., GPT-4: 8K-128K tokens, Claude: 100K-200K tokens). A single large document can easily exceed these limits, making it impossible to process the entire document at once.
- Embedding Model Constraints: Embedding models like SentenceTransformers work best with text segments of 128-512 tokens. Very long documents produce embeddings that lose semantic precision—the model struggles to capture the meaning of a 10,000-word document in a single 384-dimensional vector.
- Retrieval Precision: When you retrieve a 50-page document for a specific question, most of that document is irrelevant. The LLM has to sift through thousands of words to find the answer, leading to poor performance and high costs.
- Computational Efficiency: Processing entire documents is computationally expensive. Smaller chunks allow for faster embedding generation, more efficient storage, and quicker retrieval.
The Solution: Intelligent Chunking
Chunking splits large documents into smaller, manageable pieces that:
- Fit within context limits: Each chunk is small enough to fit comfortably in the LLM's context window, even when combined with the query and other chunks.
- Are semantically meaningful: Each chunk represents a coherent unit of information (a paragraph, a section, a concept) rather than arbitrary text splits.
- Can be retrieved independently: Each chunk can be embedded and stored separately, allowing the retrieval system to find the most relevant chunk(s) for a specific query.
- Maintain context when possible: Chunks preserve surrounding context (through overlap or metadata) so the LLM understands the broader context when generating answers.
Chunking Strategies: Choosing the Right Approach
Different chunking strategies serve different purposes. The choice depends on your document type, use case, and performance requirements.
1. Fixed-Size Chunking
What it is: Splits documents into chunks of a fixed size (measured in characters or tokens), regardless of content structure.
How it works:
- Divide the document into equal-sized segments (e.g., 500 characters or 200 tokens)
- Each chunk has exactly the same size (except possibly the last chunk)
- No consideration for sentence boundaries, paragraphs, or semantic meaning
Advantages:
- ✅ Simple and fast: Very easy to implement, no complex logic needed
- ✅ Predictable: You know exactly how many chunks you'll get for any document size
- ✅ Efficient storage: Uniform chunk sizes make storage and indexing straightforward
- ✅ Good for uniform content: Works well when documents have consistent structure
Disadvantages:
- ❌ May break sentences: A chunk might end mid-sentence, losing meaning
- ❌ Ignores semantic boundaries: A single concept might be split across two chunks
- ❌ Poor for structured content: Doesn't respect paragraphs, sections, or logical divisions
When to use: When you have uniform, unstructured text where semantic boundaries don't matter much, or when you need maximum speed and simplicity.
2. Sentence-Based Chunking
What it is: Splits documents at sentence boundaries, grouping multiple sentences into chunks of roughly equal size.
How it works:
- Identify sentence boundaries using NLP tools (NLTK, spaCy, or regex)
- Group sentences together until reaching a target chunk size (e.g., 5-10 sentences or 200-500 tokens)
- Each chunk contains complete sentences, never breaking mid-sentence
Advantages:
- ✅ Preserves sentence integrity: Sentences remain intact, maintaining grammatical and semantic coherence
- ✅ Better semantic coherence: Related sentences stay together, improving embedding quality
- ✅ Respects natural boundaries: Works with how humans structure information
- ✅ Good for narrative content: Excellent for articles, stories, and prose
Disadvantages:
- ❌ Variable chunk sizes: Chunks may vary significantly in size depending on sentence length
- ❌ May split related concepts: A concept spanning multiple sentences might be split across chunks
- ❌ Requires sentence detection: Needs reliable sentence segmentation (can fail with abbreviations, decimals, etc.)
When to use: For narrative text, articles, blog posts, or any content where sentence boundaries matter. This is often the default choice for general-purpose RAG systems.
3. Semantic Chunking (Advanced)
What it is: Uses embeddings and similarity calculations to group semantically related sentences together, creating chunks based on meaning rather than size.
How it works:
- Embed each sentence (or small group of sentences) into vector space
- Calculate similarity between consecutive sentences
- When similarity drops below a threshold, that's a chunk boundary (new topic/concept)
- Group similar sentences together until reaching a maximum chunk size
Advantages:
- ✅ Most semantically coherent: Chunks represent complete concepts or topics
- ✅ Adaptive to content: Automatically adjusts to document structure
- ✅ Better retrieval quality: Chunks are more likely to be fully relevant or fully irrelevant
- ✅ Respects topic boundaries: Natural breaks occur at topic transitions
Disadvantages:
- ❌ Computationally expensive: Requires embedding every sentence, then calculating similarities
- ❌ More complex to implement: Needs careful tuning of similarity thresholds
- ❌ Variable chunk sizes: Can produce very small or very large chunks
- ❌ Requires embedding model: Needs a good sentence embedding model to work well
When to use: For high-quality RAG systems where retrieval precision matters more than speed. Ideal for technical documentation, research papers, or any content with clear topic boundaries.
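The Implementation section below covers fixed-size, sentence-based, and recursive chunking in code; as a complement, here is a minimal sketch of threshold-based semantic chunking. It assumes the sentence-transformers and NLTK packages are installed, and the model name and 0.5 threshold are illustrative assumptions, not tuned values:

```python
# Minimal semantic chunking sketch (assumes: pip install sentence-transformers nltk)
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_chunk(text, similarity_threshold=0.5, model_name='all-MiniLM-L6-v2'):
    """Group consecutive sentences into chunks, breaking where similarity drops."""
    sentences = sent_tokenize(text)
    if len(sentences) <= 1:
        return [text]
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)  # one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with its immediate predecessor
        similarity = float(cos_sim(embeddings[i - 1], embeddings[i]))
        if similarity < similarity_threshold:
            chunks.append(' '.join(current))  # topic change: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(' '.join(current))
    return chunks
```

A production version would also enforce a maximum chunk size and might compare each new sentence against a rolling window of recent sentences rather than only its immediate predecessor.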
4. Recursive Chunking (Hierarchical)
What it is: A hybrid approach that tries multiple chunking strategies in a hierarchy (e.g., try paragraphs first, then sentences, then fixed-size).
How it works:
- First, try to split by paragraphs (if they exist and are reasonable size)
- If paragraphs are too large, split by sentences
- If sentences are still too large, use fixed-size chunking as fallback
- Maintains hierarchy: parent chunks contain metadata about child chunks
Advantages:
- ✅ Adaptive: Automatically chooses the best strategy for each part of the document
- ✅ Respects structure: Uses document structure when available
- ✅ Robust: Falls back gracefully when structure is missing
When to use: When you have diverse document types with varying structures. Popular in production RAG systems (used by LangChain, LlamaIndex).
Chunk Overlap: Preserving Context at Boundaries
Why Overlap is Essential
When you split a document into chunks, you create boundaries between chunks. These boundaries are artificial—they don't exist in the original document. This creates a critical problem:
The Boundary Problem: Important information often appears at the edges of chunks. Consider this example:
Example: The Boundary Problem
Original Document:
"Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance. Regularization techniques help prevent overfitting."
Without Overlap (Bad):
- Chunk 1: "Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance."
- Chunk 2: "Regularization techniques help prevent overfitting."
❌ If a query asks about "hyperparameters and regularization," Chunk 1 might be retrieved (mentions hyperparameters) but Chunk 2 (mentions regularization) might not be, even though they're related concepts.
With Overlap (Good):
- Chunk 1: "Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance."
- Chunk 2: "Hyperparameters like learning rate and batch size significantly impact model performance. Regularization techniques help prevent overfitting."
✅ Now Chunk 2 contains information about hyperparameters AND regularization, so a query spanning both concepts retrieves a single coherent chunk.
How Overlap Works
Overlap means that consecutive chunks share some content at their boundaries. For example, with 20% overlap:
- If chunk size is 500 tokens, overlap is 100 tokens
- Chunk 1: tokens 1-500
- Chunk 2: tokens 401-900 (starts at token 401, overlapping the last 100 tokens of Chunk 1)
- Chunk 3: tokens 801-1300 (overlaps with Chunk 2)
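This index arithmetic is easy to get wrong by one; the small sketch below (the function name and layout are illustrative assumptions) makes it concrete:

```python
def chunk_spans(doc_length, chunk_size=500, overlap=100):
    """Yield (start, end) token positions for each chunk, 1-indexed as above."""
    step = chunk_size - overlap  # each chunk advances by the effective size
    start = 1
    while start <= doc_length:
        yield start, min(start + chunk_size - 1, doc_length)
        start += step

for span in chunk_spans(1300):
    print(span)  # (1, 500), (401, 900), (801, 1300), (1201, 1300)
```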
Choosing the Right Overlap
Typical overlap: 10-20% of chunk size is standard. Here's why:
- Too little overlap (0-5%): ❌ Doesn't solve the boundary problem. Important context at boundaries is still lost. Not recommended.
- Moderate overlap (10-20%): ✅ Good balance. Preserves context without excessive storage overhead. This is the sweet spot for most use cases.
- High overlap (30-50%): ⚠️ Better context preservation but significantly increases storage costs. Use only when context preservation is critical (e.g., legal documents, medical records).
- Very high overlap (50%+): ❌ Wasteful. You're essentially storing the document twice. Rarely justified.
Trade-offs
Benefits of overlap:
- ✅ Preserves context at boundaries
- ✅ Improves retrieval quality (related concepts stay together)
- ✅ Reduces risk of missing relevant information
- ✅ Better for queries that span multiple topics
Costs of overlap:
- ❌ Increased storage: More chunks = more embeddings to store
- ❌ Higher embedding costs: More chunks to embed (if using paid APIs)
- ❌ Potential redundancy: Same information retrieved multiple times (though this is usually acceptable)
Best Practices
- Start with 10-20% overlap: This works well for most documents
- Increase for critical documents: Use 20-30% for legal, medical, or financial documents where context is crucial
- Decrease for uniform content: Use 5-10% for structured data or lists where boundaries are less important
- Test and measure: Evaluate retrieval quality with different overlap percentages on your specific documents
Mathematical Formulations
Chunking Mathematics Overview
Chunking involves several mathematical considerations: calculating the number of chunks needed, determining overlap, and understanding storage efficiency. These formulas help you make informed decisions about chunk size and overlap for optimal RAG performance.
1. Chunk Count Calculation
What This Formula Calculates:
\[
\text{num\_chunks} = \left\lceil \frac{\text{document\_length}}{\text{chunk\_size} - \text{overlap}} \right\rceil
\]
This formula determines how many chunks you'll get when splitting a document of a given length into chunks of a specified size with overlap. The ceiling function (\(\lceil \cdot \rceil\)) ensures you round up to include any partial chunk at the end.
Breaking It Down:
- \(\text{document\_length}\): Total length of the document measured in characters or tokens (e.g., 10,000 characters or 2,500 tokens)
- \(\text{chunk\_size}\): Desired size of each chunk (e.g., 500 characters or 200 tokens)
- \(\text{overlap}\): Number of characters/tokens that consecutive chunks share at their boundaries (e.g., 100 characters or 20 tokens)
- \(\text{chunk\_size} - \text{overlap}\): Effective chunk size - the amount of new content each chunk adds (e.g., 500 - 100 = 400 characters of new content per chunk)
- \(\left\lceil \ldots \right\rceil\): Ceiling function - rounds up to the nearest integer (ensures partial chunks are counted)
Why Subtract Overlap?
If chunks overlap by 100 characters, then each chunk after the first only adds 400 new characters (500 - 100 = 400). The overlap is shared between chunks, so it doesn't count as "new" content for the chunk count calculation.
Example:
Document: 5,000 characters
Chunk size: 500 characters
Overlap: 100 characters
Effective chunk size: 500 - 100 = 400 characters
Number of chunks: \(\lceil 5000 / 400 \rceil = \lceil 12.5 \rceil = 13\) chunks
Chunk distribution:
- Chunk 1: characters 1-500
- Chunk 2: characters 401-900 (overlaps 401-500 with Chunk 1)
- Chunk 3: characters 801-1300 (overlaps 801-900 with Chunk 2)
- ... and so on
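In code, the formula is a one-liner with math.ceil; this small sketch mirrors the example above:

```python
import math

def num_chunks(document_length, chunk_size, overlap):
    """ceil(document_length / (chunk_size - overlap))"""
    effective_size = chunk_size - overlap  # new content per chunk
    return math.ceil(document_length / effective_size)

print(num_chunks(5000, 500, 100))  # 13, matching the example above
```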
2. Overlap Percentage
What This Formula Measures:
\[
\text{overlap\_percentage} = \frac{\text{overlap}}{\text{chunk\_size}} \times 100\%
\]
This formula calculates what percentage of each chunk is shared with the next chunk. It helps you understand the trade-off between context preservation and storage efficiency.
Breaking It Down:
- \(\text{overlap}\): Number of overlapping characters/tokens between consecutive chunks
- \(\text{chunk\_size}\): Total size of each chunk
- \(\frac{\text{overlap}}{\text{chunk\_size}}\): Fraction of chunk that overlaps (e.g., 100/500 = 0.2 = 20%)
- \(\times 100\%\): Converts fraction to percentage
Typical Values:
- 10-20%: Standard overlap, good balance between context preservation and storage efficiency
- 5-10%: Low overlap, minimal storage overhead but less context preservation
- 20-30%: High overlap, better context preservation but significant storage increase
- 30%+: Very high overlap, rarely justified except for critical documents
Example:
Chunk size: 500 characters
Overlap: 100 characters
Overlap percentage: \(\frac{100}{500} \times 100\% = 20\%\)
This means 20% of each chunk (100 out of 500 characters) is shared with the next chunk, ensuring context is preserved at boundaries.
3. Storage Efficiency Ratio
What This Formula Measures:
\[
\text{storage\_ratio} = \frac{\text{total\_chunks} \times \text{chunk\_size}}{\text{document\_length}}
\]
This formula calculates how much storage space is needed for chunks compared to the original document. With overlap, you store more data than the original document (ratio > 1), but this preserves context at boundaries.
Breaking It Down:
- \(\text{total\_chunks}\): Number of chunks created from the document
- \(\text{chunk\_size}\): Size of each chunk (characters or tokens)
- \(\text{total\_chunks} \times \text{chunk\_size}\): Total storage needed for all chunks (includes overlap)
- \(\text{document\_length}\): Original document size
- Ratio: How many times more storage is needed compared to original
Interpreting the Ratio:
- Ratio = 1.0: No overlap, storage equals original document size (rare in practice)
- Ratio = 1.1-1.3: Moderate overlap (10-20%), typical for most RAG systems
- Ratio = 1.3-1.5: High overlap (20-30%), better context but more storage
- Ratio > 1.5: Very high overlap, usually not justified
Example:
Document: 10,000 characters
Chunk size: 500 characters
Overlap: 100 characters (20%)
Number of chunks: \(\lceil 10000 / (500-100) \rceil = \lceil 25 \rceil = 25\) chunks
Storage ratio: \(\frac{25 \times 500}{10000} = \frac{12500}{10000} = 1.25\)
This means you need 25% more storage than the original document, but you preserve context at all chunk boundaries.
Trade-off:
Higher storage ratio = better context preservation but more storage costs and embedding costs. Lower storage ratio = less storage but risk of losing context at boundaries.
4. Effective Chunk Size (New Content Per Chunk)
What This Represents:
\[
\text{effective\_chunk\_size} = \text{chunk\_size} - \text{overlap}
\]
The effective chunk size is the amount of new content each chunk adds, excluding the overlap that's shared with the previous chunk. This is what actually "advances" you through the document.
Breaking It Down:
- \(\text{chunk\_size}\): Total size of each chunk
- \(\text{overlap}\): Amount shared with previous chunk
- \(\text{chunk\_size} - \text{overlap}\): New content unique to this chunk
Why This Matters:
When calculating how many chunks you need, you use the effective chunk size, not the total chunk size, because overlap doesn't advance you through the document.
Example:
Chunk size: 500 characters
Overlap: 100 characters
Effective chunk size: 500 - 100 = 400 characters
This means each chunk after the first adds 400 new characters of content, while 100 characters are shared with the previous chunk for context.
5. Total Storage with Overlap
What This Calculates:
\[
\text{total\_storage} = \text{total\_chunks} \times \text{chunk\_size}
\]
The total storage space needed to store all chunks, including overlap. This helps you estimate storage costs and embedding API costs.
Breaking It Down:
- First, calculate number of chunks using the chunk count formula
- Then multiply by chunk size to get total storage
- Result includes all overlap, so it's larger than the original document
Example:
Document: 20,000 characters
Chunk size: 1000 characters
Overlap: 200 characters (20%)
Number of chunks: \(\lceil 20000 / (1000-200) \rceil = \lceil 25 \rceil = 25\) chunks
Total storage: \(25 \times 1000 = 25,000\) characters
Original document: 20,000 characters
Storage overhead: 25,000 - 20,000 = 5,000 characters (25% increase)
Cost Implications:
If you're using a paid embedding API (e.g., OpenAI), you pay per token/character embedded. With 25% storage overhead, you pay 25% more for embeddings. This is usually worth it for better retrieval quality, but it's important to be aware of the cost.
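To tie the formulas together, here is a small sketch computing all of them for the example above (the cost_per_char parameter is a placeholder assumption; substitute your embedding provider's actual pricing):

```python
import math

def chunking_stats(document_length, chunk_size, overlap, cost_per_char=0.0):
    chunks = math.ceil(document_length / (chunk_size - overlap))  # formula 1
    total_storage = chunks * chunk_size
    return {
        'num_chunks': chunks,
        'overlap_pct': overlap / chunk_size * 100,         # formula 2
        'storage_ratio': total_storage / document_length,  # formula 3
        'total_storage': total_storage,                    # formula 5
        'embedding_cost': total_storage * cost_per_char,   # hypothetical cost model
    }

print(chunking_stats(20_000, 1_000, 200))
# {'num_chunks': 25, 'overlap_pct': 20.0, 'storage_ratio': 1.25,
#  'total_storage': 25000, 'embedding_cost': 0.0}
```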
Detailed Examples
Example 1: Fixed-Size Chunking - Detailed Walkthrough
Scenario: You have a long technical document that needs to be chunked for RAG.
Original Document (~245 characters):
"Machine learning is a subset of artificial intelligence. It enables computers to learn from data without explicit programming. Deep learning uses neural networks with multiple layers. Natural language processing helps computers understand text."
Parameters:
- Chunk size: ~105 characters (the chunks below are snapped to word boundaries for readability, so sizes and positions are approximate)
- Overlap: ~20 characters (≈20% overlap)
- Effective chunk size: ~85 new characters per chunk, giving \(\lceil 245 / 85 \rceil = 3\) chunks
Chunking Process:
Chunk 1 (≈ characters 1-105):
"Machine learning is a subset of artificial intelligence. It enables computers to learn from data without"
Chunk 2 (≈ characters 88-191, overlapping "from data without" with Chunk 1):
"from data without explicit programming. Deep learning uses neural networks with multiple layers. Natural"
Chunk 3 (≈ characters 168-244, overlapping "multiple layers. Natural" with Chunk 2):
"multiple layers. Natural language processing helps computers understand text."
Analysis:
- Total chunks: 3
- Overlap preserved: "from data without" appears in both Chunk 1 and Chunk 2
- Overlap preserved: "multiple layers. Natural" appears in both Chunk 2 and Chunk 3
- ✅ Context at boundaries is preserved
Example 2: Sentence-Based Chunking - Preserving Semantic Coherence
Scenario: A narrative document where sentence boundaries matter for meaning.
Original Document:
"Python is a versatile programming language. It is widely used in data science and machine learning. Many libraries like NumPy and Pandas make Python powerful for data analysis. Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Parameters:
- Chunk size: 3 sentences
- Overlap: 1 sentence
Sentence Identification:
- "Python is a versatile programming language."
- "It is widely used in data science and machine learning."
- "Many libraries like NumPy and Pandas make Python powerful for data analysis."
- "Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Chunking Result:
Chunk 1 (sentences 1-3):
"Python is a versatile programming language. It is widely used in data science and machine learning. Many libraries like NumPy and Pandas make Python powerful for data analysis."
Chunk 2 (sentences 3-4, overlaps sentence 3):
"Many libraries like NumPy and Pandas make Python powerful for data analysis. Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Advantages:
- ✅ Sentences remain intact (no mid-sentence breaks)
- ✅ Better semantic coherence (related sentences stay together)
- ✅ Overlap preserves context (sentence 3 appears in both chunks)
Example 3: Semantic Chunking - Topic-Based Boundaries
Scenario: A document with clear topic transitions that semantic chunking can identify.
Original Document:
"Machine learning algorithms learn patterns from data. Supervised learning uses labeled examples. Unsupervised learning finds patterns without labels. Reinforcement learning learns through trial and error. Neural networks are inspired by the brain. They consist of interconnected nodes called neurons. Deep learning uses networks with many layers."
Semantic Chunking Process:
Step 1: Embed each sentence
- Sentence 1: [0.45, -0.23, 0.67, ...] (about ML algorithms)
- Sentence 2: [0.48, -0.25, 0.65, ...] (about supervised learning)
- Sentence 3: [0.46, -0.24, 0.66, ...] (about unsupervised learning)
- Sentence 4: [0.47, -0.26, 0.64, ...] (about reinforcement learning)
- Sentence 5: [0.12, 0.34, -0.21, ...] (about neural networks - different topic!)
- Sentence 6: [0.13, 0.35, -0.20, ...] (about neural networks)
- Sentence 7: [0.14, 0.36, -0.19, ...] (about deep learning)
Step 2: Calculate similarity between consecutive sentences
- Sentences 1-4: High similarity (0.85-0.92) - all about ML learning types
- Sentence 4 to 5: Low similarity (0.35) - topic change from learning types to neural networks
- Sentences 5-7: High similarity (0.88-0.90) - all about neural networks
Step 3: Identify chunk boundaries
- Boundary detected between sentences 4 and 5 (similarity drops below threshold 0.5)
Resulting Chunks:
Chunk 1 (sentences 1-4): "Machine learning algorithms learn patterns from data. Supervised learning uses labeled examples. Unsupervised learning finds patterns without labels. Reinforcement learning learns through trial and error."
Topic: Types of machine learning
Chunk 2 (sentences 5-7): "Neural networks are inspired by the brain. They consist of interconnected nodes called neurons. Deep learning uses networks with many layers."
Topic: Neural networks and deep learning
Key Advantage: Semantic chunking automatically identified the topic boundary, creating chunks that represent complete concepts rather than arbitrary text splits.
Example 4: Recursive Chunking - Handling Nested Structures
Scenario: A structured document with chapters, sections, and paragraphs.
Document Structure:
Chapter 1: Introduction
Section 1.1: Overview (500 words)
Section 1.2: History (800 words)
Chapter 2: Methods
Section 2.1: Approach A (300 words)
Section 2.2: Approach B (400 words)
Recursive Chunking Process:
Level 1: Try chapters
- Chapter 1: 1,300 words (too large for 500-word chunks)
- Chapter 2: 700 words (too large)
- → Move to next level
Level 2: Try sections
- Section 1.1: 500 words (perfect size!) → Chunk 1
- Section 1.2: 800 words (too large) → Move to next level
- Section 2.1: 300 words (good size) → Chunk 2
- Section 2.2: 400 words (good size) → Chunk 3
Level 3: Try paragraphs (for Section 1.2)
- Paragraph 1: 200 words → Chunk 4
- Paragraph 2: 250 words → Chunk 5
- Paragraph 3: 350 words → Chunk 6
Final Result:
- 6 chunks total, each respecting document structure
- Chunks maintain hierarchy (metadata links chunks to their parent sections/chapters)
- ✅ Structure is preserved while meeting size constraints
Example 5: Impact of Chunk Size on Retrieval
Scenario: Same document chunked with different sizes to show how chunk size affects retrieval precision.
Document: "Machine learning uses algorithms. Supervised learning requires labeled data. Unsupervised learning finds patterns. Neural networks have multiple layers. Deep learning uses many layers."
Query: "What is supervised learning?"
Small Chunks (~50 characters each):
- Chunk 1: "Machine learning uses algorithms. Supervised learning"
- Chunk 2: "Supervised learning requires labeled data. Unsupervised"
- Chunk 3: "Unsupervised learning finds patterns. Neural networks"
- Chunk 4: "Neural networks have multiple layers. Deep learning"
- Chunk 5: "Deep learning uses many layers."
Retrieval: Chunk 2 is retrieved (contains "supervised learning" and "labeled data")
Precision: High - chunk is highly relevant
Context: Limited - only mentions supervised learning briefly
Large Chunks (~150 characters each):
- Chunk 1: "Machine learning uses algorithms. Supervised learning requires labeled data. Unsupervised learning finds patterns."
- Chunk 2: "Neural networks have multiple layers. Deep learning uses many layers."
Retrieval: Chunk 1 is retrieved (contains supervised learning)
Precision: Lower - chunk also contains unrelated info (unsupervised learning, algorithms)
Context: Rich - includes related concepts
Trade-off: Smaller chunks = more precise retrieval but less context. Larger chunks = more context but less precise retrieval. Choose based on your needs.
Implementation
Implementation Overview
This section provides practical Python code examples for implementing document processing and chunking in RAG systems. The examples use popular libraries like LangChain and NLTK for text processing, and demonstrate different chunking strategies with real-world scenarios.
1. Fixed-Size Chunking Implementation
What this does: Splits documents into chunks of fixed size (characters or tokens) with configurable overlap. Simple and fast, good for uniform content.
```python
class FixedSizeChunker:
    """
    Fixed-size chunking implementation with overlap support.

    Splits documents into equal-sized chunks, preserving overlap
    at boundaries to maintain context.
    """

    def __init__(self, chunk_size=500, chunk_overlap=100):
        """
        Initialize chunker with size and overlap parameters.

        Args:
            chunk_size: Size of each chunk in characters
            chunk_overlap: Number of overlapping characters between chunks
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.effective_size = chunk_size - chunk_overlap

    def chunk(self, text):
        """
        Split text into fixed-size chunks with overlap.

        Args:
            text: Input text to chunk

        Returns:
            List of chunk strings
        """
        if len(text) <= self.chunk_size:
            return [text]  # Text fits in one chunk

        chunks = []
        start = 0
        while start < len(text):
            # Extract a chunk_size window starting at `start`
            end = start + self.chunk_size
            chunks.append(text[start:end])
            # Advance by the effective size (chunk_size - overlap)
            start += self.effective_size
        return chunks

    def chunk_with_metadata(self, text, metadata=None):
        """
        Chunk text and preserve metadata for each chunk.

        Args:
            text: Input text
            metadata: Dictionary of metadata (e.g., {'title': 'Doc1', 'author': 'John'})

        Returns:
            List of dictionaries with 'text' and 'metadata' keys
        """
        chunks = self.chunk(text)
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **(metadata or {}),
                'chunk_index': i,
                'total_chunks': len(chunks)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

# Example usage
chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=20)
document = ("Machine learning is a subset of artificial intelligence. "
            "It enables computers to learn from data without explicit programming. "
            "Deep learning uses neural networks with multiple layers.")
chunks = chunker.chunk(document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars): {chunk}")

# Output: the ~183-character document yields 3 chunks of at most 100 characters.
# Each chunk after the first repeats the final 20 characters of the previous
# chunk, so text at every boundary appears in two consecutive chunks.
```
Key Points:
- Simple implementation: Easy to understand and implement
- Overlap handling: Each chunk after the first starts `chunk_size - overlap` characters into the previous chunk
- Metadata preservation: Can attach metadata to each chunk for filtering and organization
- Use case: Good for uniform text where structure doesn't matter
2. Sentence-Based Chunking Implementation
What this does: Splits documents at sentence boundaries, grouping multiple sentences into chunks. Preserves sentence integrity and improves semantic coherence.
```python
import nltk
from nltk.tokenize import sent_tokenize

# Download required NLTK data (run once)
# nltk.download('punkt')

class SentenceBasedChunker:
    """
    Sentence-based chunking that respects sentence boundaries.

    Groups sentences into chunks of approximately target size,
    ensuring no sentence is split across chunks.
    """

    def __init__(self, chunk_size=500, chunk_overlap=100):
        """
        Initialize sentence-based chunker.

        Args:
            chunk_size: Target chunk size in characters
            chunk_overlap: Overlap budget in characters (applied at sentence boundaries)
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, text):
        """
        Split text into sentence-based chunks.

        Args:
            text: Input text to chunk

        Returns:
            List of chunk strings
        """
        sentences = sent_tokenize(text)
        if not sentences:
            return [text]

        chunks = []
        current_chunk = []
        current_size = 0
        for sentence in sentences:
            sentence_size = len(sentence)
            # If adding this sentence would exceed chunk size, finalize current chunk
            if current_size + sentence_size > self.chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))
                # Start new chunk with overlap (whole sentences from the end)
                overlap_sentences = self._get_overlap_sentences(current_chunk)
                current_chunk = overlap_sentences + [sentence]
                current_size = sum(len(s) for s in current_chunk)
            else:
                current_chunk.append(sentence)
                current_size += sentence_size

        # Add final chunk
        if current_chunk:
            chunks.append(' '.join(current_chunk))
        return chunks

    def _get_overlap_sentences(self, sentences):
        """
        Collect whole sentences from the end of the current chunk until
        the overlap budget (in characters) is exhausted.
        """
        overlap_sentences = []
        overlap_size = 0
        for sentence in reversed(sentences):
            if overlap_size + len(sentence) <= self.chunk_overlap:
                overlap_sentences.insert(0, sentence)
                overlap_size += len(sentence)
            else:
                break
        return overlap_sentences

# Example usage
chunker = SentenceBasedChunker(chunk_size=150, chunk_overlap=30)
document = ("Python is a versatile programming language. "
            "It is widely used in data science and machine learning. "
            "Many libraries like NumPy and Pandas make Python powerful. "
            "Machine learning frameworks such as scikit-learn are built on Python.")
chunks = chunker.chunk(document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(chunk)

# Output:
# Number of chunks: 2
# Chunk 1: Python is a versatile programming language. It is widely used in data science and machine learning.
# Chunk 2: Many libraries like NumPy and Pandas make Python powerful. Machine learning frameworks such as scikit-learn are built on Python.
# Note: overlap is applied only in whole sentences. Here no sentence fits the
# 30-character overlap budget, so these chunks share none; raising chunk_overlap
# to ~60 would carry the second sentence into Chunk 2 as overlap.
```
Key Points:
- Sentence preservation: Never breaks sentences, maintaining grammatical integrity
- Smart overlap: Overlap is applied at sentence boundaries, not arbitrary character positions
- Better semantics: Related sentences stay together, improving embedding quality
- Use case: Ideal for narrative text, articles, and prose where sentence structure matters
3. Recursive Chunking Implementation (LangChain)
What this does: Tries multiple chunking strategies in hierarchy (paragraphs → sentences → characters) until chunks fit size requirements. Handles nested document structures intelligently.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

class DocumentChunker:
    """
    Recursive chunking that tries multiple separators in order.

    This is the most robust chunking approach, handling various
    document structures automatically.
    """

    def __init__(self, chunk_size=1000, chunk_overlap=200):
        """
        Initialize recursive chunker.

        Args:
            chunk_size: Target chunk size in characters
            chunk_overlap: Overlap between chunks
        """
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            # Try these separators in order (most to least preferred)
            separators=[
                "\n\n",  # Paragraphs (double newline)
                "\n",    # Lines (single newline)
                ". ",    # Sentences (period + space)
                " ",     # Words (spaces)
                ""       # Characters (fallback)
            ]
        )

    def chunk_document(self, text, metadata=None):
        """
        Chunk document with metadata preservation.

        Args:
            text: Document text
            metadata: Optional metadata dictionary

        Returns:
            List of chunk dictionaries with text and metadata
        """
        chunks = self.splitter.split_text(text)
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **(metadata or {}),
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk_text)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

    def chunk_multiple_documents(self, documents):
        """
        Chunk multiple documents, preserving document-level metadata.

        Args:
            documents: List of dicts with 'text' and optional 'metadata'

        Returns:
            List of all chunks with preserved metadata
        """
        all_chunks = []
        for doc_idx, doc in enumerate(documents):
            text = doc.get('text', '')
            metadata = doc.get('metadata', {})
            metadata['document_index'] = doc_idx
            all_chunks.extend(self.chunk_document(text, metadata))
        return all_chunks

# Example usage
chunker = DocumentChunker(chunk_size=200, chunk_overlap=50)

# Example: structured document with blank lines between paragraphs
document = """Introduction
Machine learning is transforming industries. It enables computers to learn from data.

Methods
We use neural networks for pattern recognition. Deep learning models achieve state-of-the-art results.

Conclusion
The future of AI looks promising. Machine learning will continue to evolve."""

chunks = chunker.chunk_document(
    document,
    metadata={'title': 'ML Overview', 'author': 'John Doe', 'date': '2024-01-15'}
)
print(f"Number of chunks: {len(chunks)}")
for chunk in chunks:
    print(f"\nChunk {chunk['metadata']['chunk_index'] + 1}:")
    print(f"Text: {chunk['text'][:100]}...")
    print(f"Metadata: {chunk['metadata']}")

# The recursive splitter will:
# 1. Try splitting by "\n\n" (paragraphs) - succeeds for this document
# 2. If paragraphs are too large, try "\n" (lines)
# 3. If lines are too large, try ". " (sentences)
# 4. And so on...
```
Key Points:
- Adaptive: Automatically chooses the best separator based on document structure
- Hierarchical: Respects document hierarchy (paragraphs → sentences → words)
- Robust: Handles various document formats (markdown, plain text, structured)
- Metadata preservation: Maintains document-level and chunk-level metadata
- Use case: Best for diverse document types or when you're unsure of document structure
4. Complete Document Processing Pipeline
What this does: A complete implementation that loads documents from various formats, processes them, chunks them, and prepares them for embedding and storage in a vector database.
```python
import re
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader, UnstructuredHTMLLoader

class DocumentProcessor:
    """
    Complete document processing pipeline for RAG systems.

    Handles loading, cleaning, chunking, and metadata extraction
    from various document formats.
    """

    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len
        )

    def load_document(self, file_path):
        """
        Load document from file based on extension.

        Args:
            file_path: Path to document file

        Returns:
            Tuple of (document text, metadata dict)
        """
        file_path = Path(file_path)
        extension = file_path.suffix.lower()

        # Extract base metadata
        metadata = {
            'source': str(file_path),
            'filename': file_path.name,
            'file_type': extension
        }

        # Load based on file type
        if extension == '.pdf':
            loader = PyPDFLoader(str(file_path))
            pages = loader.load()
            text = '\n\n'.join(page.page_content for page in pages)
            metadata['page_count'] = len(pages)
        elif extension in ['.txt', '.md']:
            loader = TextLoader(str(file_path), encoding='utf-8')
            text = loader.load()[0].page_content
        elif extension in ['.html', '.htm']:
            loader = UnstructuredHTMLLoader(str(file_path))
            text = loader.load()[0].page_content
        else:
            raise ValueError(f"Unsupported file type: {extension}")

        return text, metadata

    def process_document(self, file_path):
        """
        Complete processing: load, clean, chunk, and prepare for embedding.

        Args:
            file_path: Path to document file

        Returns:
            List of chunk dictionaries ready for embedding
        """
        # Step 1: Load document
        text, doc_metadata = self.load_document(file_path)
        # Step 2: Clean text (remove excessive whitespace, normalize)
        text = self._clean_text(text)
        # Step 3: Chunk document
        chunks = self.splitter.split_text(text)
        # Step 4: Create chunk documents with metadata
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **doc_metadata,
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk_text)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

    def _clean_text(self, text):
        """Collapse runs of whitespace and trim the text."""
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def process_directory(self, directory_path, file_patterns=None):
        """
        Process all documents in a directory.

        Args:
            directory_path: Path to directory containing documents
            file_patterns: List of glob patterns to include (e.g., ['*.pdf', '*.txt'])

        Returns:
            List of all chunks from all documents
        """
        directory = Path(directory_path)
        all_chunks = []

        # Default patterns
        if file_patterns is None:
            file_patterns = ['*.pdf', '*.txt', '*.md', '*.html']

        # Find all matching files
        files = []
        for pattern in file_patterns:
            files.extend(directory.glob(pattern))

        # Process each file
        for file_path in files:
            try:
                chunks = self.process_document(file_path)
                all_chunks.extend(chunks)
                print(f"Processed {file_path.name}: {len(chunks)} chunks")
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
        return all_chunks

# Example usage
processor = DocumentProcessor(chunk_size=500, chunk_overlap=100)

# Process a single document
chunks = processor.process_document('document.pdf')
print(f"Created {len(chunks)} chunks from document")

# Process an entire directory
all_chunks = processor.process_directory('./documents/', file_patterns=['*.pdf', '*.txt'])
print(f"Total chunks from all documents: {len(all_chunks)}")

# Now chunks are ready for:
# 1. Embedding generation
# 2. Storage in vector database
# 3. Retrieval in RAG system
```
Complete Pipeline Steps:
- Document Loading: Loads from PDF, TXT, MD, HTML formats
- Text Cleaning: Normalizes whitespace and removes artifacts
- Chunking: Splits into manageable chunks with overlap
- Metadata Extraction: Preserves document properties (filename, type, page count)
- Ready for Embedding: Chunks are prepared for vector database storage
Installation Requirements
To run these examples, install the required packages:
```bash
pip install langchain nltk pypdf unstructured
```
For NLTK sentence tokenization, download the punkt tokenizer (run once):
```python
import nltk
nltk.download('punkt')
```
Real-World Applications
Retrieval Strategy Selection
Use dense retrieval when:
- Semantic understanding is important
- Users may phrase queries differently
- Domain-specific terminology
Use sparse retrieval when:
- Exact keyword matching is important
- Speed is critical
- Technical documentation with specific terms
Use hybrid when:
- You want best of both worlds
- High accuracy is required
- Can afford extra computation
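As a sketch of what "best of both worlds" means in practice, hybrid retrieval typically blends normalized dense and sparse scores with a weight (the alpha value and the example scores below are illustrative assumptions):

```python
def hybrid_score(dense_score, sparse_score, alpha=0.5):
    """Blend a dense (semantic) score with a sparse (keyword, e.g. BM25) score.

    alpha=1.0 uses only the dense score; alpha=0.0 only the sparse score.
    Both scores should be normalized to a comparable range before blending.
    """
    return alpha * dense_score + (1 - alpha) * sparse_score

# Example: a chunk that matches keywords strongly but semantics weakly
print(hybrid_score(dense_score=0.42, sparse_score=0.91, alpha=0.5))  # 0.665
```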
Reranking Benefits
When to use reranking:
- Initial retrieval returns many candidates
- Need high precision in top results
- Can afford additional latency
- Quality is more important than speed
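As a sketch, a cross-encoder reranker can re-score the candidates returned by initial retrieval. This assumes the sentence-transformers package; the model name is one commonly published cross-encoder, and `candidates` is a hypothetical list of retrieved chunk texts:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=3):
    """Re-score retrieved chunks with a cross-encoder and keep the best top_k."""
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    # The cross-encoder reads query and chunk together, giving finer-grained scores
    scores = model.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The extra latency comes from running the model once per (query, chunk) pair, which is why reranking is applied only to a short candidate list rather than the whole corpus.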