Chapter 3: Document Processing & Chunking
Preparing Documents for Retrieval
Learning Objectives
- Understand why documents must be processed and chunked before retrieval
- Compare fixed-size, sentence-based, semantic, and recursive chunking strategies
- Apply the formulas for chunk count, overlap percentage, and storage overhead
- Implement chunking pipelines in Python with LangChain and NLTK
- Recognize how chunking choices play out in real-world RAG systems
Document Processing & Chunking
Why Document Processing is Critical in RAG
Before you can retrieve relevant information, you need to prepare your documents. Real-world documents come in many formats (PDFs, Word docs, HTML pages, markdown files, databases) and sizes (from short paragraphs to entire books with thousands of pages). Document processing and chunking is the crucial first step that transforms raw documents into a format that RAG systems can effectively search and retrieve.
The fundamental challenge: Most documents are too large to process as a single unit. LLMs have context window limits (typically 4K-128K tokens), and embedding models work best with text segments of 128-512 tokens. A single research paper might be 10,000+ words, and a technical manual could be hundreds of pages. You can't embed or process these as single units.
What Document Processing & Chunking Involves
- Document Loading: Extract text from various formats (PDF, HTML, Word, etc.) while preserving structure and metadata
- Text Cleaning: Remove formatting artifacts, handle special characters, normalize whitespace
- Chunking: Split large documents into smaller, semantically meaningful pieces (chunks) that fit within context windows
- Metadata Extraction: Capture document properties (title, author, date, section, category) for filtering and organization
- Context Preservation: Maintain relationships between chunks (overlap, hierarchy) so retrieved chunks have sufficient context
Real-World Example:
Imagine you have a 200-page technical manual about machine learning. Without proper processing:
- ❌ The entire document is too large to embed meaningfully (loses semantic precision)
- ❌ Retrieval returns the whole manual even for specific questions (wastes tokens, poor accuracy)
- ❌ LLM struggles to find relevant information in 200 pages of text
With proper chunking:
- ✅ Manual is split into 500 focused chunks (one per section/topic)
- ✅ Each chunk is embedded separately and stored in vector database
- ✅ Query about "gradient descent" retrieves only the 2-3 most relevant chunks
- ✅ LLM receives focused, relevant context and generates accurate answers
Key Concepts You'll Learn
- Chunking Strategies: Fixed-size, sentence-based, semantic, and recursive chunking - when to use each and why
- Chunk Overlap: Why overlapping chunks preserve context at boundaries and how to choose the right overlap percentage
- Optimal Chunk Size: Balancing retrieval precision (smaller chunks) with context completeness (larger chunks)
- Document Type Handling: Processing PDFs, HTML, markdown, and structured documents with appropriate parsers
- Metadata Management: Extracting and storing document properties for filtering and organization
- Context Preservation: Techniques to maintain semantic relationships between chunks
Why this matters: Poor chunking leads to poor retrieval. If chunks are too large, retrieval is imprecise. If chunks are too small, context is lost. If chunks break at the wrong boundaries, semantic meaning is destroyed. Getting chunking right is foundational to RAG system performance.
Key Concepts
Why Chunking is Critical for RAG Systems
The Core Problem
Raw documents in real-world RAG systems are often extremely long—think research papers (10,000+ words), legal documents (hundreds of pages), technical documentation (thousands of sections), or entire books. These documents present several fundamental challenges:
- Context Window Limits: Most LLMs have fixed context windows (e.g., GPT-4: 8K-128K tokens, Claude: 100K-200K tokens). A single large document can easily exceed these limits, making it impossible to process the entire document at once.
- Embedding Model Constraints: Embedding models like SentenceTransformers work best with text segments of 128-512 tokens. Very long documents produce embeddings that lose semantic precision—the model struggles to capture the meaning of a 10,000-word document in a single 384-dimensional vector.
- Retrieval Precision: When you retrieve a 50-page document for a specific question, most of that document is irrelevant. The LLM has to sift through thousands of words to find the answer, leading to poor performance and high costs.
- Computational Efficiency: Processing entire documents is computationally expensive. Smaller chunks allow for faster embedding generation, more efficient storage, and quicker retrieval.
The Solution: Intelligent Chunking
Chunking splits large documents into smaller, manageable pieces that:
- Fit within context limits: Each chunk is small enough to fit comfortably in the LLM's context window, even when combined with the query and other chunks.
- Are semantically meaningful: Each chunk represents a coherent unit of information (a paragraph, a section, a concept) rather than arbitrary text splits.
- Can be retrieved independently: Each chunk can be embedded and stored separately, allowing the retrieval system to find the most relevant chunk(s) for a specific query.
- Maintain context when possible: Chunks preserve surrounding context (through overlap or metadata) so the LLM understands the broader context when generating answers.
Chunking Strategies: Choosing the Right Approach
Different chunking strategies serve different purposes. The choice depends on your document type, use case, and performance requirements.
1. Fixed-Size Chunking
What it is: Splits documents into chunks of a fixed size (measured in characters or tokens), regardless of content structure.
How it works:
- Divide the document into equal-sized segments (e.g., 500 characters or 200 tokens)
- Each chunk has exactly the same size (except possibly the last chunk)
- No consideration for sentence boundaries, paragraphs, or semantic meaning
Advantages:
- ✅ Simple and fast: Very easy to implement, no complex logic needed
- ✅ Predictable: You know exactly how many chunks you'll get for any document size
- ✅ Efficient storage: Uniform chunk sizes make storage and indexing straightforward
- ✅ Good for uniform content: Works well when documents have consistent structure
Disadvantages:
- ❌ May break sentences: A chunk might end mid-sentence, losing meaning
- ❌ Ignores semantic boundaries: A single concept might be split across two chunks
- ❌ Poor for structured content: Doesn't respect paragraphs, sections, or logical divisions
When to use: When you have uniform, unstructured text where semantic boundaries don't matter much, or when you need maximum speed and simplicity.
2. Sentence-Based Chunking
What it is: Splits documents at sentence boundaries, grouping multiple sentences into chunks of roughly equal size.
How it works:
- Identify sentence boundaries using NLP tools (NLTK, spaCy, or regex)
- Group sentences together until reaching a target chunk size (e.g., 5-10 sentences or 200-500 tokens)
- Each chunk contains complete sentences, never breaking mid-sentence
Advantages:
- ✅ Preserves sentence integrity: Sentences remain intact, maintaining grammatical and semantic coherence
- ✅ Better semantic coherence: Related sentences stay together, improving embedding quality
- ✅ Respects natural boundaries: Works with how humans structure information
- ✅ Good for narrative content: Excellent for articles, stories, and prose
Disadvantages:
- ❌ Variable chunk sizes: Chunks may vary significantly in size depending on sentence length
- ❌ May split related concepts: A concept spanning multiple sentences might be split across chunks
- ❌ Requires sentence detection: Needs reliable sentence segmentation (can fail with abbreviations, decimals, etc.)
When to use: For narrative text, articles, blog posts, or any content where sentence boundaries matter. This is often the default choice for general-purpose RAG systems.
3. Semantic Chunking (Advanced)
What it is: Uses embeddings and similarity calculations to group semantically related sentences together, creating chunks based on meaning rather than size.
How it works:
- Embed each sentence (or small group of sentences) into vector space
- Calculate similarity between consecutive sentences
- When similarity drops below a threshold, that's a chunk boundary (new topic/concept)
- Group similar sentences together until reaching a maximum chunk size
Advantages:
- ✅ Most semantically coherent: Chunks represent complete concepts or topics
- ✅ Adaptive to content: Automatically adjusts to document structure
- ✅ Better retrieval quality: Chunks are more likely to be fully relevant or fully irrelevant
- ✅ Respects topic boundaries: Natural breaks occur at topic transitions
Disadvantages:
- ❌ Computationally expensive: Requires embedding every sentence, then calculating similarities
- ❌ More complex to implement: Needs careful tuning of similarity thresholds
- ❌ Variable chunk sizes: Can produce very small or very large chunks
- ❌ Requires embedding model: Needs a good sentence embedding model to work well
When to use: For high-quality RAG systems where retrieval precision matters more than speed. Ideal for technical documentation, research papers, or any content with clear topic boundaries.
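The Implementation section below covers fixed-size, sentence-based, and recursive chunking in code; as a complement, here is a minimal sketch of threshold-based semantic chunking. It assumes the sentence-transformers and NLTK packages are installed, and the model name and 0.5 threshold are illustrative assumptions, not tuned values:

```python
# Minimal semantic chunking sketch (assumes: pip install sentence-transformers nltk)
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

def semantic_chunk(text, similarity_threshold=0.5, model_name='all-MiniLM-L6-v2'):
    """Group consecutive sentences into chunks, breaking where similarity drops."""
    sentences = sent_tokenize(text)
    if len(sentences) <= 1:
        return [text]
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)  # one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with its immediate predecessor
        similarity = float(cos_sim(embeddings[i - 1], embeddings[i]))
        if similarity < similarity_threshold:
            chunks.append(' '.join(current))  # topic change: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(' '.join(current))
    return chunks
```

A production version would also enforce a maximum chunk size and might compare each new sentence against a rolling window of recent sentences rather than only its immediate predecessor.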
4. Recursive Chunking (Hierarchical)
What it is: A hybrid approach that tries multiple chunking strategies in a hierarchy (e.g., try paragraphs first, then sentences, then fixed-size).
How it works:
- First, try to split by paragraphs (if they exist and are reasonable size)
- If paragraphs are too large, split by sentences
- If sentences are still too large, use fixed-size chunking as fallback
- Maintains hierarchy: parent chunks contain metadata about child chunks
Advantages:
- ✅ Adaptive: Automatically chooses the best strategy for each part of the document
- ✅ Respects structure: Uses document structure when available
- ✅ Robust: Falls back gracefully when structure is missing
When to use: When you have diverse document types with varying structures. Popular in production RAG systems (used by LangChain, LlamaIndex).
Chunk Overlap: Preserving Context at Boundaries
Why Overlap is Essential
When you split a document into chunks, you create boundaries between chunks. These boundaries are artificial—they don't exist in the original document. This creates a critical problem:
The Boundary Problem: Important information often appears at the edges of chunks. Consider this example:
Example: The Boundary Problem
Original Document:
"Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance. Regularization techniques help prevent overfitting."
Without Overlap (Bad):
- Chunk 1: "Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance."
- Chunk 2: "Regularization techniques help prevent overfitting."
❌ If a query asks about "hyperparameters and regularization," Chunk 1 might be retrieved (mentions hyperparameters) but Chunk 2 (mentions regularization) might not be, even though they're related concepts.
With Overlap (Good):
- Chunk 1: "Machine learning models require careful tuning. Hyperparameters like learning rate and batch size significantly impact model performance."
- Chunk 2: "Hyperparameters like learning rate and batch size significantly impact model performance. Regularization techniques help prevent overfitting."
✅ Now Chunk 2 contains information about hyperparameters AND regularization, so a query spanning both concepts retrieves a single coherent chunk.
How Overlap Works
Overlap means that consecutive chunks share some content at their boundaries. For example, with 20% overlap:
- If chunk size is 500 tokens, overlap is 100 tokens
- Chunk 1: tokens 1-500
- Chunk 2: tokens 401-900 (starts at token 401, overlapping the last 100 tokens of Chunk 1)
- Chunk 3: tokens 801-1300 (overlaps with Chunk 2)
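This index arithmetic is easy to get wrong by one; the small sketch below (the function name and layout are illustrative assumptions) makes it concrete:

```python
def chunk_spans(doc_length, chunk_size=500, overlap=100):
    """Yield (start, end) token positions for each chunk, 1-indexed as above."""
    step = chunk_size - overlap  # each chunk advances by the effective size
    start = 1
    while start <= doc_length:
        yield start, min(start + chunk_size - 1, doc_length)
        start += step

for span in chunk_spans(1300):
    print(span)  # (1, 500), (401, 900), (801, 1300), (1201, 1300)
```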
Choosing the Right Overlap
Typical overlap: 10-20% of chunk size is standard. Here's why:
- Too little overlap (0-5%): ❌ Doesn't solve the boundary problem. Important context at boundaries is still lost. Not recommended.
- Moderate overlap (10-20%): ✅ Good balance. Preserves context without excessive storage overhead. This is the sweet spot for most use cases.
- High overlap (30-50%): ⚠️ Better context preservation but significantly increases storage costs. Use only when context preservation is critical (e.g., legal documents, medical records).
- Very high overlap (50%+): ❌ Wasteful. You're essentially storing the document twice. Rarely justified.
Trade-offs
Benefits of overlap:
- ✅ Preserves context at boundaries
- ✅ Improves retrieval quality (related concepts stay together)
- ✅ Reduces risk of missing relevant information
- ✅ Better for queries that span multiple topics
Costs of overlap:
- ❌ Increased storage: More chunks = more embeddings to store
- ❌ Higher embedding costs: More chunks to embed (if using paid APIs)
- ❌ Potential redundancy: Same information retrieved multiple times (though this is usually acceptable)
Best Practices
- Start with 10-20% overlap: This works well for most documents
- Increase for critical documents: Use 20-30% for legal, medical, or financial documents where context is crucial
- Decrease for uniform content: Use 5-10% for structured data or lists where boundaries are less important
- Test and measure: Evaluate retrieval quality with different overlap percentages on your specific documents
Mathematical Formulations
Chunking Mathematics Overview
Chunking involves several mathematical considerations: calculating the number of chunks needed, determining overlap, and understanding storage efficiency. These formulas help you make informed decisions about chunk size and overlap for optimal RAG performance.
1. Chunk Count Calculation
What This Formula Calculates:
\[
\text{num\_chunks} = \left\lceil \frac{\text{document\_length}}{\text{chunk\_size} - \text{overlap}} \right\rceil
\]
This formula determines how many chunks you'll get when splitting a document of a given length into chunks of a specified size with overlap. The ceiling function (\(\lceil \cdot \rceil\)) ensures you round up to include any partial chunk at the end.
Breaking It Down:
- \(\text{document\_length}\): Total length of the document measured in characters or tokens (e.g., 10,000 characters or 2,500 tokens)
- \(\text{chunk\_size}\): Desired size of each chunk (e.g., 500 characters or 200 tokens)
- \(\text{overlap}\): Number of characters/tokens that consecutive chunks share at their boundaries (e.g., 100 characters or 20 tokens)
- \(\text{chunk\_size} - \text{overlap}\): Effective chunk size - the amount of new content each chunk adds (e.g., 500 - 100 = 400 characters of new content per chunk)
- \(\left\lceil \ldots \right\rceil\): Ceiling function - rounds up to the nearest integer (ensures partial chunks are counted)
Why Subtract Overlap?
If chunks overlap by 100 characters, then each chunk after the first only adds 400 new characters (500 - 100 = 400). The overlap is shared between chunks, so it doesn't count as "new" content for the chunk count calculation.
Example:
Document: 5,000 characters
Chunk size: 500 characters
Overlap: 100 characters
Effective chunk size: 500 - 100 = 400 characters
Number of chunks: \(\lceil 5000 / 400 \rceil = \lceil 12.5 \rceil = 13\) chunks
Chunk distribution:
- Chunk 1: characters 1-500
- Chunk 2: characters 401-900 (overlaps 401-500 with Chunk 1)
- Chunk 3: characters 801-1300 (overlaps 801-900 with Chunk 2)
- ... and so on
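In code, the formula is a one-liner with math.ceil; this small sketch mirrors the example above:

```python
import math

def num_chunks(document_length, chunk_size, overlap):
    """ceil(document_length / (chunk_size - overlap))"""
    effective_size = chunk_size - overlap  # new content per chunk
    return math.ceil(document_length / effective_size)

print(num_chunks(5000, 500, 100))  # 13, matching the example above
```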
2. Overlap Percentage
What This Formula Measures:
\[
\text{overlap\_percentage} = \frac{\text{overlap}}{\text{chunk\_size}} \times 100\%
\]
This formula calculates what percentage of each chunk is shared with the next chunk. It helps you understand the trade-off between context preservation and storage efficiency.
Breaking It Down:
- \(\text{overlap}\): Number of overlapping characters/tokens between consecutive chunks
- \(\text{chunk\_size}\): Total size of each chunk
- \(\frac{\text{overlap}}{\text{chunk\_size}}\): Fraction of chunk that overlaps (e.g., 100/500 = 0.2 = 20%)
- \(\times 100\%\): Converts fraction to percentage
Typical Values:
- 10-20%: Standard overlap, good balance between context preservation and storage efficiency
- 5-10%: Low overlap, minimal storage overhead but less context preservation
- 20-30%: High overlap, better context preservation but significant storage increase
- 30%+: Very high overlap, rarely justified except for critical documents
Example:
Chunk size: 500 characters
Overlap: 100 characters
Overlap percentage: \(\frac{100}{500} \times 100\% = 20\%\)
This means 20% of each chunk (100 out of 500 characters) is shared with the next chunk, ensuring context is preserved at boundaries.
3. Storage Efficiency Ratio
What This Formula Measures:
\[
\text{storage\_ratio} = \frac{\text{total\_chunks} \times \text{chunk\_size}}{\text{document\_length}}
\]
This formula calculates how much storage space is needed for chunks compared to the original document. With overlap, you store more data than the original document (ratio > 1), but this preserves context at boundaries.
Breaking It Down:
- \(\text{total\_chunks}\): Number of chunks created from the document
- \(\text{chunk\_size}\): Size of each chunk (characters or tokens)
- \(\text{total\_chunks} \times \text{chunk\_size}\): Total storage needed for all chunks (includes overlap)
- \(\text{document\_length}\): Original document size
- Ratio: How many times more storage is needed compared to original
Interpreting the Ratio:
- Ratio = 1.0: No overlap, storage equals original document size (rare in practice)
- Ratio = 1.1-1.3: Moderate overlap (10-20%), typical for most RAG systems
- Ratio = 1.3-1.5: High overlap (20-30%), better context but more storage
- Ratio > 1.5: Very high overlap, usually not justified
Example:
Document: 10,000 characters
Chunk size: 500 characters
Overlap: 100 characters (20%)
Number of chunks: \(\lceil 10000 / (500-100) \rceil = \lceil 25 \rceil = 25\) chunks
Storage ratio: \(\frac{25 \times 500}{10000} = \frac{12500}{10000} = 1.25\)
This means you need 25% more storage than the original document, but you preserve context at all chunk boundaries.
Trade-off:
Higher storage ratio = better context preservation but more storage costs and embedding costs. Lower storage ratio = less storage but risk of losing context at boundaries.
4. Effective Chunk Size (New Content Per Chunk)
What This Represents:
\[
\text{effective\_chunk\_size} = \text{chunk\_size} - \text{overlap}
\]
The effective chunk size is the amount of new content each chunk adds, excluding the overlap that's shared with the previous chunk. This is what actually "advances" you through the document.
Breaking It Down:
- \(\text{chunk\_size}\): Total size of each chunk
- \(\text{overlap}\): Amount shared with previous chunk
- \(\text{chunk\_size} - \text{overlap}\): New content unique to this chunk
Why This Matters:
When calculating how many chunks you need, you use the effective chunk size, not the total chunk size, because overlap doesn't advance you through the document.
Example:
Chunk size: 500 characters
Overlap: 100 characters
Effective chunk size: 500 - 100 = 400 characters
This means each chunk after the first adds 400 new characters of content, while 100 characters are shared with the previous chunk for context.
5. Total Storage with Overlap
What This Calculates:
\[
\text{total\_storage} = \text{total\_chunks} \times \text{chunk\_size}
\]
The total storage space needed to store all chunks, including overlap. This helps you estimate storage costs and embedding API costs.
Breaking It Down:
- First, calculate number of chunks using the chunk count formula
- Then multiply by chunk size to get total storage
- Result includes all overlap, so it's larger than the original document
Example:
Document: 20,000 characters
Chunk size: 1000 characters
Overlap: 200 characters (20%)
Number of chunks: \(\lceil 20000 / (1000-200) \rceil = \lceil 25 \rceil = 25\) chunks
Total storage: \(25 \times 1000 = 25,000\) characters
Original document: 20,000 characters
Storage overhead: 25,000 - 20,000 = 5,000 characters (25% increase)
Cost Implications:
If you're using a paid embedding API (e.g., OpenAI), you pay per token/character embedded. With 25% storage overhead, you pay 25% more for embeddings. This is usually worth it for better retrieval quality, but it's important to be aware of the cost.
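To tie the formulas together, here is a small sketch computing all of them for the example above (the cost_per_char parameter is a placeholder assumption; substitute your embedding provider's actual pricing):

```python
import math

def chunking_stats(document_length, chunk_size, overlap, cost_per_char=0.0):
    chunks = math.ceil(document_length / (chunk_size - overlap))  # formula 1
    total_storage = chunks * chunk_size
    return {
        'num_chunks': chunks,
        'overlap_pct': overlap / chunk_size * 100,         # formula 2
        'storage_ratio': total_storage / document_length,  # formula 3
        'total_storage': total_storage,                    # formula 5
        'embedding_cost': total_storage * cost_per_char,   # hypothetical cost model
    }

print(chunking_stats(20_000, 1_000, 200))
# {'num_chunks': 25, 'overlap_pct': 20.0, 'storage_ratio': 1.25,
#  'total_storage': 25000, 'embedding_cost': 0.0}
```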
Detailed Examples
Example 1: Fixed-Size Chunking - Detailed Walkthrough
Scenario: You have a long technical document that needs to be chunked for RAG.
Original Document (~245 characters):
"Machine learning is a subset of artificial intelligence. It enables computers to learn from data without explicit programming. Deep learning uses neural networks with multiple layers. Natural language processing helps computers understand text."
Parameters:
- Chunk size: ~105 characters (the chunks below are snapped to word boundaries for readability, so sizes and positions are approximate)
- Overlap: ~20 characters (≈20% overlap)
- Effective chunk size: ~85 new characters per chunk, giving \(\lceil 245 / 85 \rceil = 3\) chunks
Chunking Process:
Chunk 1 (≈ characters 1-105):
"Machine learning is a subset of artificial intelligence. It enables computers to learn from data without"
Chunk 2 (≈ characters 88-191, overlapping "from data without" with Chunk 1):
"from data without explicit programming. Deep learning uses neural networks with multiple layers. Natural"
Chunk 3 (≈ characters 168-244, overlapping "multiple layers. Natural" with Chunk 2):
"multiple layers. Natural language processing helps computers understand text."
Analysis:
- Total chunks: 3
- Overlap preserved: "from data without" appears in both Chunk 1 and Chunk 2
- Overlap preserved: "multiple layers. Natural" appears in both Chunk 2 and Chunk 3
- ✅ Context at boundaries is preserved
Example 2: Sentence-Based Chunking - Preserving Semantic Coherence
Scenario: A narrative document where sentence boundaries matter for meaning.
Original Document:
"Python is a versatile programming language. It is widely used in data science and machine learning. Many libraries like NumPy and Pandas make Python powerful for data analysis. Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Parameters:
- Chunk size: 3 sentences
- Overlap: 1 sentence
Sentence Identification:
- "Python is a versatile programming language."
- "It is widely used in data science and machine learning."
- "Many libraries like NumPy and Pandas make Python powerful for data analysis."
- "Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Chunking Result:
Chunk 1 (sentences 1-3):
"Python is a versatile programming language. It is widely used in data science and machine learning. Many libraries like NumPy and Pandas make Python powerful for data analysis."
Chunk 2 (sentences 3-4, overlaps sentence 3):
"Many libraries like NumPy and Pandas make Python powerful for data analysis. Machine learning frameworks such as scikit-learn and TensorFlow are built on Python."
Advantages:
- ✅ Sentences remain intact (no mid-sentence breaks)
- ✅ Better semantic coherence (related sentences stay together)
- ✅ Overlap preserves context (sentence 3 appears in both chunks)
Example 3: Semantic Chunking - Topic-Based Boundaries
Scenario: A document with clear topic transitions that semantic chunking can identify.
Original Document:
"Machine learning algorithms learn patterns from data. Supervised learning uses labeled examples. Unsupervised learning finds patterns without labels. Reinforcement learning learns through trial and error. Neural networks are inspired by the brain. They consist of interconnected nodes called neurons. Deep learning uses networks with many layers."
Semantic Chunking Process:
Step 1: Embed each sentence
- Sentence 1: [0.45, -0.23, 0.67, ...] (about ML algorithms)
- Sentence 2: [0.48, -0.25, 0.65, ...] (about supervised learning)
- Sentence 3: [0.46, -0.24, 0.66, ...] (about unsupervised learning)
- Sentence 4: [0.47, -0.26, 0.64, ...] (about reinforcement learning)
- Sentence 5: [0.12, 0.34, -0.21, ...] (about neural networks - different topic!)
- Sentence 6: [0.13, 0.35, -0.20, ...] (about neural networks)
- Sentence 7: [0.14, 0.36, -0.19, ...] (about deep learning)
Step 2: Calculate similarity between consecutive sentences
- Sentences 1-4: High similarity (0.85-0.92) - all about ML learning types
- Sentence 4 to 5: Low similarity (0.35) - topic change from learning types to neural networks
- Sentences 5-7: High similarity (0.88-0.90) - all about neural networks
Step 3: Identify chunk boundaries
- Boundary detected between sentences 4 and 5 (similarity drops below threshold 0.5)
Resulting Chunks:
Chunk 1 (sentences 1-4): "Machine learning algorithms learn patterns from data. Supervised learning uses labeled examples. Unsupervised learning finds patterns without labels. Reinforcement learning learns through trial and error."
Topic: Types of machine learning
Chunk 2 (sentences 5-7): "Neural networks are inspired by the brain. They consist of interconnected nodes called neurons. Deep learning uses networks with many layers."
Topic: Neural networks and deep learning
Key Advantage: Semantic chunking automatically identified the topic boundary, creating chunks that represent complete concepts rather than arbitrary text splits.
Example 4: Recursive Chunking - Handling Nested Structures
Scenario: A structured document with chapters, sections, and paragraphs.
Document Structure:
Chapter 1: Introduction
Section 1.1: Overview (500 words)
Section 1.2: History (800 words)
Chapter 2: Methods
Section 2.1: Approach A (300 words)
Section 2.2: Approach B (400 words)
Recursive Chunking Process:
Level 1: Try chapters
- Chapter 1: 1,300 words (too large for 500-word chunks)
- Chapter 2: 700 words (too large)
- → Move to next level
Level 2: Try sections
- Section 1.1: 500 words (perfect size!) → Chunk 1
- Section 1.2: 800 words (too large) → Move to next level
- Section 2.1: 300 words (good size) → Chunk 2
- Section 2.2: 400 words (good size) → Chunk 3
Level 3: Try paragraphs (for Section 1.2)
- Paragraph 1: 200 words → Chunk 4
- Paragraph 2: 250 words → Chunk 5
- Paragraph 3: 350 words → Chunk 6
Final Result:
- 6 chunks total, each respecting document structure
- Chunks maintain hierarchy (metadata links chunks to their parent sections/chapters)
- ✅ Structure is preserved while meeting size constraints
Example 5: Impact of Chunk Size on Retrieval
Scenario: Same document chunked with different sizes to show how chunk size affects retrieval precision.
Document: "Machine learning uses algorithms. Supervised learning requires labeled data. Unsupervised learning finds patterns. Neural networks have multiple layers. Deep learning uses many layers."
Query: "What is supervised learning?"
Small Chunks (~50 characters each):
- Chunk 1: "Machine learning uses algorithms. Supervised learning"
- Chunk 2: "Supervised learning requires labeled data. Unsupervised"
- Chunk 3: "Unsupervised learning finds patterns. Neural networks"
- Chunk 4: "Neural networks have multiple layers. Deep learning"
- Chunk 5: "Deep learning uses many layers."
Retrieval: Chunk 2 is retrieved (contains "supervised learning" and "labeled data")
Precision: High - chunk is highly relevant
Context: Limited - only mentions supervised learning briefly
Large Chunks (~150 characters each):
- Chunk 1: "Machine learning uses algorithms. Supervised learning requires labeled data. Unsupervised learning finds patterns."
- Chunk 2: "Neural networks have multiple layers. Deep learning uses many layers."
Retrieval: Chunk 1 is retrieved (contains supervised learning)
Precision: Lower - chunk also contains unrelated info (unsupervised learning, algorithms)
Context: Rich - includes related concepts
Trade-off: Smaller chunks = more precise retrieval but less context. Larger chunks = more context but less precise retrieval. Choose based on your needs.
Implementation
Implementation Overview
This section provides practical Python code examples for implementing document processing and chunking in RAG systems. The examples use popular libraries like LangChain and NLTK for text processing, and demonstrate different chunking strategies with real-world scenarios.
1. Fixed-Size Chunking Implementation
What this does: Splits documents into chunks of fixed size (characters or tokens) with configurable overlap. Simple and fast, good for uniform content.
```python
class FixedSizeChunker:
    """
    Fixed-size chunking implementation with overlap support.

    Splits documents into equal-sized chunks, preserving overlap
    at boundaries to maintain context.
    """

    def __init__(self, chunk_size=500, chunk_overlap=100):
        """
        Initialize chunker with size and overlap parameters.

        Args:
            chunk_size: Size of each chunk in characters
            chunk_overlap: Number of overlapping characters between chunks
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.effective_size = chunk_size - chunk_overlap

    def chunk(self, text):
        """
        Split text into fixed-size chunks with overlap.

        Args:
            text: Input text to chunk

        Returns:
            List of chunk strings
        """
        if len(text) <= self.chunk_size:
            return [text]  # Text fits in one chunk

        chunks = []
        start = 0
        while start < len(text):
            # Extract a chunk_size window starting at `start`
            end = start + self.chunk_size
            chunks.append(text[start:end])
            # Advance by the effective size (chunk_size - overlap)
            start += self.effective_size
        return chunks

    def chunk_with_metadata(self, text, metadata=None):
        """
        Chunk text and preserve metadata for each chunk.

        Args:
            text: Input text
            metadata: Dictionary of metadata (e.g., {'title': 'Doc1', 'author': 'John'})

        Returns:
            List of dictionaries with 'text' and 'metadata' keys
        """
        chunks = self.chunk(text)
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **(metadata or {}),
                'chunk_index': i,
                'total_chunks': len(chunks)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

# Example usage
chunker = FixedSizeChunker(chunk_size=100, chunk_overlap=20)
document = ("Machine learning is a subset of artificial intelligence. "
            "It enables computers to learn from data without explicit programming. "
            "Deep learning uses neural networks with multiple layers.")
chunks = chunker.chunk(document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars): {chunk}")

# Output: the ~183-character document yields 3 chunks of at most 100 characters.
# Each chunk after the first repeats the final 20 characters of the previous
# chunk, so text at every boundary appears in two consecutive chunks.
```
Key Points:
- Simple implementation: Easy to understand and implement
- Overlap handling: Each chunk after the first starts `chunk_size - overlap` characters into the previous chunk
- Metadata preservation: Can attach metadata to each chunk for filtering and organization
- Use case: Good for uniform text where structure doesn't matter
2. Sentence-Based Chunking Implementation
What this does: Splits documents at sentence boundaries, grouping multiple sentences into chunks. Preserves sentence integrity and improves semantic coherence.
```python
import nltk
from nltk.tokenize import sent_tokenize

# Download required NLTK data (run once)
# nltk.download('punkt')

class SentenceBasedChunker:
    """
    Sentence-based chunking that respects sentence boundaries.

    Groups sentences into chunks of approximately target size,
    ensuring no sentence is split across chunks.
    """

    def __init__(self, chunk_size=500, chunk_overlap=100):
        """
        Initialize sentence-based chunker.

        Args:
            chunk_size: Target chunk size in characters
            chunk_overlap: Overlap budget in characters (applied at sentence boundaries)
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def chunk(self, text):
        """
        Split text into sentence-based chunks.

        Args:
            text: Input text to chunk

        Returns:
            List of chunk strings
        """
        sentences = sent_tokenize(text)
        if not sentences:
            return [text]

        chunks = []
        current_chunk = []
        current_size = 0
        for sentence in sentences:
            sentence_size = len(sentence)
            # If adding this sentence would exceed chunk size, finalize current chunk
            if current_size + sentence_size > self.chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))
                # Start new chunk with overlap (whole sentences from the end)
                overlap_sentences = self._get_overlap_sentences(current_chunk)
                current_chunk = overlap_sentences + [sentence]
                current_size = sum(len(s) for s in current_chunk)
            else:
                current_chunk.append(sentence)
                current_size += sentence_size

        # Add final chunk
        if current_chunk:
            chunks.append(' '.join(current_chunk))
        return chunks

    def _get_overlap_sentences(self, sentences):
        """
        Collect whole sentences from the end of the current chunk until
        the overlap budget (in characters) is exhausted.
        """
        overlap_sentences = []
        overlap_size = 0
        for sentence in reversed(sentences):
            if overlap_size + len(sentence) <= self.chunk_overlap:
                overlap_sentences.insert(0, sentence)
                overlap_size += len(sentence)
            else:
                break
        return overlap_sentences

# Example usage
chunker = SentenceBasedChunker(chunk_size=150, chunk_overlap=30)
document = ("Python is a versatile programming language. "
            "It is widely used in data science and machine learning. "
            "Many libraries like NumPy and Pandas make Python powerful. "
            "Machine learning frameworks such as scikit-learn are built on Python.")
chunks = chunker.chunk(document)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(chunk)

# Output:
# Number of chunks: 2
# Chunk 1: Python is a versatile programming language. It is widely used in data science and machine learning.
# Chunk 2: Many libraries like NumPy and Pandas make Python powerful. Machine learning frameworks such as scikit-learn are built on Python.
# Note: overlap is applied only in whole sentences. Here no sentence fits the
# 30-character overlap budget, so these chunks share none; raising chunk_overlap
# to ~60 would carry the second sentence into Chunk 2 as overlap.
```
Key Points:
- Sentence preservation: Never breaks sentences, maintaining grammatical integrity
- Smart overlap: Overlap is applied at sentence boundaries, not arbitrary character positions
- Better semantics: Related sentences stay together, improving embedding quality
- Use case: Ideal for narrative text, articles, and prose where sentence structure matters
3. Recursive Chunking Implementation (LangChain)
What this does: Tries multiple chunking strategies in hierarchy (paragraphs → sentences → characters) until chunks fit size requirements. Handles nested document structures intelligently.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

class DocumentChunker:
    """
    Recursive chunking that tries multiple separators in order.

    This is the most robust chunking approach, handling various
    document structures automatically.
    """

    def __init__(self, chunk_size=1000, chunk_overlap=200):
        """
        Initialize recursive chunker.

        Args:
            chunk_size: Target chunk size in characters
            chunk_overlap: Overlap between chunks
        """
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            # Try these separators in order (most to least preferred)
            separators=[
                "\n\n",  # Paragraphs (double newline)
                "\n",    # Lines (single newline)
                ". ",    # Sentences (period + space)
                " ",     # Words (spaces)
                ""       # Characters (fallback)
            ]
        )

    def chunk_document(self, text, metadata=None):
        """
        Chunk document with metadata preservation.

        Args:
            text: Document text
            metadata: Optional metadata dictionary

        Returns:
            List of chunk dictionaries with text and metadata
        """
        chunks = self.splitter.split_text(text)
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **(metadata or {}),
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk_text)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

    def chunk_multiple_documents(self, documents):
        """
        Chunk multiple documents, preserving document-level metadata.

        Args:
            documents: List of dicts with 'text' and optional 'metadata'

        Returns:
            List of all chunks with preserved metadata
        """
        all_chunks = []
        for doc_idx, doc in enumerate(documents):
            text = doc.get('text', '')
            metadata = doc.get('metadata', {})
            metadata['document_index'] = doc_idx
            all_chunks.extend(self.chunk_document(text, metadata))
        return all_chunks

# Example usage
chunker = DocumentChunker(chunk_size=200, chunk_overlap=50)

# Example: structured document with blank lines between paragraphs
document = """Introduction
Machine learning is transforming industries. It enables computers to learn from data.

Methods
We use neural networks for pattern recognition. Deep learning models achieve state-of-the-art results.

Conclusion
The future of AI looks promising. Machine learning will continue to evolve."""

chunks = chunker.chunk_document(
    document,
    metadata={'title': 'ML Overview', 'author': 'John Doe', 'date': '2024-01-15'}
)
print(f"Number of chunks: {len(chunks)}")
for chunk in chunks:
    print(f"\nChunk {chunk['metadata']['chunk_index'] + 1}:")
    print(f"Text: {chunk['text'][:100]}...")
    print(f"Metadata: {chunk['metadata']}")

# The recursive splitter will:
# 1. Try splitting by "\n\n" (paragraphs) - succeeds for this document
# 2. If paragraphs are too large, try "\n" (lines)
# 3. If lines are too large, try ". " (sentences)
# 4. And so on...
```
Key Points:
- Adaptive: Automatically chooses the best separator based on document structure
- Hierarchical: Respects document hierarchy (paragraphs → sentences → words)
- Robust: Handles various document formats (markdown, plain text, structured)
- Metadata preservation: Maintains document-level and chunk-level metadata
- Use case: Best for diverse document types or when you're unsure of document structure
4. Complete Document Processing Pipeline
What this does: A complete implementation that loads documents from various formats, processes them, chunks them, and prepares them for embedding and storage in a vector database.
```python
import re
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader, UnstructuredHTMLLoader

class DocumentProcessor:
    """
    Complete document processing pipeline for RAG systems.

    Handles loading, cleaning, chunking, and metadata extraction
    from various document formats.
    """

    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len
        )

    def load_document(self, file_path):
        """
        Load document from file based on extension.

        Args:
            file_path: Path to document file

        Returns:
            Tuple of (document text, metadata dict)
        """
        file_path = Path(file_path)
        extension = file_path.suffix.lower()

        # Extract base metadata
        metadata = {
            'source': str(file_path),
            'filename': file_path.name,
            'file_type': extension
        }

        # Load based on file type
        if extension == '.pdf':
            loader = PyPDFLoader(str(file_path))
            pages = loader.load()
            text = '\n\n'.join(page.page_content for page in pages)
            metadata['page_count'] = len(pages)
        elif extension in ['.txt', '.md']:
            loader = TextLoader(str(file_path), encoding='utf-8')
            text = loader.load()[0].page_content
        elif extension in ['.html', '.htm']:
            loader = UnstructuredHTMLLoader(str(file_path))
            text = loader.load()[0].page_content
        else:
            raise ValueError(f"Unsupported file type: {extension}")

        return text, metadata

    def process_document(self, file_path):
        """
        Complete processing: load, clean, chunk, and prepare for embedding.

        Args:
            file_path: Path to document file

        Returns:
            List of chunk dictionaries ready for embedding
        """
        # Step 1: Load document
        text, doc_metadata = self.load_document(file_path)
        # Step 2: Clean text (remove excessive whitespace, normalize)
        text = self._clean_text(text)
        # Step 3: Chunk document
        chunks = self.splitter.split_text(text)
        # Step 4: Create chunk documents with metadata
        chunked_docs = []
        for i, chunk_text in enumerate(chunks):
            chunk_metadata = {
                **doc_metadata,
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_size': len(chunk_text)
            }
            chunked_docs.append({'text': chunk_text, 'metadata': chunk_metadata})
        return chunked_docs

    def _clean_text(self, text):
        """Collapse runs of whitespace and trim the text."""
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def process_directory(self, directory_path, file_patterns=None):
        """
        Process all documents in a directory.

        Args:
            directory_path: Path to directory containing documents
            file_patterns: List of glob patterns to include (e.g., ['*.pdf', '*.txt'])

        Returns:
            List of all chunks from all documents
        """
        directory = Path(directory_path)
        all_chunks = []

        # Default patterns
        if file_patterns is None:
            file_patterns = ['*.pdf', '*.txt', '*.md', '*.html']

        # Find all matching files
        files = []
        for pattern in file_patterns:
            files.extend(directory.glob(pattern))

        # Process each file
        for file_path in files:
            try:
                chunks = self.process_document(file_path)
                all_chunks.extend(chunks)
                print(f"Processed {file_path.name}: {len(chunks)} chunks")
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
        return all_chunks

# Example usage
processor = DocumentProcessor(chunk_size=500, chunk_overlap=100)

# Process a single document
chunks = processor.process_document('document.pdf')
print(f"Created {len(chunks)} chunks from document")

# Process an entire directory
all_chunks = processor.process_directory('./documents/', file_patterns=['*.pdf', '*.txt'])
print(f"Total chunks from all documents: {len(all_chunks)}")

# Now chunks are ready for:
# 1. Embedding generation
# 2. Storage in vector database
# 3. Retrieval in RAG system
```
Complete Pipeline Steps:
- Document Loading: Loads from PDF, TXT, MD, HTML formats
- Text Cleaning: Normalizes whitespace and removes artifacts
- Chunking: Splits into manageable chunks with overlap
- Metadata Extraction: Preserves document properties (filename, type, page count)
- Ready for Embedding: Chunks are prepared for vector database storage
Installation Requirements
To run these examples, install the required packages:
```bash
pip install langchain nltk pypdf unstructured
```
For NLTK sentence tokenization, download the punkt tokenizer (run once):
```python
import nltk
nltk.download('punkt')
```
Real-World Applications
Retrieval Strategy Selection
Use dense retrieval when:
- Semantic understanding is important
- Users may phrase queries differently
- Domain-specific terminology
Use sparse retrieval when:
- Exact keyword matching is important
- Speed is critical
- Technical documentation with specific terms
Use hybrid when:
- You want best of both worlds
- High accuracy is required
- Can afford extra computation
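As a sketch of what "best of both worlds" means in practice, hybrid retrieval typically blends normalized dense and sparse scores with a weight (the alpha value and the example scores below are illustrative assumptions):

```python
def hybrid_score(dense_score, sparse_score, alpha=0.5):
    """Blend a dense (semantic) score with a sparse (keyword, e.g. BM25) score.

    alpha=1.0 uses only the dense score; alpha=0.0 only the sparse score.
    Both scores should be normalized to a comparable range before blending.
    """
    return alpha * dense_score + (1 - alpha) * sparse_score

# Example: a chunk that matches keywords strongly but semantics weakly
print(hybrid_score(dense_score=0.42, sparse_score=0.91, alpha=0.5))  # 0.665
```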
Reranking Benefits
When to use reranking:
- Initial retrieval returns many candidates
- Need high precision in top results
- Can afford additional latency
- Quality is more important than speed
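As a sketch, a cross-encoder reranker can re-score the candidates returned by initial retrieval. This assumes the sentence-transformers package; the model name is one commonly published cross-encoder, and `candidates` is a hypothetical list of retrieved chunk texts:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=3):
    """Re-score retrieved chunks with a cross-encoder and keep the best top_k."""
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    # The cross-encoder reads query and chunk together, giving finer-grained scores
    scores = model.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The extra latency comes from running the model once per (query, chunk) pair, which is why reranking is applied only to a short candidate list rather than the whole corpus.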