Chapter 4: Overfitting and Pruning
Learn how to prevent overfitting and improve generalization using various pruning techniques.
Learning Objectives
- Understand what overfitting is and why it happens
- Learn pre-pruning techniques to prevent overfitting
- Master post-pruning methods for better generalization
- Use cross-validation to evaluate model performance
- Apply pruning techniques in practice
Understanding Overfitting
🎯 Overfitting: When Trees Get Too Smart
Imagine memorizing answers for a test instead of understanding the concepts. You might ace that specific test, but fail on new questions. That's exactly what overfitting is in machine learning!
Overfitting occurs when a decision tree becomes too complex and memorizes the training data instead of learning general patterns. This leads to poor performance on new, unseen data.
Signs of Overfitting
- High Training Accuracy, Low Test Accuracy: The model performs well on training data but poorly on new data (see the sketch after this list)
- Very Deep Trees: Trees with many levels that capture noise
- Many Small Leaves: Each leaf contains very few samples
- Perfect Training Performance: 100% accuracy on training data is often suspicious
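A quick way to check the first sign is to compare training and test accuracy directly. Below is a minimal sketch, assuming scikit-learn and its built-in breast cancer dataset purely for illustration; any labeled dataset would show the same pattern.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any labeled data works the same way.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree keeps splitting until it fits the training data (almost) perfectly.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Training accuracy:", tree.score(X_train, y_train))  # typically close to 1.0
print("Test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower -> a sign of overfitting

A large gap between the two numbers is the classic symptom; the pruning techniques below are different ways of closing it.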
Why Overfitting Happens
📊 Small Dataset
Limited data makes it hard to learn general patterns
🔢 High Dimensionality
Many features relative to samples
🚫 No Stopping Criteria
Tree grows until perfect training fit
📈 Noise in Data
Tree learns from random errors
Pre-pruning
✂️ Pre-pruning: Stopping Growth Early
Pre-pruning prevents overfitting by stopping the tree from growing too complex in the first place. It's like setting rules before you start building.
Pre-pruning Techniques
📏 Maximum Depth
Limit how deep the tree can grow
DecisionTreeClassifier(max_depth=3)
🔀 Minimum Samples Split
Require minimum samples to create a split
DecisionTreeClassifier(min_samples_split=10)
🍃 Minimum Samples Leaf
Require minimum samples in leaf nodes
DecisionTreeClassifier(min_samples_leaf=5)
📈 Minimum Impurity Decrease
Only split if it improves purity significantly
DecisionTreeClassifier(min_impurity_decrease=0.01)
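These constraints can be combined on a single estimator. The sketch below is a minimal example, assuming scikit-learn's built-in iris dataset for illustration; the parameter values are the ones from the cards above and are starting points to tune, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Combine several pre-pruning constraints; each one stops growth earlier.
pruned = DecisionTreeClassifier(
    max_depth=3,                 # no more than 3 levels of splits
    min_samples_split=10,        # need at least 10 samples to consider a split
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=0,
)
pruned.fit(X_train, y_train)

print("Depth:", pruned.get_depth(), "Leaves:", pruned.get_n_leaves())
print("Train:", pruned.score(X_train, y_train), "Test:", pruned.score(X_test, y_test))

In practice these values are usually chosen with cross-validation (covered below) rather than set by hand.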
Post-pruning
🌳 Post-pruning: Trim After Growing
Post-pruning grows the full tree first, then removes branches that don't improve generalization. It's like trimming a bush after it's grown.
Cost Complexity Pruning
Cost complexity pruning (also called weakest link pruning) starts from the fully grown tree and repeatedly removes the branch whose removal increases training error the least per leaf eliminated, trading a little training fit for a much simpler tree.
Cost Complexity Formula
Cost = Error + α × Complexity
Where:
- Error: the tree's total misclassification error on the training data
- α: the complexity parameter; larger values penalize extra leaves more heavily and prune more aggressively (α = 0 keeps the full tree)
- Complexity: the number of leaves in the tree
Pruning removes the subtree whose removal increases this cost the least, producing a sequence of smaller and smaller trees; the best α is then typically chosen by cross-validation.
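In scikit-learn, post-pruning of this kind is exposed through the ccp_alpha parameter, and cost_complexity_pruning_path reports the effective α values for a given training set. Below is a minimal sketch, again assuming the built-in breast cancer dataset for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Compute the sequence of effective alpha values for this training set.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per alpha and watch the tree shrink as alpha grows.
for alpha in path.ccp_alphas[::5]:  # sample every 5th alpha to keep the output short
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test acc={tree.score(X_test, y_test):.3f}")

As α increases, the number of leaves drops; test accuracy usually rises at first (less overfitting) and then falls once the tree becomes too simple.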
Cross-Validation
✅ Cross-Validation: Testing Without Cheating
Cross-validation helps you evaluate your model's performance without using the test set during training. It's like taking multiple practice tests to see how well you really understand the material.
K-Fold Cross-Validation
In k-fold cross-validation, the training data is split into k equal parts (folds). The model is trained on k − 1 folds and evaluated on the remaining fold, rotating until every fold has served as the validation set once; the k scores are then averaged. With 5 folds, every sample is held out for validation exactly once.
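Here is a minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score, comparing an unpruned tree against a pre-pruned one; the iris dataset and the pruning values are assumptions chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each model is trained on 4 folds and scored on the held-out fold, 5 times in total.
full_tree = DecisionTreeClassifier(random_state=0)
pruned_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)

full_scores = cross_val_score(full_tree, X, y, cv=5)
pruned_scores = cross_val_score(pruned_tree, X, y, cv=5)

print("Unpruned mean CV accuracy:", full_scores.mean())
print("Pruned mean CV accuracy:  ", pruned_scores.mean())

Because the test set is never touched during this process, the cross-validation scores give an honest basis for choosing between the two models.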
Interactive Pruning Demo
🧪 Experiment with Pruning
Try different pruning techniques and see how they affect the tree structure and performance!
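If you are reading this outside the interactive page, the sketch below reproduces the spirit of the demo in plain code: it generates a small synthetic dataset, sweeps one pruning knob (max_depth, chosen here only as an example), and reports how tree size and accuracy change.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# "Generate Data": a synthetic classification problem with some label noise.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep one pruning knob and watch structure and performance change.
for depth in [None, 10, 5, 3, 2]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth!s:>4}  leaves={tree.get_n_leaves():3d}  "
          f"train={tree.score(X_train, y_train):.2f}  test={tree.score(X_test, y_test):.2f}")

Swapping max_depth for min_samples_leaf, min_impurity_decrease, or ccp_alpha lets you compare the other techniques in the same way.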
Click "Generate Data" to start
Tree visualization will appear here
Chapter 4 Quiz
🧠 Test Your Pruning Knowledge
Answer these questions about overfitting and pruning!