Chapter 4: Overfitting and Pruning

Learn how to prevent overfitting and improve generalization using various pruning techniques.

Learning Objectives

  • Understand what overfitting is and why it happens
  • Learn pre-pruning techniques to prevent overfitting
  • Master post-pruning methods for better generalization
  • Use cross-validation to evaluate model performance
  • Apply pruning techniques in practice

Understanding Overfitting

🎯 Overfitting: When Trees Get Too Smart

Imagine memorizing answers for a test instead of understanding the concepts. You might ace that specific test, but fail on new questions. That's exactly what overfitting is in machine learning!

Overfitting occurs when a decision tree becomes too complex and memorizes the training data instead of learning general patterns. This leads to poor performance on new, unseen data.

Signs of Overfitting

  • High Training Accuracy, Low Test Accuracy: The model performs well on training data but poorly on new data
  • Very Deep Trees: Trees with many levels that capture noise
  • Many Small Leaves: Each leaf contains very few samples
  • Perfect Training Performance: 100% accuracy on training data is often suspicious
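
The quickest way to spot this gap is to compare training and test accuracy on a held-out split. Below is a minimal sketch using scikit-learn; the synthetic dataset and its parameters are illustrative assumptions, not data from this chapter.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data: 300 samples, 20 features, only 5 informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# An unconstrained tree is free to memorize the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print(f"Training accuracy: {tree.score(X_train, y_train):.2f}")  # often 1.00
print(f"Test accuracy:     {tree.score(X_test, y_test):.2f}")    # noticeably lower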

Why Overfitting Happens

📊 Small Dataset

Limited data makes it hard to learn general patterns

📐 High Dimensionality

Many features relative to the number of samples make spurious splits easy to find

🛑 No Stopping Criteria

Tree grows until perfect training fit

📈 Noise in Data

Tree learns from random errors

Pre-pruning

✂️ Pre-pruning: Stopping Growth Early

Pre-pruning prevents overfitting by stopping the tree from growing too complex in the first place. It's like setting rules before you start building.

Pre-pruning Techniques

📏 Maximum Depth

Limit how deep the tree can grow

DecisionTreeClassifier(max_depth=3)

🔢 Minimum Samples Split

Require minimum samples to create a split

DecisionTreeClassifier(min_samples_split=10)

🍃 Minimum Samples Leaf

Require minimum samples in leaf nodes

DecisionTreeClassifier(min_samples_leaf=5)

📈 Minimum Impurity Decrease

Only split if it improves purity significantly

DecisionTreeClassifier(min_impurity_decrease=0.01)
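
These constraints can be combined on a single estimator. The sketch below is a minimal example; the dataset and the specific threshold values are illustrative assumptions, not tuned recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Several pre-pruning rules applied at once (values are illustrative)
pruned = DecisionTreeClassifier(
    max_depth=4,                 # cap how deep the tree can grow
    min_samples_split=10,        # need at least 10 samples to consider a split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # ignore splits that barely reduce impurity
    random_state=42,
).fit(X_train, y_train)

print(f"Depth: {pruned.get_depth()}, leaves: {pruned.get_n_leaves()}")
print(f"Train: {pruned.score(X_train, y_train):.2f}, "
      f"test: {pruned.score(X_test, y_test):.2f}")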

Post-pruning

🌳 Post-pruning: Trim After Growing

Post-pruning grows the full tree first, then removes branches that don't improve generalization. It's like trimming a bush after it's grown.

Cost Complexity Pruning

Cost complexity pruning (also called weakest link pruning) grows the full tree, then repeatedly removes the subtree whose removal increases the error the least per leaf pruned, trading a small rise in training error for a simpler tree that generalizes better.

Cost Complexity Formula

Cost = Error + α × Complexity

Where:

  • Error: Misclassification rate on the training data
  • α: Complexity parameter; larger values prune more aggressively
  • Complexity: Number of leaf nodes in the tree
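
In scikit-learn, post-pruning is controlled by the ccp_alpha parameter, and cost_complexity_pruning_path reports the α values at which branches would be removed for a given training set. The sketch below is a minimal illustration; the synthetic dataset and the choice to sample every fifth α are assumptions made to keep the output short.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Effective alphas at which pruning would remove another branch
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)

# Refit one tree per alpha and watch size and test accuracy as pruning increases
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves():3d}  "
          f"test acc={tree.score(X_test, y_test):.2f}")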

Cross-Validation

✅ Cross-Validation: Testing Without Cheating

Cross-validation helps you evaluate your model's performance without using the test set during training. It's like taking multiple practice tests to see how well you really understand the material.

K-Fold Cross-Validation

5-Fold Cross-Validation Process

Fold 1: Hold out the 1st 20% for validation, train on the other 80%
Fold 2: Hold out the 2nd 20% for validation, train on the other 80%
Fold 3: Hold out the 3rd 20% for validation, train on the other 80%
Fold 4: Hold out the 4th 20% for validation, train on the other 80%
Fold 5: Hold out the 5th 20% for validation, train on the other 80%
Final score: Average of the five validation scores
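
In scikit-learn, cross_val_score runs this whole procedure in one call. The sketch below is a minimal example; the synthetic dataset and the max_depth value are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)

# 5-fold cross-validation: each fold is held out once as the validation set
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
scores = cross_val_score(tree, X, y, cv=5)

print("Fold scores:  ", [round(float(s), 2) for s in scores])
print(f"Average score: {scores.mean():.2f} (+/- {scores.std():.2f})")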

Interactive Pruning Demo

🧪 Experiment with Pruning

Try different pruning techniques and see how they affect the tree structure and performance!
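
If you are working outside the interactive demo, a rough equivalent is to fit trees with different pruning settings and compare them side by side. The sketch below uses scikit-learn's plot_tree and the Iris dataset; both choices are illustrative assumptions, not part of the demo itself.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)

# Compare an unpruned tree with a depth-limited one
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, depth in zip(axes, [None, 3]):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X, y)
    plot_tree(tree, filled=True, ax=ax)
    ax.set_title(f"max_depth={depth}: {tree.get_n_leaves()} leaves")
plt.show()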

Click "Generate Data" to start

Tree visualization will appear here

Chapter 4 Quiz

📝 Test Your Pruning Knowledge

Answer these questions about overfitting and pruning!

Question 1: What is the main sign of overfitting?

Question 2: Which technique prevents overfitting by stopping tree growth early?