Chapter 4: Ensemble Methods - The Power of Many

Discover how combining multiple models creates superior performance through bagging and boosting

Why Single Models Aren't Enough

Even with regularization, single models have inherent limitations. Each model represents one perspective on the data. What if we could combine multiple perspectives?

The Ensemble Solution

Ensemble methods combine predictions from multiple models to create a stronger predictor:

Train Multiple Models → Combine Predictions → Final Prediction

Key Insight: Individual models make different types of errors. When combined intelligently, these errors can cancel out!

[Figure: Ensemble voting illustration showing multiple models contributing to the final prediction]

Ensemble Voting Example

Consider five models voting on a binary question, each reporting its own confidence:

  • Model 1: Yes (85% confidence)
  • Model 2: No (73% confidence)
  • Model 3: Yes (91% confidence)
  • Model 4: Yes (67% confidence)
  • Model 5: No (79% confidence)

Majority vote: YES, with 3 models voting yes and 2 voting no. Average confidence: 79%.
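
The vote above can be reproduced with a few lines of Python. This is a minimal sketch of hard majority voting; the (vote, confidence) pairs come straight from the example, and the majority_vote helper is just an illustrative name.

```python
# Minimal hard-voting sketch; the five (vote, confidence) pairs mirror the
# example above, and the helper name is purely illustrative.
from collections import Counter

votes = [
    ("Yes", 0.85),  # Model 1
    ("No",  0.73),  # Model 2
    ("Yes", 0.91),  # Model 3
    ("Yes", 0.67),  # Model 4
    ("No",  0.79),  # Model 5
]

def majority_vote(votes):
    """Return the winning label, its vote count, and the mean confidence."""
    labels = [label for label, _ in votes]
    winner, count = Counter(labels).most_common(1)[0]
    avg_conf = sum(conf for _, conf in votes) / len(votes)
    return winner, count, avg_conf

winner, count, avg_conf = majority_vote(votes)
print(f"Ensemble prediction: {winner} "
      f"({count} of {len(votes)} votes, average confidence {avg_conf:.0%})")
# Ensemble prediction: Yes (3 of 5 votes, average confidence 79%)
```

scikit-learn's VotingClassifier implements the same idea for trained estimators, and its voting='soft' option averages predicted probabilities instead of counting labels.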

Two Main Approaches

There are two fundamental ways to create ensemble models:

Bagging (Bootstrap Aggregating)

Strategy: Train models independently on different data samples

  • Models trained in parallel
  • Each model sees different data subset
  • Final prediction: average/vote
  • Reduces variance

Example: Random Forest
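
As a concrete example, here is a minimal Random Forest sketch using scikit-learn; the synthetic dataset and hyper-parameters are placeholders you would replace with your own data and tuning.

```python
# Minimal bagging example: a Random Forest, where each tree is trained on a
# different bootstrap sample and the trees vote. Data and settings are
# placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Random Forest test accuracy: {forest.score(X_test, y_test):.3f}")
```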

Boosting

Strategy: Train models sequentially, each fixing previous errors

  • Models trained sequentially
  • Each model focuses on previous mistakes
  • Final prediction: weighted combination
  • Reduces bias

Example: XGBoost
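
For comparison, a minimal boosting sketch. It uses scikit-learn's GradientBoostingClassifier to stay dependency-light; XGBoost's XGBClassifier exposes a very similar fit/predict interface. The data and hyper-parameters are again placeholders.

```python
# Minimal boosting example: shallow trees added sequentially, each one
# correcting the residual errors of the ensemble built so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

booster = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
booster.fit(X_train, y_train)
print(f"Gradient boosting test accuracy: {booster.score(X_test, y_test):.3f}")
```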

The Wisdom of Crowds Principle

Ensemble methods work because of a fundamental principle: diverse, independent predictions are often more accurate than any single prediction.

The Classic Example: Guessing the Weight of an Ox

In 1906, Francis Galton observed 787 people guessing the weight of an ox at a fair:

  • Individual guesses: Widely varying, many quite wrong
  • Average of all guesses: Within 1% of the actual weight!
  • Key insight: Errors in different directions canceled out
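
A quick simulation makes the effect concrete. The numbers below (true weight, noise level) are purely illustrative, not Galton's actual data; the point is that averaging many independent, unbiased guesses cancels most of the individual error.

```python
# Simulated wisdom of crowds: 787 noisy, unbiased guesses of a true value.
# All numbers are illustrative, not Galton's measurements.
import numpy as np

rng = np.random.default_rng(1906)
true_weight = 1200                                     # "actual" weight (lb)
guesses = true_weight + rng.normal(0, 120, size=787)   # individual guesses

print(f"Typical individual error: {np.mean(np.abs(guesses - true_weight)):.0f} lb")
print(f"Error of the average guess: {abs(guesses.mean() - true_weight):.0f} lb")
```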

Wisdom of Crowds in Practice

The same effect shows up when you mix different types of models. In one illustrative run with high model diversity:

  • Best individual model: 0.84 accuracy
  • Ensemble: 0.89 accuracy
  • Diversity score: 0.73
  • Improvement: +6%

Because the selected models make different types of errors, the ensemble outperforms even its best member.

Requirements for Ensemble Success

For ensembles to work effectively, you need:

  • Diversity: models should make different types of mistakes. How: use different algorithms, features, or training data.
  • Individual Competence: each model should be better than random. How: proper training and validation.
  • Independence: models shouldn't all fail in the same way. How: use different data samples or approaches.
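
Diversity is the easiest of these to check empirically. One rough proxy, sketched below with made-up toy predictions, is the average rate at which pairs of models disagree on held-out data; if the models almost never disagree, the ensemble has little room to improve on them.

```python
# Rough diversity check: average pairwise disagreement between model
# predictions on the same data. The toy prediction arrays are made up.
from itertools import combinations
import numpy as np

def pairwise_disagreement(predictions):
    """predictions: list of 1-D label arrays, one per model."""
    rates = [np.mean(p1 != p2) for p1, p2 in combinations(predictions, 2)]
    return float(np.mean(rates))

preds = [
    np.array([1, 0, 1, 1, 0]),   # hypothetical model A
    np.array([1, 1, 1, 0, 0]),   # hypothetical model B
    np.array([0, 0, 1, 1, 1]),   # hypothetical model C
]
print(f"Average pairwise disagreement: {pairwise_disagreement(preds):.2f}")
```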

Bagging: Independent Parallel Models

Bootstrap Aggregating (Bagging) creates diverse models by training each on a different random sample of the data.

[Figure: Comparison diagram showing bagging (parallel training) versus boosting (sequential training)]

Bagging Walkthrough

Here is how bagging creates diverse models from the same dataset. Starting from 1,000 original samples, draw 10 bootstrap samples of 750 samples each (75% of the data, drawn with replacement), train one model per sample in parallel, then average their predictions. In one illustrative run this produced:

  • Variance: 0.12
  • Bias: 0.08
  • Accuracy: 0.87

How Bagging Works

1. Bootstrap: sample the training data with replacement
2. Train: fit independent models on each sample
3. Aggregate: average predictions or take a majority vote
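
The three steps translate almost directly into code. This sketch uses numpy and plain scikit-learn decision trees; the sample counts mirror the walkthrough above and are otherwise arbitrary.

```python
# Hand-rolled bagging in three steps; counts mirror the walkthrough above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=7)
rng = np.random.default_rng(7)
n_models, sample_size = 10, 750

models = []
for _ in range(n_models):
    # 1. Bootstrap: draw row indices with replacement
    idx = rng.integers(0, len(X), size=sample_size)
    # 2. Train: fit an independent (deep) tree on this bootstrap sample
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregate: majority vote across the trees (labels are 0/1 here)
votes = np.stack([m.predict(X) for m in models])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print(f"Training accuracy of the vote: {(ensemble_pred == y).mean():.3f}")
```

In practice you would use sklearn.ensemble.BaggingClassifier or RandomForestClassifier, which do the same thing with out-of-bag evaluation and parallel training built in.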

Key Benefits:

  • Variance Reduction: Averaging reduces prediction variance
  • Overfitting Control: Individual overfitting gets averaged out
  • Parallelizable: Models can be trained simultaneously
  • Robustness: Less sensitive to outliers

When Bagging Works Best:

  • Base models have high variance (like deep decision trees)
  • You have sufficient data for multiple samples
  • Models can be trained independently
  • You want to reduce overfitting

Boosting: Sequential Error Correction

Boosting takes a different approach: train models sequentially, with each new model focusing on the errors made by previous models.

Boosting Walkthrough

Boosting improves the ensemble round by round. After iteration 1 in one illustrative run:

  • Training Error: 0.15
  • Bias: 0.12
  • Accuracy: 0.85

Each additional round adds a model focused on the remaining errors, driving the bias down further.
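
To see "sequential error correction" in code, here is a bare-bones AdaBoost-style loop: after each round, misclassified samples get more weight, so the next weak learner concentrates on them. It is an intuition-building sketch, not a production implementation, and the round count and data are arbitrary.

```python
# Bare-bones AdaBoost-style boosting loop, for intuition only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=3)
y_signed = np.where(y == 1, 1, -1)            # work with +/-1 labels

weights = np.full(len(X), 1 / len(X))         # start with uniform weights
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1)          # weak learner
    stump.fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)
    err = np.sum(weights[pred != y_signed])              # weighted error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))      # learner's say
    weights *= np.exp(-alpha * y_signed * pred)          # up-weight mistakes
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted combination of all weak learners
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(f"Training accuracy after 50 rounds: {np.mean(np.sign(scores) == y_signed):.3f}")
```

Libraries such as scikit-learn (AdaBoostClassifier, GradientBoostingClassifier) and XGBoost add the numerical safeguards, regularization, and early stopping this sketch leaves out.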

Bagging vs Boosting: The Key Differences

Bagging Characteristics

  • Training: Parallel/Independent
  • Focus: Reduce variance
  • Base Models: Often complex (high variance)
  • Combination: Simple average/majority vote
  • Overfitting Risk: Lower
  • Speed: Can parallelize

Boosting Characteristics

  • Training: Sequential/Adaptive
  • Focus: Reduce bias
  • Base Models: Often simple (high bias)
  • Combination: Weighted combination
  • Overfitting Risk: Higher (powerful, but needs careful tuning)
  • Speed: Sequential (slower)

Which Should You Choose?

  • Choose Bagging when: Base models overfit, you want stability, you can parallelize
  • Choose Boosting when: Base models underfit, you want maximum accuracy, you're willing to tune carefully
  • In practice: Random Forest (bagging) for quick, robust results; XGBoost (boosting) for competitions and maximum performance
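
When in doubt, it is cheap to try both families on your own data. The sketch below cross-validates a Random Forest against scikit-learn's gradient boosting on synthetic data; swap in your own X and y, and treat the scores as a starting point for tuning rather than a verdict.

```python
# Quick side-by-side comparison of a bagging and a boosting ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=25, random_state=5)

models = {
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=5),
    "Gradient Boosting (boosting)": GradientBoostingClassifier(random_state=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```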

Chapter 4 Quiz

Test your understanding of ensemble methods:

Question 1: What is the main difference between bagging and boosting?

  • Bagging trains models in parallel, boosting trains sequentially
  • Bagging is more accurate than boosting
  • Boosting always uses decision trees
  • Bagging requires more data than boosting

Answer: Bagging trains models in parallel, boosting trains sequentially. This is the fundamental difference: bagging creates independent models simultaneously, while boosting builds models sequentially, where each new model learns from the mistakes of previous ones.

Question 2: Why do ensemble methods often outperform single models?

  • They always use more data
  • Different models make different errors that can cancel out when combined
  • They are faster to train
  • They require less feature engineering

Answer: Different models make different errors that can cancel out when combined. This is the wisdom of crowds principle: when diverse models make different types of errors, combining their predictions can cancel out individual mistakes, leading to more accurate overall predictions.

Question 3: When would you prefer bagging over boosting?

  • When you need the highest possible accuracy
  • When you have high-variance base models and want stability
  • When you have very simple base models
  • When interpretability is most important

Answer: When you have high-variance base models and want stability. Bagging is ideal for high-variance models (like deep decision trees) because averaging their predictions reduces variance and creates more stable, robust predictions.