Learning Objectives

By the end of this chapter, you will be able to:

Apply Bayes' theorem to classification problems.
Explain the conditional independence assumption behind Naive Bayes.
Implement and evaluate Naive Bayes on small practical examples.

🧠 Complete Guide to Naive Bayes

📚 What is Naive Bayes?

Naive Bayes is a family of probabilistic classifiers that apply Bayes' theorem under the "naive" assumption that every pair of features is conditionally independent given the class label. Despite this strong assumption, it works surprisingly well for many real-world problems, especially text classification and spam filtering.

🔢 The Mathematical Foundation

Bayes' theorem forms the core of this algorithm:

\[ P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \]

Definitions: \(P(A \mid B)\) is the probability of hypothesis \(A\) after observing evidence \(B\), \(P(B \mid A)\) is the likelihood, \(P(A)\) is the prior, and \(P(B)\) normalizes the result.

Interpretation: update what you believed before seeing the evidence by measuring how likely that evidence is under the hypothesis.
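To make the update concrete, here is a minimal sketch with made-up numbers. The function name `bayes_posterior` and the spam figures are illustrative assumptions, not values from this chapter.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    # P(B) by the law of total probability over A and not-A.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 60% of spam and in 5% of legitimate emails.
posterior = bayes_posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
print(f"P(spam | 'offer') = {posterior:.3f}")
```

Seeing the word raises the belief in "spam" well above the 20% prior because the word is far more likely under that hypothesis than under its alternative.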

For classification, this becomes:

\[ P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})} \]

Definitions: \(y\) is a class label and \(\mathbf{x}\) is the feature vector for one example.

Simplification: \(P(\mathbf{x})\) does not depend on \(y\), so it is the same for every class. To choose the winning class, we can compare the proportional scores \(P(\mathbf{x} \mid y)P(y)\) instead of calculating the denominator.

The "naive" assumption is that the features are conditionally independent given the class, so the likelihood factorizes:

\[ P(\mathbf{x} \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \]

Definitions: \(x_i\) is one feature value and \(n\) is the number of features.

Common mistake: treating the assumption as a claim about the real world. Features are rarely truly independent; the assumption is a useful simplification.
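One practical consequence of the product form: multiplying many probabilities below 1 can underflow to zero in floating point. Implementations commonly compare sums of log-probabilities instead, which leaves the ranking of classes unchanged because the logarithm is monotonic. A minimal sketch, with hypothetical function names and illustrative numbers:

```python
import math


def naive_bayes_score(prior, likelihoods):
    """Unnormalized score P(y) * prod_i P(x_i | y)."""
    score = prior
    for p in likelihoods:
        score *= p
    return score


def naive_bayes_log_score(prior, likelihoods):
    """The same score in log space: log P(y) + sum_i log P(x_i | y)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


# Illustrative values: 500 features, each with likelihood 0.01.
many = [0.01] * 500
print(naive_bayes_score(0.5, many))      # the plain product underflows
print(naive_bayes_log_score(0.5, many))  # the log score stays finite
```

The log version assumes every likelihood is strictly positive, which is one reason smoothing matters.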

📊 Simple Example: Weather Prediction

Dataset: Will we play tennis based on weather?

Day	Outlook	Temperature	Humidity	Wind	Play Tennis?
1	Sunny	Hot	High	Weak	No
2	Sunny	Hot	High	Strong	No
3	Overcast	Hot	High	Weak	Yes
4	Rain	Mild	High	Weak	Yes
5	Rain	Cool	Normal	Weak	Yes
6	Rain	Cool	Normal	Strong	No
7	Overcast	Cool	Normal	Strong	Yes
8	Sunny	Mild	High	Weak	No
9	Sunny	Cool	Normal	Weak	Yes
10	Rain	Mild	Normal	Weak	Yes
11	Sunny	Mild	Normal	Strong	Yes
12	Overcast	Mild	High	Strong	Yes
13	Overcast	Hot	Normal	Weak	Yes
14	Rain	Mild	High	Strong	No

Step-by-Step Calculation

Let's predict: Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong

1. Prior Probabilities:

\(P(\text{Yes}) = 9/14 \approx 0.6429\)

\(P(\text{No}) = 5/14 \approx 0.3571\)

2. Likelihood Calculations:

Use Laplace smoothing with \(\alpha = 1\):

\[ P(x_i = v \mid y) = \frac{\operatorname{count}(x_i = v, y) + \alpha}{\operatorname{count}(y) + \alpha k_i} \]

Here, \(k_i\) is the number of possible values for the feature. Outlook and temperature have 3 values; humidity and wind have 2 values.
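The smoothing formula translates directly into a small helper. This is a sketch; the name `smoothed_likelihood` and its parameter names are assumptions made for illustration.

```python
def smoothed_likelihood(count, class_count, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(x_i = v | y).

    count       -- training rows in class y where feature i equals v
    class_count -- training rows in class y
    n_values    -- number of possible values of feature i (k_i)
    """
    return (count + alpha) / (class_count + alpha * n_values)


# Sunny occurs on 2 of the 9 Yes days, and Outlook has 3 possible values.
print(smoothed_likelihood(2, 9, 3))

# A value never observed with a class still gets a nonzero probability.
print(smoothed_likelihood(0, 9, 3) > 0)
```

With `alpha=0` the helper reduces to the raw relative frequency, which is zero for unseen combinations.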

For Play=Yes:

\(P(\text{Sunny} \mid \text{Yes}) = (2 + 1)/(9 + 3) = 0.2500\)
\(P(\text{Cool} \mid \text{Yes}) = (3 + 1)/(9 + 3) \approx 0.3333\)
\(P(\text{High} \mid \text{Yes}) = (3 + 1)/(9 + 2) \approx 0.3636\)
\(P(\text{Strong} \mid \text{Yes}) = (3 + 1)/(9 + 2) \approx 0.3636\)

For Play=No:

\(P(\text{Sunny} \mid \text{No}) = (3 + 1)/(5 + 3) = 0.5000\)
\(P(\text{Cool} \mid \text{No}) = (1 + 1)/(5 + 3) = 0.2500\)
\(P(\text{High} \mid \text{No}) = (4 + 1)/(5 + 2) \approx 0.7143\)
\(P(\text{Strong} \mid \text{No}) = (3 + 1)/(5 + 2) \approx 0.5714\)

3. Final Calculation:

\(P(\text{Yes} \mid \mathbf{x}) \propto 0.6429 \times 0.2500 \times 0.3333 \times 0.3636 \times 0.3636 \approx 0.0071\)

\(P(\text{No} \mid \mathbf{x}) \propto 0.3571 \times 0.5000 \times 0.2500 \times 0.7143 \times 0.5714 \approx 0.0182\)

After normalization (dividing each unrounded score by the sum of the two scores): \(P(\text{Yes} \mid \mathbf{x}) = 28.0\%\) and \(P(\text{No} \mid \mathbf{x}) = 72.0\%\).

Prediction: No (Don't play tennis)
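The hand calculation can be cross-checked with a short script. This is a sketch using only the standard library; the rows are copied from the table above, and the function name `posteriors` is an assumption made for illustration.

```python
from collections import Counter

# The 14 rows from the table: Outlook, Temperature, Humidity, Wind, Play.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def posteriors(rows, query, alpha=1.0):
    """Normalized class posteriors for one query, with Laplace smoothing."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(rows)  # prior P(y)
        for i, value in enumerate(query):
            n_values = len({row[i] for row in rows})  # k_i
            count = sum(1 for row in rows if row[-1] == cls and row[i] == value)
            score *= (count + alpha) / (n_cls + alpha * n_values)
        scores[cls] = score
    total = sum(scores.values())  # plays the role of P(x)
    return {cls: score / total for cls, score in scores.items()}


result = posteriors(ROWS, ("Sunny", "Cool", "High", "Strong"))
print({cls: round(p, 3) for cls, p in result.items()})
```

The printed posteriors should agree with the normalized figures in the worked example, up to rounding.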

💻 Python Implementation

From Scratch Implementation

from collections import defaultdict

import numpy as np

class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = {}
        self.feature_probs = defaultdict(lambda: defaultdict(dict))
        self.classes = []
        
    def fit(self, X, y):
        """Train the Naive Bayes classifier"""
        self.classes = np.unique(y)
        n_samples = len(y)
        
        # Calculate class probabilities
        for cls in self.classes:
            self.class_probs[cls] = np.sum(y == cls) / n_samples
        
        # Calculate feature probabilities
        for feature_idx in range(X.shape[1]):
            feature_values = np.unique(X[:, feature_idx])
            
            for cls in self.classes:
                class_mask = (y == cls)
                class_samples = X[class_mask]
                
                for value in feature_values:
                    count = np.sum(class_samples[:, feature_idx] == value)
                    # Add Laplace smoothing
                    self.feature_probs[feature_idx][cls][value] = (
                        (count + 1) / (np.sum(class_mask) + len(feature_values))
                    )
    
    def predict(self, X):
        """Predict classes"""
        predictions = []
        
        for sample in X:
            class_scores = {}
            
            for cls in self.classes:
                # Start with class prior
                score = self.class_probs[cls]
                
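                # Multiply by feature likelihoods. A value never seen in
                # training has no stored estimate, so it is skipped and
                # leaves the score unchanged for every class.
                for feature_idx, feature_value in enumerate(sample):
                    if feature_value in self.feature_probs[feature_idx][cls]:
                        score *= self.feature_probs[feature_idx][cls][feature_value]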
                
                class_scores[cls] = score
            
            predictions.append(max(class_scores, key=class_scores.get))
        
        return predictions

# Example usage
weather_data = [
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    # ... more data
]

X = np.array([row[:-1] for row in weather_data])
y = np.array([row[-1] for row in weather_data])

nb = NaiveBayesClassifier()
nb.fit(X, y)

# Predict labels for the training rows as a quick sanity check
print(nb.predict(X))

🔍 Types of Naive Bayes

1. Gaussian Naive Bayes

Used for continuous features that are assumed to follow a normal distribution within each class.

\[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}}\exp\left(-\frac{(x_i-\mu_{iy})^2}{2\sigma_{iy}^2}\right) \]

\(\mu_{iy}\) and \(\sigma_{iy}^2\) are the mean and variance of feature \(i\), estimated from the training samples in class \(y\).
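As a sketch of how the Gaussian likelihood is evaluated for one feature in one class, the snippet below estimates the class mean and variance from a handful of values and plugs them into the density. The temperature readings and the function name `gaussian_likelihood` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_likelihood(x, mean, variance):
    """Normal density with the given mean and variance, evaluated at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * variance))


# Hypothetical temperatures (in °C) observed on days belonging to one class.
temps_in_class = [18.0, 20.0, 22.0, 21.0, 19.0]
mu = fmean(temps_in_class)
var = pvariance(temps_in_class)  # population (maximum-likelihood) variance

print(gaussian_likelihood(20.0, mu, var))
print(gaussian_likelihood(30.0, mu, var))  # far from the mean, so much smaller
```

The returned value is a density rather than a probability, so it can exceed 1 when the variance is small. That is harmless for classification because only the comparison between classes matters.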

2. Multinomial Naive Bayes

Used for discrete counts (e.g., word counts in text classification).

\[ P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n} \]

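\(N_{yi}\) is the total count of feature \(i\) across the training samples of class \(y\), \(N_y = \sum_i N_{yi}\) is the total count of all features in class \(y\), \(n\) is the number of features (the vocabulary size for text), and \(\alpha\) controls smoothing. The result is the probability of feature \(i\) per occurrence; a sample in which the feature occurs \(x_i\) times contributes this probability raised to the power \(x_i\).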
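A minimal sketch of the smoothed multinomial estimate for a single class. The tokenized messages, the vocabulary, and the function name `multinomial_feature_probs` are hypothetical.

```python
from collections import Counter


def multinomial_feature_probs(documents, vocabulary, alpha=1.0):
    """Smoothed per-word probabilities for the documents of one class."""
    counts = Counter(
        word for doc in documents for word in doc if word in vocabulary
    )
    total = sum(counts.values())  # N_y: all word occurrences in the class
    n = len(vocabulary)           # vocabulary size
    return {w: (counts[w] + alpha) / (total + alpha * n) for w in vocabulary}


# Hypothetical tokenized spam messages and a five-word vocabulary.
spam_docs = [["win", "money", "now"], ["win", "prize"]]
vocab = ["win", "money", "now", "prize", "meeting"]

probs = multinomial_feature_probs(spam_docs, vocab)
print(probs["win"], probs["meeting"])
```

The word "meeting" never occurs in the spam messages, yet smoothing gives it a small nonzero probability, and the probabilities over the vocabulary still sum to 1.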

3. Bernoulli Naive Bayes

Used for binary/boolean features.

\[ P(x_i \mid y) = p_{iy}^{x_i}(1 - p_{iy})^{1 - x_i} \]

\(x_i\) is 0 or 1, and \(p_{iy}\) is the probability that feature \(i\) is present in class \(y\).
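The Bernoulli formula can be sketched in a few lines. The presence probabilities and function names below are hypothetical; note that, unlike the multinomial variant, an absent feature (\(x_i = 0\)) also contributes a factor.

```python
def bernoulli_likelihood(x, p):
    """P(x_i | y) for binary x, where p is P(feature i present | y)."""
    return p ** x * (1.0 - p) ** (1 - x)


def bernoulli_joint(features, presence_probs):
    """Product over all features, under the naive independence assumption."""
    result = 1.0
    for x, p in zip(features, presence_probs):
        result *= bernoulli_likelihood(x, p)
    return result


# Hypothetical presence probabilities for three words in one class.
p_spam = [0.8, 0.1, 0.5]

# Words 1 and 3 present, word 2 absent: 0.8 * (1 - 0.1) * 0.5
print(bernoulli_joint([1, 0, 1], p_spam))
```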

✅ Advantages and Disadvantages

✅ Advantages:

🚀 Simple and Fast: Easy to implement and computationally efficient
📈 Good Performance: Often performs well even with small training sets
🛡️ Low Overfitting Risk: Few parameters to estimate, so it is less prone to overfitting, especially with small data
🎯 Handles Multiple Classes: Naturally handles multi-class classification
📊 Good Baseline: Excellent baseline for comparison with other algorithms
🎲 Probabilistic Output: Provides probability estimates, though they are often poorly calibrated when the independence assumption is violated

❌ Disadvantages:

🔗 Independence Assumption: Assumes features are independent (rarely true)
🔍 Categorical Inputs: A feature value never seen with a class gets zero probability unless smoothing (e.g., Laplace) is applied
⚡ Limited Expressiveness: Cannot learn interactions between features
📊 Skewed Data: Can be biased if training data is not representative

🚀 Real-World Applications

📧 Email Spam Filtering

Classic application using word frequencies to classify emails as spam or legitimate.

📰 Text Classification

News categorization, sentiment analysis, and document classification.

⚕️ Medical Diagnosis

Predicts diseases from symptoms and test results.

🌤️ Weather Prediction

Predicts weather from atmospheric conditions and historical data.

🎬 Recommendation Systems

Content-based filtering for movies, books, and products.

⚡ Real-time Predictions

Suits low-latency production systems because of its computational efficiency.

Tips for Better Performance

1. Laplace Smoothing: Add small constant to avoid zero probabilities
2. Feature Selection: Remove highly correlated features
3. Data Preprocessing: Handle missing values and outliers
4. Cross-Validation: Use proper validation techniques
5. Feature Engineering: Create meaningful features from raw data
6. Ensemble Methods: Combine with other algorithms
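For the cross-validation tip, the index bookkeeping behind k-fold splitting can be sketched as follows. The helper `k_fold_indices` is a hypothetical name and uses contiguous folds without shuffling. In practice the rows are usually shuffled first, and a library routine would normally be preferred.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 14 samples (the size of the tennis dataset) split into 5 folds.
for train_idx, test_idx in k_fold_indices(14, 5):
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so averaging the per-fold accuracy uses every row for evaluation once while never testing on a row the model was trained on.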