Chapter 4: Backpropagation Algorithm

The Learning Mechanism - Understanding how neural networks learn by computing gradients and updating weights

Learning Objectives

  • Understand the chain rule and its role in backpropagation
  • Master loss functions and error computation
  • Learn the backward pass step-by-step
  • Understand gradient computation for each layer
  • Implement backpropagation from scratch
  • Understand vanishing and exploding gradients

What is Backpropagation?

The Learning Algorithm

Backpropagation (backward propagation of errors) is the algorithm that enables neural networks to learn. It computes gradients of the loss function with respect to each weight and bias, allowing the network to update its parameters and improve predictions.

The Process:

  1. Forward Pass: Compute predictions using current weights
  2. Compute Loss: Measure how wrong predictions are
  3. Backward Pass: Compute gradients (how to change weights)
  4. Update Weights: Adjust weights to reduce loss
  5. Repeat: Continue until the loss is minimized (see the minimal loop sketched below)
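
In code, the five steps above form one loop. Below is a minimal runnable sketch that fits a single weight w in the toy model y = w·x with a (1/2)·MSE loss; the model and numbers are purely illustrative, not this chapter's network.

import numpy as np

# Minimal instance of the five-step loop: fit y = w*x by gradient descent
X = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])                  # true relationship: w = 2
w, lr = 0.0, 0.1

for step in range(50):
    y_pred = w * X                                  # 1. Forward pass
    loss = 0.5 * np.mean((y_pred - y_true)**2)      # 2. Compute loss
    grad = np.mean((y_pred - y_true) * X)           # 3. Backward pass: dL/dw
    w -= lr * grad                                  # 4. Update weight
    # 5. Repeat

print(f"learned w = {w:.3f}")  # approaches 2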

📚 Real-World Analogy: Learning to Throw

Think of learning to throw a ball into a basket:

  • Forward Pass: You throw the ball (make a prediction)
  • Loss: You see how far off you were (measure error)
  • Backpropagation: You figure out what to adjust (angle, force, etc.)
  • Update: You adjust your throwing technique
  • Repeat: Keep practicing until you get it right!

Key Insight: Backpropagation tells you exactly how to adjust each "muscle" (weight) to improve your "throw" (prediction)!

Why "Backward"?

Backpropagation works backward through the network because:

  • Error starts at output: We know the error at the final layer
  • Propagate backward: We compute how much each previous layer contributed to the error
  • Chain rule: Errors flow backward through the network using calculus
  • Efficient: One backward pass computes all gradients at once

The Chain Rule: Foundation of Backpropagation

🔗 Mathematical Foundation

The chain rule from calculus is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions, which is exactly what neural networks are!

Chain Rule for Single Variable

If \(y = f(g(x))\), then:

\[\frac{dy}{dx} = \frac{df}{dg} \times \frac{dg}{dx}\]
Intuition:

To find how y changes with x, we need to know:

  • How y changes with g (df/dg)
  • How g changes with x (dg/dx)

Multiply them together!

Simple Example

Example: y = (2x + 1)³

Let g = 2x + 1, so y = g³

Using chain rule:

  • dy/dg = 3g² = 3(2x + 1)²
  • dg/dx = 2
  • dy/dx = (dy/dg) × (dg/dx) = 3(2x + 1)² × 2 = 6(2x + 1)²

Verification: Direct differentiation gives the same result!
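
As a quick numerical check, the sketch below compares the chain-rule result with a central finite-difference approximation of the derivative:

import numpy as np

# Verify dy/dx for y = (2x + 1)^3 at x = 1 using finite differences
f = lambda x: (2*x + 1)**3
x, eps = 1.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
analytic = 6 * (2*x + 1)**2  # the chain-rule result derived above
print(numeric, analytic)     # both ≈ 54.0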

Chain Rule for Multiple Variables

For neural networks, we need the multivariable chain rule:

If \(z = f(x, y)\) where \(x = g(t)\) and \(y = h(t)\), then:

\[\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}\]
In Neural Networks:
  • Each layer's output depends on previous layers
  • Error propagates backward through all paths
  • We sum contributions from all paths (sum rule; a two-path example follows below)
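
A small worked instance of the sum rule, as a sketch: let z = x·y with x = t² and y = t³, so z = t⁵ and dz/dt = 5t⁴. The contributions from the two paths (through x and through y) must add up to the same answer:

t = 2.0
x, y = t**2, t**3           # x = g(t), y = h(t)
dz_dx, dz_dy = y, x         # partial derivatives of z = x*y
dx_dt, dy_dt = 2*t, 3*t**2

# Sum the contributions from both paths
dz_dt = dz_dx * dx_dt + dz_dy * dy_dt
print(dz_dt, 5 * t**4)      # both = 80.0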

Chain Rule Visualization

import numpy as np

# Example: Computing gradient through a simple network
# y = f(g(h(x)))
# where h(x) = x², g(u) = 2u + 1, f(v) = v³

def h(x):
    return x**2

def g(u):
    return 2*u + 1

def f(v):
    return v**3

# Forward pass
x = 3
u = h(x)  # u = 9
v = g(u)  # v = 19
y = f(v)  # y = 6859

# Backward pass (chain rule)
df_dv = 3 * v**2  # = 3 * 19² = 1083
dg_du = 2
dh_dx = 2 * x     # = 2 * 3 = 6

# Chain rule: dy/dx = (df/dv) × (dg/du) × (dh/dx)
dy_dx = df_dv * dg_du * dh_dx
print(f"dy/dx = {dy_dx}")

# Verification: y = (2x² + 1)³
# dy/dx = 3(2x² + 1)² × 4x = 12x(2x² + 1)²
# At x=3: 12×3×(2×9+1)² = 36×19² = 12996 ✓

Loss Functions: Measuring Error

📏 What is a Loss Function?

A loss function measures how wrong our predictions are. It quantifies the difference between predicted and actual values. The goal of training is to minimize this loss.

Common Loss Functions

1. Mean Squared Error (MSE) - For Regression

\[L = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred},i} - y_{\text{true},i})^2\]
Properties:
  • Always positive: Squared terms ensure L ≥ 0
  • Penalizes large errors: Squaring amplifies big mistakes
  • Differentiable: Smooth curve, easy to optimize
  • Best for: Regression tasks (predicting continuous values); a quick numeric check follows below
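
A quick illustration of the "penalizes large errors" property: squaring means doubling an error quadruples its contribution, so the largest error dominates the loss.

import numpy as np

# The error of 4 contributes 16 of the total, dwarfing the smaller errors
errors = np.array([0.5, 1.0, 2.0, 4.0])
print(errors**2)             # [ 0.25  1.    4.   16.  ]
print(np.mean(errors**2))    # MSE ≈ 5.31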

2. Cross-Entropy Loss - For Classification

Binary Cross-Entropy:

\[L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{\text{true},i} \log(y_{\text{pred},i}) + (1-y_{\text{true},i}) \log(1-y_{\text{pred},i}) \right]\]

Multi-class Cross-Entropy:

\[L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{\text{true},i,c} \log(y_{\text{pred},i,c})\]
Why Cross-Entropy?
  • Works with probabilities: Designed for classification
  • Large penalty for confident wrong predictions: If y_pred ≈ 0 but y_true = 1, loss → ∞
  • Small penalty for correct predictions: If y_pred ≈ y_true, loss → 0
  • Best for: Classification tasks (a quick demonstration follows below)
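
A short sketch of the asymmetric penalty: for a positive example (y_true = 1), the per-sample loss is -log(y_pred), which is tiny when the model is confidently right and explodes when it is confidently wrong.

import numpy as np

# Per-sample BCE for y_true = 1 at different predicted probabilities
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_pred = {p:.2f} -> loss = {-np.log(p):.3f}")
# Confidently right (0.99) costs ~0.010; confidently wrong (0.01) costs ~4.605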

Loss Function Implementation

import numpy as np

def mse_loss(y_pred, y_true):
    """
    Mean Squared Error
    
    Parameters:
    y_pred: Predicted values (n_samples,)
    y_true: True values (n_samples,)
    """
    return np.mean((y_pred - y_true)**2)

def mse_loss_derivative(y_pred, y_true):
    """Derivative of MSE with respect to y_pred"""
    return 2 * (y_pred - y_true) / len(y_pred)

def binary_cross_entropy(y_pred, y_true, epsilon=1e-15):
    """
    Binary Cross-Entropy Loss
    
    Parameters:
    y_pred: Predicted probabilities (n_samples,)
    y_true: True labels (0 or 1) (n_samples,)
    epsilon: Small value to prevent log(0)
    """
    # Clip to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_derivative(y_pred, y_true, epsilon=1e-15):
    """Derivative of BCE with respect to y_pred"""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_pred)

def cross_entropy(y_pred, y_true, epsilon=1e-15):
    """
    Multi-class Cross-Entropy Loss
    
    Parameters:
    y_pred: Predicted probabilities (n_samples, n_classes)
    y_true: True labels (one-hot encoded) (n_samples, n_classes)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example usage
y_pred_reg = np.array([2.1, 3.5, 1.8])
y_true_reg = np.array([2.0, 3.0, 2.0])
print(f"MSE: {mse_loss(y_pred_reg, y_true_reg):.4f}")

y_pred_cls = np.array([0.9, 0.1, 0.8])
y_true_cls = np.array([1, 0, 1])
print(f"BCE: {binary_cross_entropy(y_pred_cls, y_true_cls):.4f}")

The Backward Pass: Computing Gradients

⬅️ Working Backward Through Layers

The backward pass computes gradients starting from the output layer and moving backward to the input layer. We use the chain rule to propagate errors through each layer.

Step-by-Step Backward Pass

1. Output Layer Gradient

Start with the loss gradient at the output:

\[\frac{\partial L}{\partial a^{(L)}} = \text{derivative of loss with respect to output}\]
For MSE Loss (using the 1/2 convention, chosen so the factor of 2 from squaring cancels when differentiating):
  • L = (1/2)(y_pred - y_true)²
  • ∂L/∂y_pred = y_pred - y_true (error signal!)

2. Gradient Through Activation

Apply chain rule through activation function:

\[\frac{\partial L}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \times f'(z^{(L)})\]
Example with Sigmoid:
  • If a = σ(z), then ∂a/∂z = σ(z)(1 - σ(z))
  • ∂L/∂z = (∂L/∂a) × σ(z)(1 - σ(z)), as checked numerically below
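
A quick numerical sanity check of the sigmoid derivative identity σ'(z) = σ(z)(1 - σ(z)), as a sketch:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
z, eps = 0.7, 1e-6

# Central finite difference vs. the closed-form derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # both ≈ 0.2217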

3. Gradient for Weights and Biases

Compute gradients for parameters:

\[\frac{\partial L}{\partial W^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \times (a^{(L-1)})^T\] \[\frac{\partial L}{\partial b^{(L)}} = \sum \frac{\partial L}{\partial z^{(L)}}\]
Intuition:
  • Weight gradient: How much did each input contribute to the error?
  • Bias gradient: Sum of errors (bias affects all outputs equally)

4. Propagate to Previous Layer

Compute error for previous layer:

\[\frac{\partial L}{\partial a^{(L-1)}} = (W^{(L)})^T \times \frac{\partial L}{\partial z^{(L)}}\]
This becomes the error signal for layer L-1:
  • Weights transpose: "distribute" error back to previous layer
  • This error signal is used to compute gradients for layer L-1
  • Process repeats for all layers

Complete Example: 2-Layer Network

Network: 2 inputs → 3 hidden → 1 output

Forward Pass (already computed):

  • a⁽⁰⁾ = [0.5, 0.8] (input)
  • z⁽¹⁾ = [0.39, 0.62, 0.85] (hidden pre-activation)
  • a⁽¹⁾ = [0.39, 0.62, 0.85] (hidden activation, using ReLU)
  • z⁽²⁾ = 1.076 (output pre-activation)
  • a⁽²⁾ = 0.746 (output, using sigmoid)

True value: y_true = 1.0

Backward Pass:

  1. Output layer error: ∂L/∂a⁽²⁾ = a⁽²⁾ - y_true = 0.746 - 1.0 = -0.254
  2. Through sigmoid: ∂L/∂z⁽²⁾ = -0.254 × 0.746(1-0.746) = -0.254 × 0.190 = -0.048
  3. Weight gradient: ∂L/∂W⁽²⁾ = -0.048 × a⁽¹⁾ = -0.048 × [0.39, 0.62, 0.85]
  4. Bias gradient: ∂L/∂b⁽²⁾ = -0.048
  5. Propagate to hidden: ∂L/∂a⁽¹⁾ = W⁽²⁾ᵀ × (-0.048); assuming example output weights W⁽²⁾ = [0.4, 0.5, 0.6], this gives [-0.019, -0.024, -0.029]
  6. Continue for layer 1... (the sketch below reproduces these numbers)
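
The sketch below reproduces these numbers, assuming the output weights W⁽²⁾ = [0.4, 0.5, 0.6] used in step 5 (the hidden-layer weights are not needed for these steps):

import numpy as np

a1 = np.array([0.39, 0.62, 0.85])    # hidden activations (ReLU, all positive)
a2, y_true = 0.746, 1.0              # output activation and target
W2 = np.array([0.4, 0.5, 0.6])       # assumed output-layer weights

dL_da2 = a2 - y_true                 # step 1: -0.254
dL_dz2 = dL_da2 * a2 * (1 - a2)      # step 2: ≈ -0.048 (through sigmoid)
grad_W2 = dL_dz2 * a1                # step 3: weight gradients
grad_b2 = dL_dz2                     # step 4: bias gradient
dL_da1 = W2 * dL_dz2                 # step 5: error for the hidden layer
dL_dz1 = dL_da1 * (a1 > 0)           # through ReLU (z1 = a1 here, all positive)
print(grad_W2, grad_b2, dL_dz1)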

Gradient Computation: The Math

Complete Gradient Formulas

For a network with L layers:

Output Layer (L):

\[\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot f'^{(L)}(z^{(L)})\]

Where ⊙ is element-wise multiplication

Hidden Layers (l = L-1, L-2, ..., 1):

\[\delta^{(l)} = ((W^{(l+1)})^T \times \delta^{(l+1)}) \odot f'^{(l)}(z^{(l)})\]

Weight Gradients:

\[\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \times (a^{(l-1)})^T\]

Bias Gradients:

\[\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} \quad \text{(sum over batch dimension)}\]
Notation:
  • δ⁽ˡ⁾: Error signal (gradient w.r.t. pre-activation) for layer l
  • f'⁽ˡ⁾: Derivative of activation function for layer l
  • ⊙: Element-wise (Hadamard) product

Backpropagation Implementation

import numpy as np

class NeuralNetwork:
    """Neural Network with Backpropagation"""
    
    def __init__(self, layer_sizes, activation='relu'):
        self.layer_sizes = layer_sizes
        self.activation = activation
        self.weights = []
        self.biases = []
        
        # Initialize weights (He initialization for ReLU)
        for i in range(len(layer_sizes) - 1):
            W = np.random.randn(layer_sizes[i+1], layer_sizes[i]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((layer_sizes[i+1], 1))
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z):
        """Activation function"""
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            z = np.clip(z, -250, 250)
            return 1 / (1 + np.exp(-z))
        return z
    
    def _activate_derivative(self, z):
        """Derivative of activation function"""
        if self.activation == 'relu':
            return (z > 0).astype(float)
        elif self.activation == 'sigmoid':
            s = self._activate(z)
            return s * (1 - s)
        return np.ones_like(z)
    
    def forward(self, X):
        """Forward propagation"""
        activations = [X]
        z_values = []
        current = X
        
        for W, b in zip(self.weights, self.biases):
            z = np.dot(W, current) + b
            z_values.append(z)
            a = self._activate(z)
            activations.append(a)
            current = a
        
        return activations, z_values
    
    def backward(self, X, y_true, activations, z_values):
        """
        Backpropagation
        
        Returns gradients for weights and biases
        """
        m = X.shape[1]  # batch size
        grads_W = []
        grads_b = []
        
        # Output layer error: dL/da for the (1/2)·MSE convention,
        # so the factor of 2 from squaring cancels
        a_L = activations[-1]
        delta = a_L - y_true
        
        # Backpropagate through layers
        for l in range(len(self.weights) - 1, -1, -1):
            # Gradient through activation
            delta = delta * self._activate_derivative(z_values[l])
            
            # Weight and bias gradients
            grad_W = np.dot(delta, activations[l].T) / m
            grad_b = np.sum(delta, axis=1, keepdims=True) / m
            
            grads_W.insert(0, grad_W)
            grads_b.insert(0, grad_b)
            
            # Propagate error to previous layer (if not first layer)
            if l > 0:
                delta = np.dot(self.weights[l].T, delta)
        
        return grads_W, grads_b
    
    def update_weights(self, grads_W, grads_b, learning_rate=0.01):
        """Update weights using gradients"""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]
    
    def train_step(self, X, y_true, learning_rate=0.01):
        """One training step: forward + backward + update"""
        activations, z_values = self.forward(X)
        grads_W, grads_b = self.backward(X, y_true, activations, z_values)
        self.update_weights(grads_W, grads_b, learning_rate)
        
        # Return loss
        loss = np.mean((activations[-1] - y_true)**2)
        return loss

# Example usage
network = NeuralNetwork([2, 3, 1], activation='sigmoid')
X = np.array([[0.5, 0.8]]).T
y = np.array([[1.0]])

# Training loop
for epoch in range(100):
    loss = network.train_step(X, y, learning_rate=0.1)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

Weight Updates: Gradient Descent

⬇️ Minimizing the Loss

Once we have gradients, we update weights to reduce the loss. We move in the direction opposite to the gradient (gradient descent).

Gradient Descent Update Rule

\[W \leftarrow W - \eta \times \frac{\partial L}{\partial W}\] \[b \leftarrow b - \eta \times \frac{\partial L}{\partial b}\]

Where η (eta) is the learning rate

Why Subtract?
  • Gradient points uphill: Direction of steepest increase
  • We want to go downhill: Direction of steepest decrease
  • Solution: Move in opposite direction (subtract gradient)

📚 Analogy: Finding the Bottom of a Valley

Imagine you're blindfolded in a valley and want to reach the bottom:

  • Loss function: Height of the valley (we want to minimize it)
  • Gradient: Direction of steepest uphill (we feel the slope)
  • Gradient descent: Walk in opposite direction (downhill)
  • Learning rate: Step size (how big steps you take)

Too small steps: Takes forever to reach bottom

Too large steps: Might overshoot and miss the bottom
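
The step-size trade-off is easy to see on a one-dimensional loss. The sketch below minimizes the illustrative function f(w) = (w - 3)², whose gradient is 2(w - 3), with two different learning rates:

# Gradient descent on f(w) = (w - 3)^2 with a small and a large step size
for lr in [0.1, 1.1]:
    w = 0.0
    for _ in range(10):
        w -= lr * 2 * (w - 3)
    print(f"lr = {lr}: w = {w:.3f}")
# lr = 0.1 converges toward the minimum at w = 3;
# lr = 1.1 overshoots further each step and diverges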

Complete Training Loop

import numpy as np

def train_network(network, X_train, y_train, epochs=1000, learning_rate=0.01):
    """
    Complete training loop
    
    Parameters:
    network: NeuralNetwork instance
    X_train: Training inputs (n_features, n_samples)
    y_train: Training targets (n_outputs, n_samples)
    epochs: Number of training iterations
    learning_rate: Step size for gradient descent
    """
    losses = []
    
    for epoch in range(epochs):
        # Forward pass
        activations, z_values = network.forward(X_train)
        
        # Compute loss
        loss = np.mean((activations[-1] - y_train)**2)
        losses.append(loss)
        
        # Backward pass
        grads_W, grads_b = network.backward(X_train, y_train, activations, z_values)
        
        # Update weights
        network.update_weights(grads_W, grads_b, learning_rate)
        
        # Print progress
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.6f}")
    
    return losses

# Example: Learn XOR function
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])  # shape (2, 4): each column is one sample
y = np.array([[0, 1, 1, 0]])  # (1, 4) - XOR output

network = NeuralNetwork([2, 4, 1], activation='sigmoid')
losses = train_network(network, X, y, epochs=1000, learning_rate=0.5)

# Test
predictions = network.forward(X)[0][-1]
print("\nPredictions:", predictions.flatten())
print("True values:", y.flatten())

Complete Implementation

Full Backpropagation Implementation

import numpy as np

class FeedforwardNetwork:
    """Complete Feedforward Network with Backpropagation"""
    
    def __init__(self, layer_sizes, activation='relu', init_method='he'):
        self.layer_sizes = layer_sizes
        self.activation = activation
        self.weights = []
        self.biases = []
        
        # Initialize
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i+1]
            if init_method == 'he':
                W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
            else:
                W = np.random.randn(n_out, n_in) * 0.01
            b = np.zeros((n_out, 1))
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z):
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            z = np.clip(z, -250, 250)
            return 1 / (1 + np.exp(-z))
        return z
    
    def _activate_derivative(self, z):
        if self.activation == 'relu':
            return (z > 0).astype(float)
        elif self.activation == 'sigmoid':
            s = self._activate(z)
            return s * (1 - s)
        return np.ones_like(z)
    
    def forward(self, X):
        """Forward propagation"""
        activations = [X]
        z_values = []
        current = X
        
        for W, b in zip(self.weights, self.biases):
            z = np.dot(W, current) + b
            z_values.append(z)
            a = self._activate(z)
            activations.append(a)
            current = a
        
        return activations, z_values
    
    def backward(self, X, y_true, activations, z_values, loss='mse'):
        """Backpropagation"""
        m = X.shape[1]  # batch size
        grads_W = []
        grads_b = []
        
        # Output layer error (depends on loss function)
        a_L = activations[-1]
        if loss == 'mse':
            # dL/da, then through the output activation
            delta = (a_L - y_true) / m
            delta = delta * self._activate_derivative(z_values[-1])
        elif loss == 'cross_entropy':
            # For cross-entropy with a sigmoid output, the sigmoid
            # derivative cancels: (a_L - y_true) is already dL/dz
            delta = (a_L - y_true) / m
        
        # Backpropagate (delta holds dL/dz for the current layer)
        for l in range(len(self.weights) - 1, -1, -1):
            # Compute gradients
            grad_W = np.dot(delta, activations[l].T)
            grad_b = np.sum(delta, axis=1, keepdims=True)
            
            grads_W.insert(0, grad_W)
            grads_b.insert(0, grad_b)
            
            # Propagate to previous layer and through its activation
            if l > 0:
                delta = np.dot(self.weights[l].T, delta)
                delta = delta * self._activate_derivative(z_values[l - 1])
        
        return grads_W, grads_b
    
    def update(self, grads_W, grads_b, learning_rate):
        """Update weights"""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]
    
    def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
        """Training loop"""
        losses = []
        for epoch in range(epochs):
            activations, z_values = self.forward(X)
            loss = np.mean((activations[-1] - y)**2)
            losses.append(loss)
            
            grads_W, grads_b = self.backward(X, y, activations, z_values)
            self.update(grads_W, grads_b, learning_rate)
            
            if verbose and epoch % max(1, epochs // 10) == 0:
                print(f"Epoch {epoch}: Loss = {loss:.6f}")
        
        return losses

# Usage example
np.random.seed(42)
network = FeedforwardNetwork([2, 4, 1], activation='sigmoid')
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 1, 0]])
losses = network.train(X, y, epochs=1000, learning_rate=0.5)

Test Your Understanding

Question 1: Why is backpropagation called "backward" propagation?

A) Because it's slower than forward propagation
B) Because it computes gradients starting from the output layer and moving backward to the input
C) Because it reverses the network architecture
D) Because it's done after forward propagation

Question 2: What mathematical principle is backpropagation based on?

A) Chain rule from calculus
B) Matrix multiplication
C) Linear algebra
D) Probability theory

Question 3: In gradient descent, why do we subtract the gradient from weights?

A) To increase the loss
B) To move in the direction that decreases the loss
C) To normalize the weights
D) To prevent overfitting

Question 4: Explain how backpropagation uses the chain rule.

A) The chain rule lets us compute gradients layer by layer. We start with the output layer gradient, then multiply by each layer's derivative going backward, chaining the gradients together to get the final weight updates
B) It multiplies all gradients at once
C) It only uses the output gradient
D) It doesn't use chain rule

Question 5: What happens if the learning rate is too large during gradient descent?

A) The network overshoots the optimal weights, causing loss to increase or oscillate, and may never converge
B) Training becomes faster
C) Nothing changes
D) It prevents overfitting

Question 6: How do you compute the gradient for a weight in the hidden layer?

A) Multiply the error from the next layer by the activation derivative and the input to that weight, then propagate backward using chain rule
B) Use only the output error
C) Random value
D) Same as output layer

Question 7: What is the formula for updating weights using gradient descent?

A) \(W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}\) where alpha is learning rate and L is loss
B) \(W_{new} = W_{old} + \alpha\)
C) \(W_{new} = \alpha \times W_{old}\)
D) \(W_{new} = W_{old} \times \frac{\partial L}{\partial W}\)

Question 8: Why do we need to store activations during forward propagation for backpropagation?

A) We need the activations to compute the derivatives of the activation functions and to multiply with gradients when applying the chain rule
B) They're not needed
C) Only for debugging
D) To save memory

Question 9: What is the difference between batch gradient descent and stochastic gradient descent?

A) Batch uses all training examples to compute gradient before updating weights, while stochastic updates weights after each single example, making it faster but noisier
B) They're the same
C) Batch is always better
D) Stochastic uses all data

Question 10: How does backpropagation handle bias gradients?

A) Bias gradients are computed similarly to weight gradients, but since bias is added (not multiplied), its gradient is just the error signal from the next layer
B) Biases don't have gradients
C) Same as weight gradients
D) Biases are not updated

Question 11: What causes vanishing gradients in deep networks?

A) When gradients become extremely small as they propagate backward through many layers, often due to activation functions with small derivatives like sigmoid, causing early layers to learn very slowly
B) Too many neurons
C) Large learning rate
D) Too much data

Question 12: How would you debug a network that's not learning during backpropagation?

A) Check gradient magnitudes (should not be zero or extremely small), verify loss is decreasing, check learning rate, inspect weight updates, ensure activations are in reasonable range, verify backprop implementation is correct
B) Increase learning rate
C) Add more layers
D) Use more data