Chapter 4: Backpropagation Algorithm
The Learning Mechanism - Understanding how neural networks learn by computing gradients and updating weights
Learning Objectives
- Understand the chain rule and its role in backpropagation
- Master loss functions and error computation
- Learn the backward pass step-by-step
- Understand gradient computation for each layer
- Implement backpropagation from scratch
- Understand vanishing and exploding gradients
What is Backpropagation?
The Learning Algorithm
Backpropagation (backward propagation of errors) is the algorithm that enables neural networks to learn. It computes gradients of the loss function with respect to each weight and bias, allowing the network to update its parameters and improve predictions.
The Process:
- Forward Pass: Compute predictions using current weights
- Compute Loss: Measure how wrong predictions are
- Backward Pass: Compute gradients (how to change weights)
- Update Weights: Adjust weights to reduce loss
- Repeat: Continue until loss is minimized
📚 Real-World Analogy: Learning to Throw
Think of learning to throw a ball into a basket:
- Forward Pass: You throw the ball (make a prediction)
- Loss: You see how far off you were (measure error)
- Backpropagation: You figure out what to adjust (angle, force, etc.)
- Update: You adjust your throwing technique
- Repeat: Keep practicing until you get it right!
Key Insight: Backpropagation tells you exactly how to adjust each "muscle" (weight) to improve your "throw" (prediction)!
Why "Backward"?
Backpropagation works backward through the network because:
- Error starts at output: We know the error at the final layer
- Propagate backward: We compute how much each previous layer contributed to the error
- Chain rule: Errors flow backward through the network using calculus
- Efficient: One backward pass computes all gradients at once
The Chain Rule: Foundation of Backpropagation
🔗 Mathematical Foundation
The chain rule from calculus is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions, which is exactly what neural networks are!
Chain Rule for Single Variable
If \(y = f(g(x))\), then:
\[\frac{dy}{dx} = \frac{df}{dg} \times \frac{dg}{dx}\]Intuition:
To find how y changes with x, we need to know:
- How y changes with g (df/dg)
- How g changes with x (dg/dx)
Multiply them together!
Simple Example
Example: y = (2x + 1)³
Let g = 2x + 1, so y = g³
Using chain rule:
- dy/dg = 3g² = 3(2x + 1)²
- dg/dx = 2
- dy/dx = (dy/dg) × (dg/dx) = 3(2x + 1)² × 2 = 6(2x + 1)²
Verification: Direct differentiation gives the same result!
Chain Rule for Multiple Variables
For neural networks, we need the multivariable chain rule:
If \(z = f(x, y)\) where \(x = g(t)\) and \(y = h(t)\), then:
\[\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}\]In Neural Networks:
- Each layer's output depends on previous layers
- Error propagates backward through all paths
- We sum contributions from all paths (sum rule)
Chain Rule Visualization
import numpy as np
# Example: Computing gradient through a simple network
# y = f(g(h(x)))
# where h(x) = x², g(u) = 2u + 1, f(v) = v³
def h(x):
return x**2
def g(u):
return 2*u + 1
def f(v):
return v**3
# Forward pass
x = 3
u = h(x) # u = 9
v = g(u) # v = 19
y = f(v) # y = 6859
# Backward pass (chain rule)
df_dv = 3 * v**2 # = 3 * 19² = 1083
dg_du = 2
dh_dx = 2 * x # = 2 * 3 = 6
# Chain rule: dy/dx = (df/dv) × (dg/du) × (dh/dx)
dy_dx = df_dv * dg_du * dh_dx
print(f"dy/dx = {dy_dx}")
# Verification: y = (2x² + 1)³
# dy/dx = 3(2x² + 1)² × 4x = 12x(2x² + 1)²
# At x=3: 12×3×(2×9+1)² = 36×19² = 12996 ✓
Loss Functions: Measuring Error
📏 What is a Loss Function?
A loss function measures how wrong our predictions are. It quantifies the difference between predicted and actual values. The goal of training is to minimize this loss.
Common Loss Functions
1. Mean Squared Error (MSE) - For Regression
Properties:
- Always positive: Squared terms ensure L ≥ 0
- Penalizes large errors: Squaring amplifies big mistakes
- Differentiable: Smooth curve, easy to optimize
- Best for: Regression tasks (predicting continuous values)
2. Cross-Entropy Loss - For Classification
Binary Cross-Entropy:
\[L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{\text{true}} \log(y_{\text{pred}}) + (1-y_{\text{true}}) \log(1-y_{\text{pred}}) \right]\]Multi-class Cross-Entropy:
\[L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{\text{true},c} \log(y_{\text{pred},c})\]Why Cross-Entropy?
- Works with probabilities: Designed for classification
- Large penalty for confident wrong predictions: If y_pred ≈ 0 but y_true = 1, loss → ∞
- Small penalty for correct predictions: If y_pred ≈ y_true, loss → 0
- Best for: Classification tasks
Loss Function Implementation
import numpy as np
def mse_loss(y_pred, y_true):
"""
Mean Squared Error
Parameters:
y_pred: Predicted values (n_samples,)
y_true: True values (n_samples,)
"""
return np.mean((y_pred - y_true)**2)
def mse_loss_derivative(y_pred, y_true):
"""Derivative of MSE with respect to y_pred"""
return 2 * (y_pred - y_true) / len(y_pred)
def binary_cross_entropy(y_pred, y_true, epsilon=1e-15):
"""
Binary Cross-Entropy Loss
Parameters:
y_pred: Predicted probabilities (n_samples,)
y_true: True labels (0 or 1) (n_samples,)
epsilon: Small value to prevent log(0)
"""
# Clip to prevent log(0)
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
def binary_cross_entropy_derivative(y_pred, y_true, epsilon=1e-15):
"""Derivative of BCE with respect to y_pred"""
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_pred)
def cross_entropy(y_pred, y_true, epsilon=1e-15):
"""
Multi-class Cross-Entropy Loss
Parameters:
y_pred: Predicted probabilities (n_samples, n_classes)
y_true: True labels (one-hot encoded) (n_samples, n_classes)
"""
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
# Example usage
y_pred_reg = np.array([2.1, 3.5, 1.8])
y_true_reg = np.array([2.0, 3.0, 2.0])
print(f"MSE: {mse_loss(y_pred_reg, y_true_reg):.4f}")
y_pred_cls = np.array([0.9, 0.1, 0.8])
y_true_cls = np.array([1, 0, 1])
print(f"BCE: {binary_cross_entropy(y_pred_cls, y_true_cls):.4f}")
The Backward Pass: Computing Gradients
⬅️ Working Backward Through Layers
The backward pass computes gradients starting from the output layer and moving backward to the input layer. We use the chain rule to propagate errors through each layer.
Step-by-Step Backward Pass
1. Output Layer Gradient
Start with the loss gradient at the output:
For MSE Loss:
- L = (1/2)(y_pred - y_true)²
- ∂L/∂y_pred = y_pred - y_true (error signal!)
2. Gradient Through Activation
Apply chain rule through activation function:
Example with Sigmoid:
- If a = σ(z), then ∂a/∂z = σ(z)(1 - σ(z))
- ∂L/∂z = (∂L/∂a) × σ(z)(1 - σ(z))
3. Gradient for Weights and Biases
Compute gradients for parameters:
Intuition:
- Weight gradient: How much did each input contribute to the error?
- Bias gradient: Sum of errors (bias affects all outputs equally)
4. Propagate to Previous Layer
Compute error for previous layer:
This becomes the error signal for layer L-1:
- Weights transpose: "distribute" error back to previous layer
- This error signal is used to compute gradients for layer L-1
- Process repeats for all layers
Complete Example: 2-Layer Network
Network: 2 inputs → 3 hidden → 1 output
Forward Pass (already computed):
- a⁽⁰⁾ = [0.5, 0.8] (input)
- z⁽¹⁾ = [0.39, 0.62, 0.85] (hidden pre-activation)
- a⁽¹⁾ = [0.39, 0.62, 0.85] (hidden activation, using ReLU)
- z⁽²⁾ = 1.076 (output pre-activation)
- a⁽²⁾ = 0.746 (output, using sigmoid)
True value: y_true = 1.0
Backward Pass:
- Output layer error: ∂L/∂a⁽²⁾ = a⁽²⁾ - y_true = 0.746 - 1.0 = -0.254
- Through sigmoid: ∂L/∂z⁽²⁾ = -0.254 × 0.746(1-0.746) = -0.254 × 0.190 = -0.048
- Weight gradient: ∂L/∂W⁽²⁾ = -0.048 × a⁽¹⁾ = -0.048 × [0.39, 0.62, 0.85]
- Bias gradient: ∂L/∂b⁽²⁾ = -0.048
- Propagate to hidden: ∂L/∂a⁽¹⁾ = W⁽²⁾ᵀ × (-0.048) = [0.4, 0.5, 0.6]ᵀ × (-0.048)
- Continue for layer 1...
Gradient Computation: The Math
Complete Gradient Formulas
For a network with L layers:
Output Layer (L):
Where ⊙ is element-wise multiplication
Hidden Layers (l = L-1, L-2, ..., 1):
Weight Gradients:
Bias Gradients:
Notation:
- δ⁽ˡ⁾: Error signal (gradient w.r.t. pre-activation) for layer l
- f'⁽ˡ⁾: Derivative of activation function for layer l
- ⊙: Element-wise (Hadamard) product
Backpropagation Implementation
import numpy as np
class NeuralNetwork:
"""Neural Network with Backpropagation"""
def __init__(self, layer_sizes, activation='relu'):
self.layer_sizes = layer_sizes
self.activation = activation
self.weights = []
self.biases = []
# Initialize weights (He initialization for ReLU)
for i in range(len(layer_sizes) - 1):
W = np.random.randn(layer_sizes[i+1], layer_sizes[i]) * np.sqrt(2.0 / layer_sizes[i])
b = np.zeros((layer_sizes[i+1], 1))
self.weights.append(W)
self.biases.append(b)
def _activate(self, z):
"""Activation function"""
if self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'sigmoid':
z = np.clip(z, -250, 250)
return 1 / (1 + np.exp(-z))
return z
def _activate_derivative(self, z):
"""Derivative of activation function"""
if self.activation == 'relu':
return (z > 0).astype(float)
elif self.activation == 'sigmoid':
s = self._activate(z)
return s * (1 - s)
return np.ones_like(z)
def forward(self, X):
"""Forward propagation"""
activations = [X]
z_values = []
current = X
for W, b in zip(self.weights, self.biases):
z = np.dot(W, current) + b
z_values.append(z)
a = self._activate(z)
activations.append(a)
current = a
return activations, z_values
def backward(self, X, y_true, activations, z_values):
"""
Backpropagation
Returns gradients for weights and biases
"""
m = X.shape[1] # batch size
grads_W = []
grads_b = []
# Output layer error
a_L = activations[-1]
delta = a_L - y_true # For MSE loss
# Backpropagate through layers
for l in range(len(self.weights) - 1, -1, -1):
# Gradient through activation
delta = delta * self._activate_derivative(z_values[l])
# Weight and bias gradients
grad_W = np.dot(delta, activations[l].T) / m
grad_b = np.sum(delta, axis=1, keepdims=True) / m
grads_W.insert(0, grad_W)
grads_b.insert(0, grad_b)
# Propagate error to previous layer (if not first layer)
if l > 0:
delta = np.dot(self.weights[l].T, delta)
return grads_W, grads_b
def update_weights(self, grads_W, grads_b, learning_rate=0.01):
"""Update weights using gradients"""
for i in range(len(self.weights)):
self.weights[i] -= learning_rate * grads_W[i]
self.biases[i] -= learning_rate * grads_b[i]
def train_step(self, X, y_true, learning_rate=0.01):
"""One training step: forward + backward + update"""
activations, z_values = self.forward(X)
grads_W, grads_b = self.backward(X, y_true, activations, z_values)
self.update_weights(grads_W, grads_b, learning_rate)
# Return loss
loss = np.mean((activations[-1] - y_true)**2)
return loss
# Example usage
network = NeuralNetwork([2, 3, 1], activation='sigmoid')
X = np.array([[0.5, 0.8]]).T
y = np.array([[1.0]])
# Training loop
for epoch in range(100):
loss = network.train_step(X, y, learning_rate=0.1)
if epoch % 20 == 0:
print(f"Epoch {epoch}, Loss: {loss:.4f}")
Weight Updates: Gradient Descent
⬇️ Minimizing the Loss
Once we have gradients, we update weights to reduce the loss. We move in the direction opposite to the gradient (gradient descent).
Gradient Descent Update Rule
Where η (eta) is the learning rate
Why Subtract?
- Gradient points uphill: Direction of steepest increase
- We want to go downhill: Direction of steepest decrease
- Solution: Move in opposite direction (subtract gradient)
📚 Analogy: Finding the Bottom of a Valley
Imagine you're blindfolded in a valley and want to reach the bottom:
- Loss function: Height of the valley (we want to minimize it)
- Gradient: Direction of steepest uphill (we feel the slope)
- Gradient descent: Walk in opposite direction (downhill)
- Learning rate: Step size (how big steps you take)
Too small steps: Takes forever to reach bottom
Too large steps: Might overshoot and miss the bottom
Complete Training Loop
import numpy as np
def train_network(network, X_train, y_train, epochs=1000, learning_rate=0.01):
"""
Complete training loop
Parameters:
network: NeuralNetwork instance
X_train: Training inputs (n_features, n_samples)
y_train: Training targets (n_outputs, n_samples)
epochs: Number of training iterations
learning_rate: Step size for gradient descent
"""
losses = []
for epoch in range(epochs):
# Forward pass
activations, z_values = network.forward(X_train)
# Compute loss
loss = np.mean((activations[-1] - y_train)**2)
losses.append(loss)
# Backward pass
grads_W, grads_b = network.backward(X_train, y_train, activations, z_values)
# Update weights
network.update_weights(grads_W, grads_b, learning_rate)
# Print progress
if epoch % 100 == 0:
print(f"Epoch {epoch}, Loss: {loss:.6f}")
return losses
# Example: Learn XOR function
X = np.array([[0, 0, 1, 1],
[0, 1, 0, 1]]).T.T # (2, 4)
y = np.array([[0, 1, 1, 0]]) # (1, 4) - XOR output
network = NeuralNetwork([2, 4, 1], activation='sigmoid')
losses = train_network(network, X, y, epochs=1000, learning_rate=0.5)
# Test
predictions = network.forward(X)[0][-1]
print("\nPredictions:", predictions.flatten())
print("True values:", y.flatten())
Complete Implementation
Full Backpropagation Implementation
import numpy as np
class FeedforwardNetwork:
"""Complete Feedforward Network with Backpropagation"""
def __init__(self, layer_sizes, activation='relu', init_method='he'):
self.layer_sizes = layer_sizes
self.activation = activation
self.weights = []
self.biases = []
# Initialize
for i in range(len(layer_sizes) - 1):
n_in, n_out = layer_sizes[i], layer_sizes[i+1]
if init_method == 'he':
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
else:
W = np.random.randn(n_out, n_in) * 0.01
b = np.zeros((n_out, 1))
self.weights.append(W)
self.biases.append(b)
def _activate(self, z):
if self.activation == 'relu':
return np.maximum(0, z)
elif self.activation == 'sigmoid':
z = np.clip(z, -250, 250)
return 1 / (1 + np.exp(-z))
return z
def _activate_derivative(self, z):
if self.activation == 'relu':
return (z > 0).astype(float)
elif self.activation == 'sigmoid':
s = self._activate(z)
return s * (1 - s)
return np.ones_like(z)
def forward(self, X):
"""Forward propagation"""
activations = [X]
z_values = []
current = X
for W, b in zip(self.weights, self.biases):
z = np.dot(W, current) + b
z_values.append(z)
a = self._activate(z)
activations.append(a)
current = a
return activations, z_values
def backward(self, X, y_true, activations, z_values, loss='mse'):
"""Backpropagation"""
m = X.shape[1] # batch size
grads_W = []
grads_b = []
# Output layer error (depends on loss function)
a_L = activations[-1]
if loss == 'mse':
delta = (a_L - y_true) / m
elif loss == 'cross_entropy':
# For cross-entropy with sigmoid output
delta = (a_L - y_true) / m
# Backpropagate
for l in range(len(self.weights) - 1, -1, -1):
# Gradient through activation
delta = delta * self._activate_derivative(z_values[l])
# Compute gradients
grad_W = np.dot(delta, activations[l].T)
grad_b = np.sum(delta, axis=1, keepdims=True)
grads_W.insert(0, grad_W)
grads_b.insert(0, grad_b)
# Propagate to previous layer
if l > 0:
delta = np.dot(self.weights[l].T, delta)
return grads_W, grads_b
def update(self, grads_W, grads_b, learning_rate):
"""Update weights"""
for i in range(len(self.weights)):
self.weights[i] -= learning_rate * grads_W[i]
self.biases[i] -= learning_rate * grads_b[i]
def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
"""Training loop"""
losses = []
for epoch in range(epochs):
activations, z_values = self.forward(X)
loss = np.mean((activations[-1] - y)**2)
losses.append(loss)
grads_W, grads_b = self.backward(X, y, activations, z_values)
self.update(grads_W, grads_b, learning_rate)
if verbose and epoch % (epochs // 10) == 0:
print(f"Epoch {epoch}: Loss = {loss:.6f}")
return losses
# Usage example
np.random.seed(42)
network = FeedforwardNetwork([2, 4, 1], activation='sigmoid')
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 1, 0]])
losses = network.train(X, y, epochs=1000, learning_rate=0.5)