Chapter 2: Feedforward Networks & Forward Propagation
Understanding how information flows through neural networks layer by layer
Learning Objectives
- Understand the architecture of feedforward neural networks
- Master forward propagation step-by-step
- Learn matrix operations in neural networks
- Understand weight initialization strategies
- Implement forward propagation from scratch
- Visualize information flow through layers
What is a Feedforward Network?
Information Flow: One Direction Only
A feedforward neural network is called "feedforward" because information flows in only one direction: from input → hidden layers → output. There are no loops or cycles - data moves forward through the network like water flowing down a river.
Key Characteristics:
- Unidirectional: Information flows from input to output only
- No Feedback: Outputs don't feed back into earlier layers
- Layered Structure: Organized into distinct layers (input, hidden, output)
- Fully Connected: Each neuron connects to all neurons in the next layer
📚 Real-World Analogy: Assembly Line
Think of a feedforward network like an assembly line in a factory:
- Input Layer: Raw materials arrive (like car parts)
- Hidden Layer 1: First processing station (workers assemble engine)
- Hidden Layer 2: Second processing station (workers add body)
- Output Layer: Final product (complete car)
Key Point: Just like an assembly line, information moves in one direction only - you can't go backwards! Each layer processes the output from the previous layer and passes it forward.
Why "Feedforward"?
The term "feedforward" distinguishes these networks from other types:
| Network Type | Information Flow | Example |
|---|---|---|
| Feedforward | Input → Hidden → Output (one direction) | Image classification |
| Recurrent (RNN) | Has loops, information cycles back | Text generation, time series |
| Convolutional (CNN) | Feedforward with special layer types | Image recognition |
Understanding Network Layers
🏗️ The Building Blocks
A neural network is organized into layers, each serving a specific purpose:
1. Input Layer
- Purpose: Receives the raw input data
- Size: Number of input features (e.g., 784 for 28×28 images)
- No Computation: Just passes data to the next layer
- Example: For house price prediction, inputs might be: [size, bedrooms, age, location]
2. Hidden Layers
- Purpose: Learn complex patterns and feature combinations
- Number: Can have 1 to 100+ hidden layers (depth)
- Size: Number of neurons per layer (width)
- Computation: Performs weighted sums and activations
- Example: A hidden layer might learn: "houses with 3+ bedrooms AND size > 2000 sqft are expensive"
3. Output Layer
- Purpose: Produces the final prediction
- Size: Depends on task (1 for regression, N for N-class classification)
- Activation: Different from hidden layers (sigmoid for binary, softmax for multi-class)
- Example: For classification: [0.1, 0.8, 0.1] means 80% confidence in class 2
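The softmax activation mentioned above is what turns raw output scores into a probability vector like [0.1, 0.8, 0.1]. A minimal sketch (the score values here are made up purely for illustration):
import numpy as np

def softmax(z):
    # Subtract the max score for numerical stability, then normalize
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Hypothetical raw output-layer scores for a 3-class problem
scores = np.array([0.2, 2.3, 0.2])
print(softmax(scores))  # ≈ [0.10, 0.80, 0.10] → 80% confidence in class 2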
Layer Notation
We use superscripts to denote layers:
\[a^{(1)} = \text{First hidden layer output}\]
\[a^{(2)} = \text{Second hidden layer output}\]
\[a^{(L)} = \text{Output layer (L = number of layers)}\]
Why This Notation?
- a stands for "activation" (the output of a layer)
- Superscript number tells us which layer we're talking about
- a⁽⁰⁾ is special - it's the input, not computed by the network
- For a 3-layer network: a⁽⁰⁾ → a⁽¹⁾ → a⁽²⁾ (input → hidden → output)
Concrete Example: 3-Layer Network
Architecture: 4 inputs → 5 hidden neurons → 3 outputs
Layer Sizes:
- Input Layer (a⁽⁰⁾): 4 neurons (e.g., [size, bedrooms, age, location])
- Hidden Layer (a⁽¹⁾): 5 neurons (learns feature combinations)
- Output Layer (a⁽²⁾): 3 neurons (e.g., [low_price, medium_price, high_price])
Total Parameters:
- Weights from input to hidden: 4 × 5 = 20 weights
- Biases for hidden layer: 5 biases
- Weights from hidden to output: 5 × 3 = 15 weights
- Biases for output layer: 3 biases
- Total: 20 + 5 + 15 + 3 = 43 parameters
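As a sanity check, the same count can be computed directly from the layer sizes; a minimal sketch:
# Parameter count for the 4 → 5 → 3 architecture above
layer_sizes = [4, 5, 3]
total = sum(n_in * n_out + n_out  # weights plus biases for each layer
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(total)  # 43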
Forward Propagation: Step by Step
What is Forward Propagation?
Forward propagation is the process of passing input data through the network to compute the output. It's called "forward" because we move from input to output, computing each layer's activations in sequence.
The Process:
- Start with input features
- For each layer, compute weighted sum + bias
- Apply activation function
- Use this output as input to next layer
- Repeat until reaching output layer
Mathematical Formulation
Forward Propagation Formula
For each layer l = 1, 2, ..., L:
Step 1: Compute Pre-activation (Weighted Sum)
\[z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}\]
Step 2: Apply Activation Function
\[a^{(l)} = f(z^{(l)})\]
Detailed Breakdown:
- z⁽ˡ⁾: Pre-activation vector (before applying activation function)
- W⁽ˡ⁾: Weight matrix for layer l (rows = neurons in layer l, columns = neurons in layer l-1)
- a⁽ˡ⁻¹⁾: Activations from previous layer (input to current layer)
- b⁽ˡ⁾: Bias vector for layer l
- f(·): Activation function (ReLU, sigmoid, tanh, etc.)
- a⁽ˡ⁾: Final activations (output of layer l)
Step-by-Step Example
Simple 2-Layer Network (one hidden layer plus one output layer): 2 inputs → 3 hidden → 1 output
Input: x = [0.5, 0.8]
Layer 1 (Hidden Layer):
Weight matrix W⁽¹⁾ (3×2):
W⁽¹⁾ = [0.1 0.3]
       [0.2 0.4]
       [0.3 0.5]
Bias b⁽¹⁾ = [0.1, 0.2, 0.3]
Step 1: Compute z⁽¹⁾
z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾
z⁽¹⁾ = [0.1×0.5 + 0.3×0.8 + 0.1, 0.2×0.5 + 0.4×0.8 + 0.2, 0.3×0.5 + 0.5×0.8 + 0.3]
z⁽¹⁾ = [0.05 + 0.24 + 0.1, 0.10 + 0.32 + 0.2, 0.15 + 0.40 + 0.3]
z⁽¹⁾ = [0.39, 0.62, 0.85]
Step 2: Apply ReLU activation
a⁽¹⁾ = ReLU(z⁽¹⁾) = [max(0, 0.39), max(0, 0.62), max(0, 0.85)]
a⁽¹⁾ = [0.39, 0.62, 0.85]
Layer 2 (Output Layer):
Weight matrix W⁽²⁾ (1×3): W⁽²⁾ = [0.4, 0.5, 0.6]
Bias b⁽²⁾ = [0.1]
Step 1: Compute z⁽²⁾
z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾
z⁽²⁾ = 0.4×0.39 + 0.5×0.62 + 0.6×0.85 + 0.1
z⁽²⁾ = 0.156 + 0.310 + 0.510 + 0.1
z⁽²⁾ = 1.076
Step 2: Apply sigmoid activation
a⁽²⁾ = σ(z⁽²⁾) = 1 / (1 + e^(-1.076))
a⁽²⁾ ≈ 0.746
Final Output: 0.746 (74.6% confidence in positive class)
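The same calculation can be reproduced in a few lines of NumPy, which is a handy way to verify the arithmetic above:
import numpy as np

x  = np.array([0.5, 0.8])
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4],
               [0.3, 0.5]])
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([[0.4, 0.5, 0.6]])
b2 = np.array([0.1])

z1 = W1 @ x + b1            # [0.39, 0.62, 0.85]
a1 = np.maximum(0, z1)      # ReLU (unchanged here, all values are positive)
z2 = W2 @ a1 + b2           # [1.076]
a2 = 1 / (1 + np.exp(-z2))  # sigmoid
print(z1, z2, a2)           # a2 ≈ 0.746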
Why Forward Propagation Matters
Forward propagation is the foundation of neural network computation:
- Prediction: Used every time you make a prediction (inference)
- Training: Required before backpropagation (need to compute error)
- Understanding: Helps visualize how networks process information
- Debugging: Can check intermediate values to find problems
Matrix Operations in Neural Networks
Why Matrices?
Neural networks use matrix operations because they are extremely efficient: instead of computing each neuron one by one, we can process entire layers (and whole batches of examples) simultaneously with a single matrix multiplication, as the comparison sketch after the list below shows.
Benefits of Matrix Operations:
- Parallelization: GPUs excel at matrix operations
- Efficiency: Optimized linear algebra libraries (BLAS, cuBLAS)
- Simplicity: Clean, concise code
- Speed: Can process thousands of examples at once (batch processing)
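To make the speed difference concrete, here is a rough comparison of a per-neuron Python loop against a single matrix multiply; the sizes are arbitrary and exact timings depend on your hardware and BLAS build:
import time
import numpy as np

W = np.random.randn(256, 784)    # one layer: 784 inputs → 256 neurons
X = np.random.randn(784, 1000)   # batch of 1000 examples

# Explicit loops over neurons and examples
start = time.perf_counter()
Z_loop = np.empty((256, 1000))
for i in range(256):
    for j in range(1000):
        Z_loop[i, j] = W[i] @ X[:, j]
print("loops :", time.perf_counter() - start)

# One vectorized matrix multiplication
start = time.perf_counter()
Z_mat = W @ X
print("matmul:", time.perf_counter() - start)

print(np.allclose(Z_loop, Z_mat))  # same numbers, very different speed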
Matrix Dimensions
Understanding matrix dimensions is crucial:
For layer \(l\):
\[W^{(l)}: (n^{(l)} \times n^{(l-1)}) \text{ matrix}\]
\[a^{(l-1)}: (n^{(l-1)} \times m) \text{ matrix (m = batch size)}\]
\[b^{(l)}: (n^{(l)} \times 1) \text{ vector}\]
\[z^{(l)}: (n^{(l)} \times m) \text{ matrix}\]
\[a^{(l)}: (n^{(l)} \times m) \text{ matrix}\]
Dimension Rules:
- n⁽ˡ⁾: Number of neurons in layer l
- m: Batch size (number of examples processed together)
- W⁽ˡ⁾ × a⁽ˡ⁻¹⁾: (n⁽ˡ⁾ × n⁽ˡ⁻¹⁾) × (n⁽ˡ⁻¹⁾ × m) = (n⁽ˡ⁾ × m) ✓
- Broadcasting: Bias b⁽ˡ⁾ is automatically broadcast to all examples in batch
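A quick shape check in NumPy makes these rules concrete (the layer sizes here are arbitrary):
import numpy as np

n_prev, n, m = 4, 5, 8                # previous layer, current layer, batch size
W = np.random.randn(n, n_prev)        # (5, 4)
a_prev = np.random.randn(n_prev, m)   # (4, 8)
b = np.zeros((n, 1))                  # (5, 1), broadcast across the batch

z = W @ a_prev + b
print(z.shape)                        # (5, 8) = (n⁽ˡ⁾ × m)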
Matrix Multiplication Example
Single Example (m=1):
Input: a⁽⁰⁾ = [2, 3]ᵀ (2×1 vector)
Weight Matrix W⁽¹⁾ (3×2):
W⁽¹⁾ = [w₁₁ w₁₂]   [0.1 0.2]
       [w₂₁ w₂₂] = [0.3 0.4]
       [w₃₁ w₃₂]   [0.5 0.6]
Bias: b⁽¹⁾ = [0.1, 0.2, 0.3]ᵀ
Computation:
z⁽¹⁾ = W⁽¹⁾a⁽⁰⁾ + b⁽¹⁾
z⁽¹⁾ = [0.1 0.2] [2]   [0.1]
       [0.3 0.4] [3] + [0.2]
       [0.5 0.6]       [0.3]
     = [0.1×2 + 0.2×3]   [0.1]
       [0.3×2 + 0.4×3] + [0.2]
       [0.5×2 + 0.6×3]   [0.3]
     = [0.8]   [0.1]   [0.9]
       [1.8] + [0.2] = [2.0]
       [2.8]   [0.3]   [3.1]
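The same result falls out of a single matrix product in NumPy:
import numpy as np

W1 = np.array([[0.1, 0.2],
               [0.3, 0.4],
               [0.5, 0.6]])
a0 = np.array([[2], [3]])              # (2, 1) column vector
b1 = np.array([[0.1], [0.2], [0.3]])

print((W1 @ a0 + b1).ravel())          # [0.9 2.0 3.1]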
Batch Processing (m=3):
Process 3 examples at once:
Input Batch: a⁽⁰⁾ = [[2, 3], [1, 4], [3, 1]]ᵀ (2×3 matrix)
Computation:
z⁽¹⁾ = W⁽¹⁾a⁽⁰⁾ + b⁽¹⁾
Result: (3×3) matrix - each column is the output for one example!
Matrix Operations in NumPy
import numpy as np

# Example: Forward propagation with matrices
def forward_propagation(X, weights, biases, activation='relu'):
    """
    Forward propagation through a neural network

    Parameters:
    X: Input data (n_features, n_samples)
    weights: List of weight matrices
    biases: List of bias vectors
    activation: Activation function name
    """
    activations = [X]  # Store all layer activations
    current_input = X
    for W, b in zip(weights, biases):
        # Matrix multiplication: z = W @ X + b
        z = np.dot(W, current_input) + b
        # Apply activation function
        if activation == 'relu':
            a = np.maximum(0, z)
        elif activation == 'sigmoid':
            a = 1 / (1 + np.exp(-np.clip(z, -250, 250)))
        elif activation == 'tanh':
            a = np.tanh(z)
        else:
            a = z
        activations.append(a)
        current_input = a  # Output becomes input for next layer
    return activations

# Example usage
# Input: 2 features, 4 samples
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

# Layer 1: 2 inputs → 3 hidden neurons
W1 = np.random.randn(3, 2) * 0.1
b1 = np.zeros((3, 1))

# Layer 2: 3 hidden → 1 output
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros((1, 1))

weights = [W1, W2]
biases = [b1, b2]

# Forward pass
activations = forward_propagation(X, weights, biases, activation='relu')
print("Output shape:", activations[-1].shape)  # (1, 4) - 1 output for each of 4 samples
Code Explanation:
- np.dot(W, X): Matrix multiplication (more efficient than loops)
- Broadcasting: b automatically broadcasts to all samples
- Vectorized Operations: Activation function applied to entire matrix at once
- Batch Processing: Can process multiple examples simultaneously
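The broadcasting rule is worth seeing in isolation; this toy example (unrelated to the network above) adds a (3, 1) bias to a (3, 4) batch of pre-activations:
import numpy as np

z = np.arange(12, dtype=float).reshape(3, 4)   # pretend pre-activations, 4 samples
b = np.array([[10.0], [20.0], [30.0]])         # one bias per neuron

print(z + b)   # b is automatically repeated across all 4 columns (samples)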
Weight Initialization Strategies
🎲 Why Initialization Matters
Weight initialization is crucial for training neural networks! Starting with the wrong weights can cause:
- Vanishing Gradients: Weights too small → gradients shrink to zero
- Exploding Gradients: Weights too large → gradients explode
- No Symmetry Breaking: All weights identical → every neuron computes and learns the same thing
- Slow Convergence: Poor initialization → takes forever to train
Common Initialization Methods
1. Random Initialization
Simple but often problematic: weights are drawn from a fixed-scale distribution (e.g., W = 0.01 × randn), with no regard for layer size.
Problems:
- Weights too large → activation outputs saturate
- Weights too small → gradients vanish
- No consideration of layer size
2. Xavier/Glorot Initialization
Designed for tanh and sigmoid activations:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)\]
or
\[W \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)\]
Intuition:
- n_in: Number of inputs to the layer
- n_out: Number of outputs from the layer
- Goal: Keep variance of activations constant across layers
- Why it works: Prevents activations from growing or shrinking too much
3. He Initialization (for ReLU)
Designed specifically for ReLU activation:
\[W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)\]
Why Different from Xavier?
- ReLU sets half the outputs to zero (only positive values pass through)
- This halves the variance compared to symmetric activations
- He initialization compensates by roughly doubling the variance (2/n_in versus Xavier's 2/(n_in + n_out))
- Result: Maintains variance through ReLU layers
Implementation
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization for tanh/sigmoid"""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, (n_out, n_in))

def he_init(n_in, n_out):
    """He initialization for ReLU"""
    std = np.sqrt(2.0 / n_in)
    return np.random.randn(n_out, n_in) * std

def initialize_network(layer_sizes, init_method='he'):
    """
    Initialize weights for a neural network

    Parameters:
    layer_sizes: List of layer sizes, e.g., [784, 128, 64, 10]
    init_method: 'xavier' or 'he'
    """
    weights = []
    biases = []
    for i in range(len(layer_sizes) - 1):
        n_in = layer_sizes[i]
        n_out = layer_sizes[i + 1]
        if init_method == 'xavier':
            W = xavier_init(n_in, n_out)
        elif init_method == 'he':
            W = he_init(n_in, n_out)
        else:
            W = np.random.randn(n_out, n_in) * 0.01
        b = np.zeros((n_out, 1))
        weights.append(W)
        biases.append(b)
    return weights, biases

# Example: Initialize a network
layer_sizes = [784, 256, 128, 10]  # MNIST: 784 inputs → 10 outputs
weights, biases = initialize_network(layer_sizes, init_method='he')
print(f"Number of layers: {len(weights)}")
for i, (W, b) in enumerate(zip(weights, biases)):
    print(f"Layer {i+1}: W shape {W.shape}, b shape {b.shape}")
Comparison: Different Initializations
| Method | Best For | Variance | Pros | Cons |
|---|---|---|---|---|
| Random | None (avoid) | Fixed | Simple | Often fails |
| Xavier | Tanh, Sigmoid | 2/(n_in + n_out) | Maintains variance | Poor for ReLU |
| He | ReLU, Leaky ReLU | 2/n_in | Best for ReLU | Not for sigmoid |
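The variance claims in the table can be checked empirically. The sketch below (layer width, depth, and batch size chosen arbitrarily) pushes random data through a stack of ReLU layers and prints how the activation scale evolves under a naive small-random initialization versus He initialization:
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 10                         # layer width and number of layers
x = rng.standard_normal((n, 100))          # batch of 100 random inputs

for name, std in [("0.01 * randn", 0.01), ("He, sqrt(2/n)", np.sqrt(2.0 / n))]:
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        a = np.maximum(0, W @ a)           # linear layer followed by ReLU
    print(f"{name:14s} → activation std after {depth} layers: {a.std():.3e}")
In this sketch the naive scale lets the activations collapse toward zero within a few layers, while He initialization keeps them at a roughly constant scale.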
Complete Implementation
Full Feedforward Network Implementation
import numpy as np

class FeedforwardNetwork:
    """Complete Feedforward Neural Network Implementation"""

    def __init__(self, layer_sizes, activation='relu', init_method='he'):
        """
        Initialize network

        Parameters:
        layer_sizes: List of neurons per layer, e.g., [784, 256, 128, 10]
        activation: 'relu', 'sigmoid', or 'tanh'
        init_method: 'he' or 'xavier'
        """
        self.layer_sizes = layer_sizes
        self.activation = activation
        self.weights = []
        self.biases = []

        # Initialize weights and biases
        for i in range(len(layer_sizes) - 1):
            n_in = layer_sizes[i]
            n_out = layer_sizes[i + 1]

            # Weight initialization
            if init_method == 'he':
                W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
            elif init_method == 'xavier':
                limit = np.sqrt(6.0 / (n_in + n_out))
                W = np.random.uniform(-limit, limit, (n_out, n_in))
            else:
                W = np.random.randn(n_out, n_in) * 0.01

            b = np.zeros((n_out, 1))
            self.weights.append(W)
            self.biases.append(b)

    def _activate(self, z):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            # Clip to prevent overflow
            z = np.clip(z, -250, 250)
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'tanh':
            return np.tanh(z)
        else:
            return z

    def forward(self, X):
        """
        Forward propagation

        Parameters:
        X: Input data (n_features, n_samples)

        Returns:
        activations: List of activations for each layer
        """
        activations = [X]  # Input layer
        current_input = X

        # Store intermediate values for backpropagation
        self.z_values = []

        for W, b in zip(self.weights, self.biases):
            # Compute pre-activation
            z = np.dot(W, current_input) + b
            self.z_values.append(z)

            # Apply activation
            a = self._activate(z)
            activations.append(a)

            # Output becomes input for next layer
            current_input = a

        return activations

    def predict(self, X):
        """Make predictions"""
        activations = self.forward(X)
        return activations[-1]

# Example: Create and test network
# MNIST-like: 784 inputs (28×28 image) → 256 hidden → 128 hidden → 10 outputs
network = FeedforwardNetwork(
    layer_sizes=[784, 256, 128, 10],
    activation='relu',
    init_method='he'
)

# Test with random input (simulating 10 images)
X_test = np.random.randn(784, 10)
output = network.predict(X_test)
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {output.shape}")
print(f"Output range: [{output.min():.3f}, {output.max():.3f}]")
Key Components:
- __init__: Sets up network architecture and initializes weights
- forward: Performs forward propagation through all layers
- _activate: Applies activation function (vectorized)
- predict: Wrapper for making predictions
- z_values: Stores pre-activations (needed for backpropagation)
Key Takeaways
- Forward propagation computes predictions by passing data through layers
- Matrix operations enable efficient batch processing
- Weight initialization is critical for successful training
- Layer-by-layer computation transforms input into output
- Activation functions introduce non-linearity at each layer