Chapter 5: Convolutional Neural Networks (CNNs)
Specialized networks for image processing and spatial data
Learning Objectives
- Understand the convolution operation and its purpose
- Master pooling layers and their role in CNNs
- Learn how complete CNN architectures are put together
- Understand backpropagation in convolutional layers
- Implement a CNN from scratch
- Recognize when to use CNNs vs MLPs
What are Convolutional Neural Networks?
🖼️ Specialized for Images
Convolutional Neural Networks (CNNs) are specialized neural networks designed for processing grid-like data such as images. Unlike fully connected networks, CNNs use convolution operations to automatically learn spatial hierarchies of features.
Think of CNNs like a detective examining a crime scene photo:
- Traditional Networks: Like trying to understand the entire photo at once - overwhelming and inefficient
- CNNs: Like examining small patches with a magnifying glass, looking for specific patterns (edges, textures, shapes) that appear throughout the image
- Key Insight: The same pattern (like an edge or corner) can appear anywhere in an image, so we use the same "detective tool" (filter) everywhere
📚 Why CNNs for Images? The Fundamental Problem
Problem with MLPs (Multi-Layer Perceptrons) for images:
The Parameter Explosion Problem
Consider a simple 28×28 grayscale image (like MNIST digits):
- 28×28 image = 784 pixels = 784 input neurons
- Fully connected to 100 hidden neurons = 784 × 100 = 78,400 weights!
- Plus biases: 100 more parameters
- Total for just one layer: 78,500 parameters
For a color image (224×224×3):
- 224×224×3 = 150,528 input neurons
- Connected to 1000 hidden neurons = 150,528,000 weights!
- This is computationally prohibitive and makes severe overfitting almost inevitable (see the quick check below)
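To make these numbers concrete, here is a quick back-of-the-envelope check of the parameter counts discussed above (a minimal sketch; the layer sizes are the ones used in the text):

```python
# Parameter count of a single fully connected layer: weights (+ biases)
mnist_params = 28 * 28 * 100 + 100       # 28x28 grayscale image, 100 hidden neurons
imagenet_params = 224 * 224 * 3 * 1000   # 224x224x3 color image, 1000 hidden neurons (weights only)

print(mnist_params)      # 78500
print(imagenet_params)   # 150528000
```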
The Spatial Relationship Problem
MLPs treat images as flat lists of numbers:
- They have no understanding that pixel (5,10) is next to pixel (6,10)
- A pattern in the top-left corner requires completely different weights than the same pattern in the bottom-right
- This means the network must relearn the same pattern for every possible position
- Like teaching someone to recognize a cat, but they have to learn it separately for every possible location in a photo!
The CNN Solution: Three Key Innovations
1. Local Receptive Fields (Convolution):
- Instead of connecting every pixel to every neuron, we use small filters (e.g., 3×3 or 5×5)
- Each filter scans the entire image, looking for the same pattern everywhere
- Like using the same magnifying glass to look for fingerprints throughout the crime scene
- Result: One filter can detect edges, corners, or textures anywhere in the image
2. Parameter Sharing:
- The same filter weights are used at every position in the image
- If a 3×3 filter has 9 weights, those same 9 weights work everywhere
- Example: Instead of 78,400 weights, we might have 32 filters × 9 weights = 288 weights!
- This is a massive reduction in parameters - the sketch below makes the comparison concrete
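A quick comparison of the numbers from this list (a small sketch; 32 filters of size 3×3 is just the example configuration used above):

```python
fully_connected = 28 * 28 * 100   # 78,400 weights: every pixel connected to every hidden neuron
convolutional = 32 * 3 * 3        # 288 weights: 32 shared 3x3 filters

print(fully_connected, convolutional)                              # 78400 288
print(f"{fully_connected / convolutional:.0f}x fewer parameters")  # ~272x fewer parameters
```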
3. Hierarchical Feature Learning:
- Early layers: Learn simple features (edges, corners, lines)
- Middle layers: Combine simple features into shapes (circles, rectangles, curves)
- Deep layers: Combine shapes into complex objects (faces, cars, buildings)
- Like building a pyramid: start with small blocks (edges), build into larger structures (shapes), then complete objects
Real-World Analogy: The CNN Detective
Imagine a detective analyzing a crime scene photo:
- Traditional Network: Tries to understand the entire photo at once - too much information, misses details
- CNN Approach:
- Step 1: Uses a small magnifying glass (3×3 filter) to scan the entire photo, looking for specific patterns (edges, textures)
- Step 2: Combines findings from multiple scans to identify larger patterns (shapes, objects)
- Step 3: Combines these larger patterns to understand the scene (people, objects, relationships)
- Key Advantage: The same magnifying glass (filter) works everywhere - if it finds an edge in the top-left, it can find the same type of edge in the bottom-right using the same tool
Mathematical Intuition
Why convolution works for images:
Convolution as Pattern Matching
Convolution is essentially pattern matching:
- We slide a small template (filter) across the image
- At each position, we compute how well the template matches that region
- High match = strong activation = "this pattern is here!"
- Low match = weak activation = "this pattern is not here"
This is exactly what our visual system does:
- Our eyes have edge detectors (like certain CNN filters)
- These detectors work the same way regardless of where we look
- CNNs mimic this biological process!
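Here is a tiny one-dimensional sketch of this pattern-matching view (the signal and template values are made up for illustration): sliding a template across a signal and taking dot products produces the largest scores exactly where the pattern occurs.

```python
import numpy as np

signal = np.array([0, 0, 1, 2, 1, 0, 0, 1, 2, 1, 0])   # the pattern [1, 2, 1] appears twice
template = np.array([1, 2, 1])                          # the "filter" we slide across the signal

# Match score (dot product) at every offset
scores = [np.dot(signal[i:i + 3], template) for i in range(len(signal) - 2)]
print(scores)   # [1, 4, 6, 4, 1, 1, 4, 6, 4] - peaks of 6 exactly where the pattern sits
```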
The Convolution Operation
🔍 Sliding Window Filter
Convolution is a mathematical operation that applies a filter (kernel) to an input image by sliding it across the image and computing element-wise products.
Think of convolution like using a stencil or template:
- Image: A large piece of paper with a pattern
- Filter/Kernel: A small transparent stencil with a specific pattern
- Process: Place the stencil at different positions, see how well it matches, record the match score
- Result: A new "map" showing where the pattern appears in the original image
Real-world analogy: Like using a cookie cutter (filter) on dough (image) - you press it down at different positions to find where the pattern matches best!
Why Convolution Works: The Intuition
Convolution works because images have local patterns that repeat:
- Edges: The transition from dark to light appears many times in an image
- Textures: Patterns like wood grain, fabric weave, or brick patterns repeat
- Shapes: Corners, curves, and lines appear in different locations
Key Insight: Instead of learning to detect an edge at position (10, 20) separately from an edge at position (50, 100), we learn ONE edge detector that works everywhere!
Convolution Formula
For a 2D convolution (as implemented in deep learning, i.e. cross-correlation, without flipping the kernel):
\[ (I * K)[i, j] = \sum_{m} \sum_{n} I[i + m,\; j + n] \, K[m, n] \]
Notation:
- I: Input image (matrix)
- K: Kernel/filter (small matrix, e.g., 3×3)
- *: Convolution operator
- [i, j]: Output position
- m, n: Kernel indices
Detailed Step-by-Step Example with Visual Diagram
Let's work through a complete convolution example:
Step 1: Our Input Image (5×5)
📊 Input Image Visualization
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 |
This represents a simple gradient image where values increase from top-left to bottom-right.
Step 2: Our Filter/Kernel (3×3) - Edge Enhancement
🔍 Filter/Kernel Visualization
| 0 | -1 | 0 |
| -1 | 5 | -1 |
| 0 | -1 | 0 |
What this filter does:
- Center value (5) - amplifies the center pixel
- Surrounding values (-1) - subtracts neighboring pixels
- Creates a sharpening effect - makes edges more pronounced
- If center is much brighter than neighbors → high output (edge detected!)
Step 3: Convolution at Position (1,1) - Visual Convolution Operation
We place the filter at the top-left corner and slide it across:
🔄 Convolution Operation Visualization
Image Region
| 1 | 2 | 3 |
| 6 | 7 | 8 |
| 11 | 12 | 13 |
Filter/Kernel
| 0 | -1 | 0 |
| -1 | 5 | -1 |
| 0 | -1 | 0 |
Element-wise Product
| 0 | -2 | 0 |
| -6 | 35 | -8 |
| 0 | -12 | 0 |
Sum all values:
Output[1,1] = 0 + (-2) + 0 + (-6) + 35 + (-8) + 0 + (-12) + 0 = 7
The center pixel (7) is multiplied by 5 and its four neighbors (2, 6, 8, 12) are subtracted → Result: 35 - 28 = 7 (the pixel value is preserved)
💡 Key Insight: The filter slides across the entire image, computing this operation at every position. This creates a feature map showing where edges (or other patterns) appear!
Output[1,1] = 0 - 2 - 6 + 35 - 8 - 12 = 7
Interpretation: In this smooth gradient the center pixel (7) is exactly the average of its four neighbors, so the sharpening filter returns the original value unchanged - there is no edge to enhance at this position.
Step 4: Convolution at Position (2,2) - Same Filter, New Location
Now let's apply the same filter, this time centered on the value 13:
Image region:
| 7 | 8 | 9 |
| 12 | 13 | 14 |
| 17 | 18 | 19 |
Element-wise multiplication:
| 0 | -8 | 0 |
| -12 | 65 | -14 |
| 0 | -18 | 0 |
Sum:
Output[2,2] = 0 - 8 - 12 + 65 - 14 - 18 = 13
Interpretation: Again the output equals the center pixel (13), because this image is a perfectly linear gradient. Only where the intensity changes abruptly - a genuine edge - would the filter produce a value noticeably different from the original pixel.
🎯 Key Takeaways from This Example
- Convolution measures local patterns: It compares each pixel to its neighbors
- High output = strong pattern match: When the filter pattern matches the image region, output is high
- Low output = weak pattern match: When the pattern doesn't match, output is low
- Same filter, different positions: We use the SAME filter everywhere, but get different outputs based on what's in the image
Convolution Implementation
```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """
    2D Convolution operation

    Parameters:
        image: Input image (H, W)
        kernel: Filter/kernel (K_h, K_w)
        stride: Step size for sliding
        padding: Zero padding size
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')

    # Get dimensions
    img_h, img_w = image.shape
    kernel_h, kernel_w = kernel.shape

    # Calculate output dimensions
    out_h = (img_h - kernel_h) // stride + 1
    out_w = (img_w - kernel_w) // stride + 1

    # Initialize output
    output = np.zeros((out_h, out_w))

    # Perform convolution
    for i in range(out_h):
        for j in range(out_w):
            # Extract region
            region = image[i*stride:i*stride+kernel_h,
                           j*stride:j*stride+kernel_w]
            # Element-wise multiplication and sum
            output[i, j] = np.sum(region * kernel)

    return output

# Example: Edge detection
image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

# Vertical edge detector
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

result = convolve2d(image, kernel)
print("Convolution result:\n", result)
```
Pooling Layers
📉 Downsampling Operation
Pooling layers reduce the spatial dimensions of feature maps, making the network more efficient and providing translation invariance. Common types include max pooling and average pooling.
Think of pooling like creating a summary or thumbnail:
- Before pooling: You have a detailed 1000×1000 pixel photo with every detail
- After pooling: You create a 500×500 thumbnail that captures the most important information
- Key benefit: The thumbnail is much smaller (faster to process) but still contains the essential features
- Real-world analogy: Like creating a summary of a long document - you keep the most important points, discard the details
Why Do We Need Pooling?
Three critical reasons:
1. Computational Efficiency
Problem: After convolution, feature maps can be very large
- Input: 224×224 image
- After first conv layer: 224×224×64 (64 feature maps!)
- This is 3.2 million values to process
- Without pooling, every subsequent layer would have to process feature maps at this full resolution
Solution: Max pooling reduces 224×224 to 112×112
- Now we have about 800,000 values (112×112×64) - a 4× reduction (checked below)!
- Much faster computation
- Less memory required
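A quick sanity check of these sizes (just arithmetic, no actual tensors involved):

```python
before = 224 * 224 * 64   # values after the first conv layer
after = 112 * 112 * 64    # values after 2x2 max pooling

print(before, after, before // after)   # 3211264 802816 4
```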
2. Translation Invariance
Problem: An object might appear at slightly different positions
- A cat's face might be at pixel (100, 150) in one image
- Same cat's face at pixel (102, 152) in another image
- Without pooling, these are treated as completely different!
Solution: Max pooling makes the network less sensitive to small shifts
- If an edge is detected anywhere in a 2×2 region, max pooling keeps it
- The network learns: "edge detected in this general area" not "edge at exact pixel (100, 150)"
- This makes the network more robust to object position
3. Feature Abstraction
Problem: Early layers detect very specific, local features
- Layer 1 might detect: "vertical edge at position (50, 75)"
- Layer 1 might detect: "vertical edge at position (50, 76)"
- These are essentially the same feature, but treated separately
Solution: Pooling combines nearby detections
- After pooling: "vertical edge in this region"
- This abstraction helps later layers build more complex features
- Like zooming out to see the bigger picture
Max Pooling
Takes the maximum value in each pooling window. For an input feature map \(X\) and output \(Y\):
\[ Y[i, j] = \max_{0 \le m, n < p} X[i \cdot s + m,\; j \cdot s + n] \]
Where \(s\) is the stride and \(p\) is the pool size
Why Max Pooling?
- Reduces computation: Smaller feature maps
- Translation invariance: Detects features regardless of exact position
- Preserves strongest activations: Keeps most important features
Detailed Max Pooling Example
Let's work through a complete max pooling operation:
Step 1: Our Input Feature Map (4×4)
This represents activations from a convolutional layer:
| 1 | 3 | 2 | 4 |
| 5 | 7 | 6 | 8 |
| 9 | 11 | 10 | 12 |
| 13 | 15 | 14 | 16 |
Interpretation: Each value represents how strongly a feature (like an edge) was detected at that location. Higher values = stronger detection.
Step 2: Max Pooling with 2×2 Window, Stride=2
We divide the 4×4 map into non-overlapping 2×2 regions:
Region 1 (Top-Left):
| 1 | 3 |
| 5 | 7 |
Max value: max(1, 3, 5, 7) = 7
Meaning: The strongest feature activation in this region is 7. We keep this value and discard the others (1, 3, 5).
Region 2 (Top-Right):
| 2 | 4 |
| 6 | 8 |
Max value: max(2, 4, 6, 8) = 8
Region 3 (Bottom-Left):
| 9 | 11 |
| 13 | 15 |
Max value: max(9, 11, 13, 15) = 15
Region 4 (Bottom-Right):
| 10 | 12 |
| 14 | 16 |
Max value: max(10, 12, 14, 16) = 16
Step 3: Final Output (2×2)
| 7 | 8 |
| 15 | 16 |
Key observations:
- Size reduction: 4×4 → 2×2 (75% reduction in values!)
- Information preserved: We kept the strongest activations from each region
- Translation robustness: If the feature was at position (0,0) or (0,1), we still detect it in the top-left region
🎯 Why Max Pooling vs Average Pooling?
Max Pooling (what we just did):
- Keeps the strongest signal: "Was this feature detected strongly anywhere in this region?"
- Better for: Detecting presence of features (edges, textures, objects)
- Analogy: "Did anyone in this group see a cat?" → If one person saw it clearly, the answer is yes!
Average Pooling (alternative):
- Takes the average: "What's the average strength of features in this region?"
- Better for: Smoothing and reducing noise
- Analogy: "What's the average opinion of this group?" → Takes everyone's view into account
In practice: Max pooling is more commonly used because it preserves the strongest activations, which are most informative for detecting features.
Pooling Implementation
```python
import numpy as np

def max_pooling(feature_map, pool_size=2, stride=2):
    """
    Max pooling operation

    Parameters:
        feature_map: Input feature map (H, W)
        pool_size: Size of pooling window
        stride: Step size
    """
    h, w = feature_map.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1

    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.max(region)

    return output

# Example
feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 11, 10, 12],
                        [13, 15, 14, 16]])

pooled = max_pooling(feature_map, pool_size=2, stride=2)
print("Pooled result:\n", pooled)
```
Complete CNN Architecture
🏗️ Typical CNN Structure
A typical CNN consists of alternating convolutional and pooling layers, followed by fully connected layers for classification.
Standard Architecture:
- Convolutional Layers: Detect features (edges, textures, patterns)
- Pooling Layers: Reduce spatial dimensions
- More Conv+Pool: Learn higher-level features
- Flatten: Convert 2D to 1D
- Fully Connected: Final classification
Example: LeNet-5 Architecture
For 32×32 grayscale images:
- Input: 32×32×1
- Conv1: 6 filters, 5×5 → 28×28×6
- Pool1: 2×2 max pool → 14×14×6
- Conv2: 16 filters, 5×5 → 10×10×16
- Pool2: 2×2 max pool → 5×5×16
- Flatten: 400 neurons
- FC1: 120 neurons
- FC2: 84 neurons
- Output: 10 classes
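These shapes can be double-checked with the standard output-size formula, out = (in + 2·padding - kernel) // stride + 1. A small sketch (the helper name conv_out is just for illustration):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer"""
    return (size + 2 * padding - kernel) // stride + 1

s = 32                                              # input: 32x32
s = conv_out(s, 5);           print("Conv1:", s)    # 28 -> 28x28x6
s = conv_out(s, 2, stride=2); print("Pool1:", s)    # 14 -> 14x14x6
s = conv_out(s, 5);           print("Conv2:", s)    # 10 -> 10x10x16
s = conv_out(s, 2, stride=2); print("Pool2:", s)    # 5  -> 5x5x16
print("Flatten:", s * s * 16)                       # 400
```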
Simple CNN Implementation
```python
import numpy as np

class SimpleCNN:
    """Simple Convolutional Neural Network"""

    def __init__(self):
        # Convolutional layer: 1 input channel, 8 output channels, 3x3 kernel
        self.conv_weights = np.random.randn(8, 1, 3, 3) * 0.1
        self.conv_bias = np.zeros(8)
        # Fully connected layer (conv without padding: 28 -> 26, then 2x2 pool: 26 -> 13)
        self.fc_weights = np.random.randn(128, 8 * 13 * 13) * 0.1
        self.fc_bias = np.zeros(128)
        # Output layer
        self.output_weights = np.random.randn(10, 128) * 0.1
        self.output_bias = np.zeros(10)

    def relu(self, x):
        return np.maximum(0, x)

    def forward(self, x):
        """Forward pass through CNN"""
        # x shape: (batch, 1, 28, 28)
        batch_size = x.shape[0]

        # Convolution + ReLU
        conv_out = self.conv2d(x, self.conv_weights, self.conv_bias)
        conv_out = self.relu(conv_out)

        # Max pooling (2x2)
        pooled = self.max_pool2d(conv_out, 2)

        # Flatten
        flattened = pooled.reshape(batch_size, -1)

        # Fully connected + ReLU
        fc_out = np.dot(flattened, self.fc_weights.T) + self.fc_bias
        fc_out = self.relu(fc_out)

        # Output layer
        output = np.dot(fc_out, self.output_weights.T) + self.output_bias
        return output

    def conv2d(self, x, weights, bias):
        """2D convolution (stride 1, no padding) - plain loops for clarity.
        In practice, use an optimized library implementation."""
        batch, in_c, h, w = x.shape
        out_c, _, kh, kw = weights.shape
        out = np.zeros((batch, out_c, h - kh + 1, w - kw + 1))
        for b in range(batch):
            for oc in range(out_c):
                for i in range(h - kh + 1):
                    for j in range(w - kw + 1):
                        region = x[b, :, i:i+kh, j:j+kw]
                        out[b, oc, i, j] = np.sum(region * weights[oc]) + bias[oc]
        return out

    def max_pool2d(self, x, pool_size):
        """2D max pooling with stride equal to the pool size"""
        batch, c, h, w = x.shape
        out = np.zeros((batch, c, h // pool_size, w // pool_size))
        for i in range(h // pool_size):
            for j in range(w // pool_size):
                window = x[:, :, i*pool_size:(i+1)*pool_size,
                                 j*pool_size:(j+1)*pool_size]
                out[:, :, i, j] = window.max(axis=(2, 3))
        return out

# Usage
cnn = SimpleCNN()
x = np.random.randn(2, 1, 28, 28)   # a batch of two 28x28 grayscale images
logits = cnn.forward(x)
print(logits.shape)                 # (2, 10)
```
Backpropagation in CNNs
Gradient Flow Through Convolution
Backpropagation in CNNs follows the same principles as regular networks, but gradients must be properly distributed through convolution and pooling operations.
Convolution Gradient
For a convolutional layer with input \(I\), kernel \(K\), and output \(O = I * K\), the chain rule gives:
\[ \frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O} \qquad\qquad \frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} \,*_{\text{full}}\, \operatorname{rot180}(K) \]
First equation: the output gradient is convolved with the input. Second equation: the output gradient is convolved, with full padding, with the 180°-flipped kernel.
Key Insight:
- Gradient w.r.t. kernel: Convolve input with output gradient
- Gradient w.r.t. input: Convolve the output gradient with the 180°-flipped kernel (using full padding)
- This is the reverse of the forward convolution - the small gradient check below verifies the first identity numerically
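The kernel-gradient identity can be verified with a small finite-difference check (a self-contained sketch; corr2d below is a minimal valid cross-correlation, and dL_dO stands in for an arbitrary upstream gradient):

```python
import numpy as np

def corr2d(x, k):
    """Valid 2D cross-correlation (the 'convolution' used in CNN forward passes)"""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))       # input
K = rng.standard_normal((3, 3))       # kernel
dL_dO = rng.standard_normal((3, 3))   # arbitrary upstream gradient, same shape as the output

# Analytic gradient: correlate the input with the output gradient
grad_K = corr2d(I, dL_dO)

# Numerical gradient via central finite differences on L = sum(O * dL_dO)
eps = 1e-6
numeric = np.zeros_like(K)
for m in range(3):
    for n in range(3):
        Kp, Km = K.copy(), K.copy()
        Kp[m, n] += eps
        Km[m, n] -= eps
        numeric[m, n] = np.sum((corr2d(I, Kp) - corr2d(I, Km)) * dL_dO) / (2 * eps)

print(np.allclose(grad_K, numeric, atol=1e-5))   # True
```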
Pooling Gradient
For max pooling:
Max Pooling Backprop:
- Gradient goes only to the position that had the maximum value
- Other positions receive zero gradient
- Strictly speaking, max pooling is not differentiable where two inputs tie for the maximum, but in practice routing the gradient to the arg-max position works fine (see the sketch below)
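Here is a minimal sketch of that routing rule (assuming non-overlapping windows, i.e. stride equal to the pool size; the function name is illustrative):

```python
import numpy as np

def max_pool_backward(x, grad_out, pool=2):
    """Send each output gradient to the position that held the window's maximum"""
    grad_in = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            window = x[i*pool:(i+1)*pool, j*pool:(j+1)*pool]
            m, n = np.unravel_index(np.argmax(window), window.shape)
            grad_in[i*pool + m, j*pool + n] = grad_out[i, j]
    return grad_in

x = np.array([[ 1.,  3.,  2.,  4.],
              [ 5.,  7.,  6.,  8.],
              [ 9., 11., 10., 12.],
              [13., 15., 14., 16.]])
grad_out = np.ones((2, 2))   # pretend upstream gradient from the next layer

print(max_pool_backward(x, grad_out))
# Gradient appears only at the max of each 2x2 window (the positions of 7, 8, 15 and 16)
```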
Complete CNN Implementation
Full CNN with PyTorch-style Structure
```python
import numpy as np

class ConvLayer:
    """Convolutional Layer"""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        # Initialize weights (He initialization)
        self.weights = np.random.randn(out_channels, in_channels,
                                       kernel_size, kernel_size) * np.sqrt(2.0 / (in_channels * kernel_size * kernel_size))
        self.bias = np.zeros(out_channels)

    def forward(self, x):
        """Forward pass"""
        # x: (batch, in_channels, H, W)
        batch_size, _, h, w = x.shape

        # Calculate output dimensions
        out_h = (h + 2*self.padding - self.kernel_size) // self.stride + 1
        out_w = (w + 2*self.padding - self.kernel_size) // self.stride + 1

        # Add padding
        if self.padding > 0:
            x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding),
                           (self.padding, self.padding)), mode='constant')

        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_h, out_w))

        # Perform convolution for each output channel
        for b in range(batch_size):
            for oc in range(self.out_channels):
                for ic in range(self.in_channels):
                    for i in range(out_h):
                        for j in range(out_w):
                            region = x[b, ic, i*self.stride:i*self.stride+self.kernel_size,
                                              j*self.stride:j*self.stride+self.kernel_size]
                            output[b, oc, i, j] += np.sum(region * self.weights[oc, ic])
                output[b, oc] += self.bias[oc]

        return output


class MaxPoolLayer:
    """Max Pooling Layer"""

    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride

    def forward(self, x):
        """Forward pass"""
        batch_size, channels, h, w = x.shape
        out_h = (h - self.pool_size) // self.stride + 1
        out_w = (w - self.pool_size) // self.stride + 1

        output = np.zeros((batch_size, channels, out_h, out_w))

        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        region = x[b, c, i*self.stride:i*self.stride+self.pool_size,
                                         j*self.stride:j*self.stride+self.pool_size]
                        output[b, c, i, j] = np.max(region)

        return output


# Example CNN
class SimpleCNN:
    def __init__(self):
        self.conv1 = ConvLayer(1, 8, 3, stride=1, padding=1)
        self.pool1 = MaxPoolLayer(2, stride=2)
        self.conv2 = ConvLayer(8, 16, 3, stride=1, padding=1)
        self.pool2 = MaxPoolLayer(2, stride=2)

    def forward(self, x):
        x = self.conv1.forward(x)
        x = np.maximum(0, x)  # ReLU
        x = self.pool1.forward(x)
        x = self.conv2.forward(x)
        x = np.maximum(0, x)  # ReLU
        x = self.pool2.forward(x)
        return x

# Usage
cnn = SimpleCNN()
# Input: batch of 1, 1 channel, 28x28 images
x = np.random.randn(1, 1, 28, 28)
output = cnn.forward(x)
print(f"Output shape: {output.shape}")  # (1, 16, 7, 7)
```