Chapter 5: Convolutional Neural Networks (CNNs)
Specialized networks for image processing and spatial data
Learning Objectives
- Understand the convolution operation and its purpose
- Master pooling layers and their role in CNNs
- Learn how complete CNN architectures are put together
- Understand backpropagation in convolutional layers
- Implement a CNN from scratch
- Recognize when to use CNNs vs MLPs
What are Convolutional Neural Networks?
🖼️ Specialized for Images
Convolutional Neural Networks (CNNs) are specialized neural networks designed for processing grid-like data such as images. Unlike fully connected networks, CNNs use convolution operations to automatically learn spatial hierarchies of features.
Think of CNNs like a detective examining a crime scene photo:
- Traditional Networks: Like trying to understand the entire photo at once - overwhelming and inefficient
- CNNs: Like examining small patches with a magnifying glass, looking for specific patterns (edges, textures, shapes) that appear throughout the image
- Key Insight: The same pattern (like an edge or corner) can appear anywhere in an image, so we use the same "detective tool" (filter) everywhere
📚 Why CNNs for Images? The Fundamental Problem
Problem with MLPs (Multi-Layer Perceptrons) for images:
The Parameter Explosion Problem
Consider a simple 28×28 grayscale image (like MNIST digits):
- 28×28 image = 784 pixels = 784 input neurons
- Fully connected to 100 hidden neurons = 784 × 100 = 78,400 weights!
- Plus biases: 100 more parameters
- Total for just one layer: 78,500 parameters
For a color image (224×224×3):
- 224×224×3 = 150,528 input neurons
- Connected to 1000 hidden neurons = 150,528,000 weights!
- This is computationally prohibitive and makes severe overfitting almost inevitable (see the quick check below)
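To make these numbers concrete, here is a quick back-of-the-envelope check of the parameter counts discussed above (a minimal sketch; the layer sizes are the ones used in the text):

```python
# Parameter count of a single fully connected layer: weights (+ biases)
mnist_params = 28 * 28 * 100 + 100       # 28x28 grayscale image, 100 hidden neurons
imagenet_params = 224 * 224 * 3 * 1000   # 224x224x3 color image, 1000 hidden neurons (weights only)

print(mnist_params)      # 78500
print(imagenet_params)   # 150528000
```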
The Spatial Relationship Problem
MLPs treat images as flat lists of numbers:
- They have no understanding that pixel (5,10) is next to pixel (6,10)
- A pattern in the top-left corner requires completely different weights than the same pattern in the bottom-right
- This means the network must relearn the same pattern for every possible position
- Like teaching someone to recognize a cat, but they have to learn it separately for every possible location in a photo!
The CNN Solution: Three Key Innovations
1. Local Receptive Fields (Convolution):
- Instead of connecting every pixel to every neuron, we use small filters (e.g., 3×3 or 5×5)
- Each filter scans the entire image, looking for the same pattern everywhere
- Like using the same magnifying glass to look for fingerprints throughout the crime scene
- Result: One filter can detect edges, corners, or textures anywhere in the image
2. Parameter Sharing:
- The same filter weights are used at every position in the image
- If a 3×3 filter has 9 weights, those same 9 weights work everywhere
- Example: Instead of 78,400 weights, we might have 32 filters × 9 weights = 288 weights!
- This is a massive reduction in parameters - the sketch below makes the comparison concrete
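A quick comparison of the numbers from this list (a small sketch; 32 filters of size 3×3 is just the example configuration used above):

```python
fully_connected = 28 * 28 * 100   # 78,400 weights: every pixel connected to every hidden neuron
convolutional = 32 * 3 * 3        # 288 weights: 32 shared 3x3 filters

print(fully_connected, convolutional)                              # 78400 288
print(f"{fully_connected / convolutional:.0f}x fewer parameters")  # ~272x fewer parameters
```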
3. Hierarchical Feature Learning:
- Early layers: Learn simple features (edges, corners, lines)
- Middle layers: Combine simple features into shapes (circles, rectangles, curves)
- Deep layers: Combine shapes into complex objects (faces, cars, buildings)
- Like building a pyramid: start with small blocks (edges), build into larger structures (shapes), then complete objects
Real-World Analogy: The CNN Detective
Imagine a detective analyzing a crime scene photo:
- Traditional Network: Tries to understand the entire photo at once - too much information, misses details
- CNN Approach:
- Step 1: Uses a small magnifying glass (3×3 filter) to scan the entire photo, looking for specific patterns (edges, textures)
- Step 2: Combines findings from multiple scans to identify larger patterns (shapes, objects)
- Step 3: Combines these larger patterns to understand the scene (people, objects, relationships)
- Key Advantage: The same magnifying glass (filter) works everywhere - if it finds an edge in the top-left, it can find the same type of edge in the bottom-right using the same tool
Mathematical Intuition
Why convolution works for images:
Convolution as Pattern Matching
Convolution is essentially pattern matching:
- We slide a small template (filter) across the image
- At each position, we compute how well the template matches that region
- High match = strong activation = "this pattern is here!"
- Low match = weak activation = "this pattern is not here"
This is exactly what our visual system does:
- Our eyes have edge detectors (like certain CNN filters)
- These detectors work the same way regardless of where we look
- CNNs mimic this biological process!
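Here is a tiny one-dimensional sketch of this pattern-matching view (the signal and template values are made up for illustration): sliding a template across a signal and taking dot products produces the largest scores exactly where the pattern occurs.

```python
import numpy as np

signal = np.array([0, 0, 1, 2, 1, 0, 0, 1, 2, 1, 0])   # the pattern [1, 2, 1] appears twice
template = np.array([1, 2, 1])                          # the "filter" we slide across the signal

# Match score (dot product) at every offset
scores = [np.dot(signal[i:i + 3], template) for i in range(len(signal) - 2)]
print(scores)   # [1, 4, 6, 4, 1, 1, 4, 6, 4] - peaks of 6 exactly where the pattern sits
```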
The Convolution Operation
🔍 Sliding Window Filter
Convolution is a mathematical operation that applies a filter (kernel) to an input image by sliding it across the image and computing element-wise products.
Think of convolution like using a stencil or template:
- Image: A large piece of paper with a pattern
- Filter/Kernel: A small transparent stencil with a specific pattern
- Process: Place the stencil at different positions, see how well it matches, record the match score
- Result: A new "map" showing where the pattern appears in the original image
Real-world analogy: Like using a cookie cutter (filter) on dough (image) - you press it down at different positions to find where the pattern matches best!
Why Convolution Works: The Intuition
Convolution works because images have local patterns that repeat:
- Edges: The transition from dark to light appears many times in an image
- Textures: Patterns like wood grain, fabric weave, or brick patterns repeat
- Shapes: Corners, curves, and lines appear in different locations
Key Insight: Instead of learning to detect an edge at position (10, 20) separately from an edge at position (50, 100), we learn ONE edge detector that works everywhere!
Convolution Formula
For a 2D convolution (as implemented in deep learning, i.e. cross-correlation, without flipping the kernel):
\[ (I * K)[i, j] = \sum_{m} \sum_{n} I[i + m,\; j + n] \, K[m, n] \]
Notation:
- I: Input image (matrix)
- K: Kernel/filter (small matrix, e.g., 3×3)
- *: Convolution operator
- [i, j]: Output position
- m, n: Kernel indices
Detailed Step-by-Step Example with Visual Diagram
Let's work through a complete convolution example:
Step 1: Our Input Image (5×5)
📊 Input Image Visualization
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 |
This represents a simple gradient image where values increase from top-left to bottom-right.
Step 2: Our Filter/Kernel (3×3) - Edge Enhancement
🔍 Filter/Kernel Visualization
| 0 | -1 | 0 |
| -1 | 5 | -1 |
| 0 | -1 | 0 |
What this filter does:
- Center value (5) - amplifies the center pixel
- Surrounding values (-1) - subtracts neighboring pixels
- Creates a sharpening effect - makes edges more pronounced
- If center is much brighter than neighbors → high output (edge detected!)
Step 3: Convolution at Position (1,1) - Visual Convolution Operation
We place the filter at the top-left corner and slide it across:
🔄 Convolution Operation Visualization
Image Region
| 1 | 2 | 3 |
| 6 | 7 | 8 |
| 11 | 12 | 13 |
Filter/Kernel
| 0 | -1 | 0 |
| -1 | 5 | -1 |
| 0 | -1 | 0 |
Element-wise Product
| 0 | -2 | 0 |
| -6 | 35 | -8 |
| 0 | -12 | 0 |
Sum all values:
Output[1,1] = 0 + (-2) + 0 + (-6) + 35 + (-8) + 0 + (-12) + 0 = 7
The center pixel (7) is multiplied by 5 and its four neighbors (2, 6, 8, 12) are subtracted → Result: 35 - 28 = 7 (the pixel value is preserved)
💡 Key Insight: The filter slides across the entire image, computing this operation at every position. This creates a feature map showing where edges (or other patterns) appear!
Output[1,1] = 0 - 2 - 6 + 35 - 8 - 12 = 7
Interpretation: In this smooth gradient the center pixel (7) is exactly the average of its four neighbors, so the sharpening filter returns the original value unchanged - there is no edge to enhance at this position.
Step 4: Convolution at Position (2,2) - Same Filter, New Location
Now let's apply the same filter, this time centered on the value 13:
Image region:
| 7 | 8 | 9 |
| 12 | 13 | 14 |
| 17 | 18 | 19 |
Element-wise multiplication:
| 0 | -8 | 0 |
| -12 | 65 | -14 |
| 0 | -18 | 0 |
Sum:
Output[2,2] = 0 - 8 - 12 + 65 - 14 - 18 = 13
Interpretation: Again the output equals the center pixel (13), because this image is a perfectly linear gradient. Only where the intensity changes abruptly - a genuine edge - would the filter produce a value noticeably different from the original pixel.
🎯 Key Takeaways from This Example
- Convolution measures local patterns: It compares each pixel to its neighbors
- High output = strong pattern match: When the filter pattern matches the image region, output is high
- Low output = weak pattern match: When the pattern doesn't match, output is low
- Same filter, different positions: We use the SAME filter everywhere, but get different outputs based on what's in the image
Convolution Implementation
```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """
    2D Convolution operation

    Parameters:
        image: Input image (H, W)
        kernel: Filter/kernel (K_h, K_w)
        stride: Step size for sliding
        padding: Zero padding size
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')

    # Get dimensions
    img_h, img_w = image.shape
    kernel_h, kernel_w = kernel.shape

    # Calculate output dimensions
    out_h = (img_h - kernel_h) // stride + 1
    out_w = (img_w - kernel_w) // stride + 1

    # Initialize output
    output = np.zeros((out_h, out_w))

    # Perform convolution
    for i in range(out_h):
        for j in range(out_w):
            # Extract region
            region = image[i*stride:i*stride+kernel_h,
                           j*stride:j*stride+kernel_w]
            # Element-wise multiplication and sum
            output[i, j] = np.sum(region * kernel)

    return output

# Example: Edge detection
image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12],
                  [13, 14, 15, 16]])

# Vertical edge detector
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

result = convolve2d(image, kernel)
print("Convolution result:\n", result)
```
Pooling Layers
📉 Downsampling Operation
Pooling layers reduce the spatial dimensions of feature maps, making the network more efficient and providing translation invariance. Common types include max pooling and average pooling.
Think of pooling like creating a summary or thumbnail:
- Before pooling: You have a detailed 1000×1000 pixel photo with every detail
- After pooling: You create a 500×500 thumbnail that captures the most important information
- Key benefit: The thumbnail is much smaller (faster to process) but still contains the essential features
- Real-world analogy: Like creating a summary of a long document - you keep the most important points, discard the details
Why Do We Need Pooling?
Three critical reasons:
1. Computational Efficiency
Problem: After convolution, feature maps can be very large
- Input: 224×224 image
- After first conv layer: 224×224×64 (64 feature maps!)
- This is 3.2 million values to process
- Without pooling, every subsequent layer would have to process feature maps at this full resolution
Solution: Max pooling reduces 224×224 to 112×112
- Now we have about 800,000 values (112×112×64) - a 4× reduction (checked below)!
- Much faster computation
- Less memory required
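A quick sanity check of these sizes (just arithmetic, no actual tensors involved):

```python
before = 224 * 224 * 64   # values after the first conv layer
after = 112 * 112 * 64    # values after 2x2 max pooling

print(before, after, before // after)   # 3211264 802816 4
```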
2. Translation Invariance
Problem: An object might appear at slightly different positions
- A cat's face might be at pixel (100, 150) in one image
- Same cat's face at pixel (102, 152) in another image
- Without pooling, these are treated as completely different!
Solution: Max pooling makes the network less sensitive to small shifts
- If an edge is detected anywhere in a 2×2 region, max pooling keeps it
- The network learns: "edge detected in this general area" not "edge at exact pixel (100, 150)"
- This makes the network more robust to object position
3. Feature Abstraction
Problem: Early layers detect very specific, local features
- Layer 1 might detect: "vertical edge at position (50, 75)"
- Layer 1 might detect: "vertical edge at position (50, 76)"
- These are essentially the same feature, but treated separately
Solution: Pooling combines nearby detections
- After pooling: "vertical edge in this region"
- This abstraction helps later layers build more complex features
- Like zooming out to see the bigger picture
Max Pooling
Takes the maximum value in each pooling window. For an input feature map \(X\) and output \(Y\):
\[ Y[i, j] = \max_{0 \le m, n < p} X[i \cdot s + m,\; j \cdot s + n] \]
Where \(s\) is the stride and \(p\) is the pool size
Why Max Pooling?
- Reduces computation: Smaller feature maps
- Translation invariance: Detects features regardless of exact position
- Preserves strongest activations: Keeps most important features
Detailed Max Pooling Example
Let's work through a complete max pooling operation:
Step 1: Our Input Feature Map (4×4)
This represents activations from a convolutional layer:
| 1 | 3 | 2 | 4 |
| 5 | 7 | 6 | 8 |
| 9 | 11 | 10 | 12 |
| 13 | 15 | 14 | 16 |
Interpretation: Each value represents how strongly a feature (like an edge) was detected at that location. Higher values = stronger detection.
Step 2: Max Pooling with 2×2 Window, Stride=2
We divide the 4×4 map into non-overlapping 2×2 regions:
Region 1 (Top-Left):
| 1 | 3 |
| 5 | 7 |
Max value: max(1, 3, 5, 7) = 7
Meaning: The strongest feature activation in this region is 7. We keep this value and discard the others (1, 3, 5).
Region 2 (Top-Right):
| 2 | 4 |
| 6 | 8 |
Max value: max(2, 4, 6, 8) = 8
Region 3 (Bottom-Left):
| 9 | 11 |
| 13 | 15 |
Max value: max(9, 11, 13, 15) = 15
Region 4 (Bottom-Right):
| 10 | 12 |
| 14 | 16 |
Max value: max(10, 12, 14, 16) = 16
Step 3: Final Output (2×2)
| 7 | 8 |
| 15 | 16 |
Key observations:
- Size reduction: 4×4 → 2×2 (75% reduction in values!)
- Information preserved: We kept the strongest activations from each region
- Translation robustness: If the feature was at position (0,0) or (0,1), we still detect it in the top-left region
🎯 Why Max Pooling vs Average Pooling?
Max Pooling (what we just did):
- Keeps the strongest signal: "Was this feature detected strongly anywhere in this region?"
- Better for: Detecting presence of features (edges, textures, objects)
- Analogy: "Did anyone in this group see a cat?" → If one person saw it clearly, the answer is yes!
Average Pooling (alternative):
- Takes the average: "What's the average strength of features in this region?"
- Better for: Smoothing and reducing noise
- Analogy: "What's the average opinion of this group?" → Takes everyone's view into account
In practice: Max pooling is more commonly used because it preserves the strongest activations, which are most informative for detecting features.
Pooling Implementation
```python
import numpy as np

def max_pooling(feature_map, pool_size=2, stride=2):
    """
    Max pooling operation

    Parameters:
        feature_map: Input feature map (H, W)
        pool_size: Size of pooling window
        stride: Step size
    """
    h, w = feature_map.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1

    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i*stride:i*stride+pool_size,
                                 j*stride:j*stride+pool_size]
            output[i, j] = np.max(region)

    return output

# Example
feature_map = np.array([[1, 3, 2, 4],
                        [5, 7, 6, 8],
                        [9, 11, 10, 12],
                        [13, 15, 14, 16]])

pooled = max_pooling(feature_map, pool_size=2, stride=2)
print("Pooled result:\n", pooled)
```
Complete CNN Architecture
🏗️ Typical CNN Structure
A typical CNN consists of alternating convolutional and pooling layers, followed by fully connected layers for classification.
Standard Architecture:
- Convolutional Layers: Detect features (edges, textures, patterns)
- Pooling Layers: Reduce spatial dimensions
- More Conv+Pool: Learn higher-level features
- Flatten: Convert 2D to 1D
- Fully Connected: Final classification
Example: LeNet-5 Architecture
For 32×32 grayscale images:
- Input: 32×32×1
- Conv1: 6 filters, 5×5 → 28×28×6
- Pool1: 2×2 max pool → 14×14×6
- Conv2: 16 filters, 5×5 → 10×10×16
- Pool2: 2×2 max pool → 5×5×16
- Flatten: 400 neurons
- FC1: 120 neurons
- FC2: 84 neurons
- Output: 10 classes
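These shapes can be double-checked with the standard output-size formula, out = (in + 2·padding - kernel) // stride + 1. A small sketch (the helper name conv_out is just for illustration):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer"""
    return (size + 2 * padding - kernel) // stride + 1

s = 32                                              # input: 32x32
s = conv_out(s, 5);           print("Conv1:", s)    # 28 -> 28x28x6
s = conv_out(s, 2, stride=2); print("Pool1:", s)    # 14 -> 14x14x6
s = conv_out(s, 5);           print("Conv2:", s)    # 10 -> 10x10x16
s = conv_out(s, 2, stride=2); print("Pool2:", s)    # 5  -> 5x5x16
print("Flatten:", s * s * 16)                       # 400
```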
Simple CNN Implementation
```python
import numpy as np

class SimpleCNN:
    """Simple Convolutional Neural Network"""

    def __init__(self):
        # Convolutional layer: 1 input channel, 8 output channels, 3x3 kernel
        self.conv_weights = np.random.randn(8, 1, 3, 3) * 0.1
        self.conv_bias = np.zeros(8)
        # Fully connected layer (conv without padding: 28 -> 26, then 2x2 pool: 26 -> 13)
        self.fc_weights = np.random.randn(128, 8 * 13 * 13) * 0.1
        self.fc_bias = np.zeros(128)
        # Output layer
        self.output_weights = np.random.randn(10, 128) * 0.1
        self.output_bias = np.zeros(10)

    def relu(self, x):
        return np.maximum(0, x)

    def forward(self, x):
        """Forward pass through CNN"""
        # x shape: (batch, 1, 28, 28)
        batch_size = x.shape[0]

        # Convolution + ReLU
        conv_out = self.conv2d(x, self.conv_weights, self.conv_bias)
        conv_out = self.relu(conv_out)

        # Max pooling (2x2)
        pooled = self.max_pool2d(conv_out, 2)

        # Flatten
        flattened = pooled.reshape(batch_size, -1)

        # Fully connected + ReLU
        fc_out = np.dot(flattened, self.fc_weights.T) + self.fc_bias
        fc_out = self.relu(fc_out)

        # Output layer
        output = np.dot(fc_out, self.output_weights.T) + self.output_bias
        return output

    def conv2d(self, x, weights, bias):
        """2D convolution (stride 1, no padding) - plain loops for clarity.
        In practice, use an optimized library implementation."""
        batch, in_c, h, w = x.shape
        out_c, _, kh, kw = weights.shape
        out = np.zeros((batch, out_c, h - kh + 1, w - kw + 1))
        for b in range(batch):
            for oc in range(out_c):
                for i in range(h - kh + 1):
                    for j in range(w - kw + 1):
                        region = x[b, :, i:i+kh, j:j+kw]
                        out[b, oc, i, j] = np.sum(region * weights[oc]) + bias[oc]
        return out

    def max_pool2d(self, x, pool_size):
        """2D max pooling with stride equal to the pool size"""
        batch, c, h, w = x.shape
        out = np.zeros((batch, c, h // pool_size, w // pool_size))
        for i in range(h // pool_size):
            for j in range(w // pool_size):
                window = x[:, :, i*pool_size:(i+1)*pool_size,
                                 j*pool_size:(j+1)*pool_size]
                out[:, :, i, j] = window.max(axis=(2, 3))
        return out

# Usage
cnn = SimpleCNN()
x = np.random.randn(2, 1, 28, 28)   # a batch of two 28x28 grayscale images
logits = cnn.forward(x)
print(logits.shape)                 # (2, 10)
```
Backpropagation in CNNs
Gradient Flow Through Convolution
Backpropagation in CNNs follows the same principles as regular networks, but gradients must be properly distributed through convolution and pooling operations.
Convolution Gradient
For a convolutional layer with input \(I\), kernel \(K\), and output \(O = I * K\), the chain rule gives:
\[ \frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O} \qquad\qquad \frac{\partial L}{\partial I} = \frac{\partial L}{\partial O} \,*_{\text{full}}\, \operatorname{rot180}(K) \]
First equation: the output gradient is convolved with the input. Second equation: the output gradient is convolved, with full padding, with the 180°-flipped kernel.
Key Insight:
- Gradient w.r.t. kernel: Convolve input with output gradient
- Gradient w.r.t. input: Convolve the output gradient with the 180°-flipped kernel (using full padding)
- This is the reverse of the forward convolution - the small gradient check below verifies the first identity numerically
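The kernel-gradient identity can be verified with a small finite-difference check (a self-contained sketch; corr2d below is a minimal valid cross-correlation, and dL_dO stands in for an arbitrary upstream gradient):

```python
import numpy as np

def corr2d(x, k):
    """Valid 2D cross-correlation (the 'convolution' used in CNN forward passes)"""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((5, 5))       # input
K = rng.standard_normal((3, 3))       # kernel
dL_dO = rng.standard_normal((3, 3))   # arbitrary upstream gradient, same shape as the output

# Analytic gradient: correlate the input with the output gradient
grad_K = corr2d(I, dL_dO)

# Numerical gradient via central finite differences on L = sum(O * dL_dO)
eps = 1e-6
numeric = np.zeros_like(K)
for m in range(3):
    for n in range(3):
        Kp, Km = K.copy(), K.copy()
        Kp[m, n] += eps
        Km[m, n] -= eps
        numeric[m, n] = np.sum((corr2d(I, Kp) - corr2d(I, Km)) * dL_dO) / (2 * eps)

print(np.allclose(grad_K, numeric, atol=1e-5))   # True
```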
Pooling Gradient
For max pooling:
Max Pooling Backprop:
- Gradient goes only to the position that had the maximum value
- Other positions receive zero gradient
- Strictly speaking, max pooling is not differentiable where two inputs tie for the maximum, but in practice routing the gradient to the arg-max position works fine (see the sketch below)
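Here is a minimal sketch of that routing rule (assuming non-overlapping windows, i.e. stride equal to the pool size; the function name is illustrative):

```python
import numpy as np

def max_pool_backward(x, grad_out, pool=2):
    """Send each output gradient to the position that held the window's maximum"""
    grad_in = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            window = x[i*pool:(i+1)*pool, j*pool:(j+1)*pool]
            m, n = np.unravel_index(np.argmax(window), window.shape)
            grad_in[i*pool + m, j*pool + n] = grad_out[i, j]
    return grad_in

x = np.array([[ 1.,  3.,  2.,  4.],
              [ 5.,  7.,  6.,  8.],
              [ 9., 11., 10., 12.],
              [13., 15., 14., 16.]])
grad_out = np.ones((2, 2))   # pretend upstream gradient from the next layer

print(max_pool_backward(x, grad_out))
# Gradient appears only at the max of each 2x2 window (the positions of 7, 8, 15 and 16)
```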
Complete CNN Implementation
Full CNN with PyTorch-style Structure
```python
import numpy as np

class ConvLayer:
    """Convolutional Layer"""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        # Initialize weights (He initialization)
        self.weights = np.random.randn(out_channels, in_channels,
                                       kernel_size, kernel_size) * np.sqrt(2.0 / (in_channels * kernel_size * kernel_size))
        self.bias = np.zeros(out_channels)

    def forward(self, x):
        """Forward pass"""
        # x: (batch, in_channels, H, W)
        batch_size, _, h, w = x.shape

        # Calculate output dimensions
        out_h = (h + 2*self.padding - self.kernel_size) // self.stride + 1
        out_w = (w + 2*self.padding - self.kernel_size) // self.stride + 1

        # Add padding
        if self.padding > 0:
            x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding),
                           (self.padding, self.padding)), mode='constant')

        # Initialize output
        output = np.zeros((batch_size, self.out_channels, out_h, out_w))

        # Perform convolution for each output channel
        for b in range(batch_size):
            for oc in range(self.out_channels):
                for ic in range(self.in_channels):
                    for i in range(out_h):
                        for j in range(out_w):
                            region = x[b, ic, i*self.stride:i*self.stride+self.kernel_size,
                                              j*self.stride:j*self.stride+self.kernel_size]
                            output[b, oc, i, j] += np.sum(region * self.weights[oc, ic])
                output[b, oc] += self.bias[oc]

        return output


class MaxPoolLayer:
    """Max Pooling Layer"""

    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride

    def forward(self, x):
        """Forward pass"""
        batch_size, channels, h, w = x.shape
        out_h = (h - self.pool_size) // self.stride + 1
        out_w = (w - self.pool_size) // self.stride + 1

        output = np.zeros((batch_size, channels, out_h, out_w))

        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_h):
                    for j in range(out_w):
                        region = x[b, c, i*self.stride:i*self.stride+self.pool_size,
                                         j*self.stride:j*self.stride+self.pool_size]
                        output[b, c, i, j] = np.max(region)

        return output


# Example CNN
class SimpleCNN:
    def __init__(self):
        self.conv1 = ConvLayer(1, 8, 3, stride=1, padding=1)
        self.pool1 = MaxPoolLayer(2, stride=2)
        self.conv2 = ConvLayer(8, 16, 3, stride=1, padding=1)
        self.pool2 = MaxPoolLayer(2, stride=2)

    def forward(self, x):
        x = self.conv1.forward(x)
        x = np.maximum(0, x)  # ReLU
        x = self.pool1.forward(x)
        x = self.conv2.forward(x)
        x = np.maximum(0, x)  # ReLU
        x = self.pool2.forward(x)
        return x

# Usage
cnn = SimpleCNN()
# Input: batch of 1, 1 channel, 28x28 images
x = np.random.randn(1, 1, 28, 28)
output = cnn.forward(x)
print(f"Output shape: {output.shape}")  # (1, 16, 7, 7)
```