Chapter 6: Recurrent Neural Networks (RNNs)
Networks with memory for sequential data processing
Learning Objectives
- Understand why RNNs are needed for sequences
- Master the recurrence relation in RNNs
- Learn unrolling/unfolding RNNs through time
- Understand Backpropagation Through Time (BPTT)
- Recognize the vanishing gradient problem
- Implement a simple RNN from scratch
What are Recurrent Neural Networks?
Networks with Memory
Recurrent Neural Networks (RNNs) are designed to process sequential data by maintaining a hidden state that captures information about previous inputs. Unlike feedforward networks, RNNs have connections that form cycles, allowing information to persist.
Think of RNNs like reading a book:
- Feedforward Network: Like reading random sentences from different pages - no context, no understanding of the story
- RNN: Like reading page by page, remembering what happened before - you understand the plot because you remember previous chapters
- Hidden State: Like your memory of the story so far - it influences how you understand the current page
The Fundamental Problem with Feedforward Networks
Why can't we just use regular neural networks for sequences?
Problem 1: Fixed Input Size
Feedforward networks require fixed-size inputs:
- If your network expects 10 inputs, you MUST provide exactly 10
- But sequences vary in length: "Hello" (5 characters) vs "Hello, how are you?" (19 characters)
- Solution needed: Process one element at a time, regardless of sequence length
Problem 2: No Memory
Feedforward networks have no memory of previous inputs:
Example sentence: "The cat sat on the mat because it was tired."
- To understand "it", you need to remember "cat" from earlier
- Feedforward network: "What is 'it'?" → No idea, no memory of "cat"
- RNN: "What is 'it'?" → Remembers "cat" from hidden state → "it" refers to "cat"
Problem 3: Order Matters
In sequences, order is critical:
- "Dog bites man" vs "Man bites dog" - completely different meanings!
- Feedforward network: With a fixed bag-of-words style input, word order is discarded → can't tell the two apart
- RNN: Processes words in order → the difference is preserved
The RNN Solution: Three Key Innovations
1. Sequential Processing
RNNs process sequences one element at a time:
- Input: "The cat sat"
- Step 1: Process "The" → update hidden state
- Step 2: Process "cat" → update hidden state (remembers "The")
- Step 3: Process "sat" → update hidden state (remembers "The cat")
- Key: Each step builds on previous context
2. Hidden State (Memory)
The hidden state is the RNN's memory:
- It's a vector that encodes information from all previous inputs
- Like a summary of "what has happened so far"
- Updated at each time step to incorporate new information
- Analogy: Like a running summary of a conversation - you remember key points, not every word
3. Shared Weights
The same weights are used at every time step:
- Same transformation applied to each input
- Like using the same "reading comprehension" skill for every word
- Benefit: Far fewer parameters than having separate weights for each position
- Analogy: Like using the same grammar rules for every sentence, regardless of position
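To see how much sharing saves, here is a back-of-the-envelope count with made-up sizes (50-dimensional inputs, 100 hidden units, length-25 sequences; these numbers are assumptions, not from the text):

```python
# Parameter-count comparison for a toy setup (sizes are illustrative assumptions).
input_size, hidden_size, seq_len = 50, 100, 25

# Shared weights: one W_xh and one W_hh, reused at every time step.
shared = hidden_size * input_size + hidden_size * hidden_size

# Hypothetical per-position weights: a separate pair for each of the 25 steps.
per_position = seq_len * shared

print(shared)        # 15000
print(per_position)  # 375000
```

Sharing keeps the parameter count independent of sequence length, which is also what lets one network handle sequences of any length.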
📚 Real-World Analogy: The RNN as a Conversation
Imagine having a conversation:
- Feedforward Network: Like talking to someone with amnesia - they respond to each sentence independently, forgetting everything you said before
- RNN: Like a normal conversation - the person remembers what you discussed earlier and uses that context to understand your current statement
Example conversation:
- You: "I have a pet."
- RNN hidden state: [remembers: "pet mentioned"]
- You: "It's very fluffy."
- RNN: Uses hidden state → "It" must refer to the pet → understands correctly
- You: "It likes to sleep."
- RNN: Still remembers the pet → continues to understand "it"
🔄 RNN Sequential Flow Visualization
h₀ [0, 0, 0] ──"The"──→ h₁ [remembers "The"] ──"cat"──→ h₂ [remembers "The cat"] ──"sat"──→ ...
💡 Key: Each new hidden state (h₁, h₂, ...) accumulates information from all inputs seen so far. The arrows (──→) show information flowing forward through time!
The Recurrence Relation
The Heart of RNNs
The recurrence relation is what makes RNNs special - it connects the current state to previous states, creating memory.
Think of it like updating your understanding:
- Previous understanding (h_{t-1}): What you knew before
- New information (x_t): What you're learning now
- Updated understanding (h_t): Combines old and new information
- Key insight: Your current understanding depends on both what you knew before AND what you're learning now
RNN Forward Pass - The Complete Formula
At each time step t, the RNN performs two operations:
Breaking Down Step 1: Hidden State Update
h_t = tanh(W_hh × h_{t-1} + W_xh × x_t + b_h)
This formula has three parts:
- W_hh × h_{t-1}: "What to remember from the past"
- Transforms previous hidden state
- Decides which parts of memory to keep
- Like filtering your memory: "What's still relevant?"
- W_xh × x_t: "What to learn from the current input"
- Transforms current input
- Extracts relevant information
- Like taking notes: "What's important in this new information?"
- + b_h: Bias term (allows shifting the activation)
- tanh(...): Activation function (squashes values into the range (-1, 1))
In plain English: "My new understanding = combine (filtered memory from before + processed new information), then normalize it"
Breaking Down Step 2: Output Computation
y_t = W_hy × h_t + b_y
This is simpler:
- Take the updated hidden state
- Transform it to produce the desired output
- Like translating your understanding into an answer
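The two steps above fit in a few lines of NumPy. The sizes and random weights below are illustrative assumptions, not values from the text:

```python
import numpy as np

# One RNN time step with made-up small sizes (3 inputs, 4 hidden units, 2 outputs).
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3)) * 0.1   # input-to-hidden
W_hh = rng.standard_normal((4, 4)) * 0.1   # hidden-to-hidden
W_hy = rng.standard_normal((2, 4)) * 0.1   # hidden-to-output
b_h = np.zeros((4, 1))
b_y = np.zeros((2, 1))

h_prev = np.zeros((4, 1))                  # no memory yet
x_t = np.array([[0.2], [0.1], [0.8]])      # current input

# Step 1: hidden state update
h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
# Step 2: output computation
y_t = W_hy @ h_t + b_y

print(h_t.shape, y_t.shape)  # (4, 1) (2, 1)
```

Note that the output step has no activation here; in practice a softmax or other nonlinearity is usually applied on top of y_t depending on the task.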
Notation Reference:
- h_t: Hidden state at time t (the memory/understanding)
- h_{t-1}: Hidden state at previous time step (previous memory)
- x_t: Input at time t (new information)
- y_t: Output at time t (prediction/response)
- W_hh: Hidden-to-hidden weight matrix (how to update memory)
- W_xh: Input-to-hidden weight matrix (how to process new input)
- W_hy: Hidden-to-output weight matrix (how to generate output)
- b_h, b_y: Bias terms (allow shifting activations)
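A common stumbling block is keeping the matrix shapes straight. With column-vector conventions and assumed sizes (input_size=3, hidden_size=4, output_size=2, chosen for illustration), the shapes line up as:

```python
# Shape reference for the notation above, with assumed sizes.
input_size, hidden_size, output_size = 3, 4, 2

shapes = {
    "x_t":  (input_size, 1),             # input at time t
    "h_t":  (hidden_size, 1),            # hidden state at time t
    "y_t":  (output_size, 1),            # output at time t
    "W_xh": (hidden_size, input_size),   # maps x_t into hidden space
    "W_hh": (hidden_size, hidden_size),  # maps h_{t-1} into hidden space
    "W_hy": (output_size, hidden_size),  # maps h_t into output space
    "b_h":  (hidden_size, 1),
    "b_y":  (output_size, 1),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```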
Step-by-Step Example: Processing "The cat sat"
Let's trace through how an RNN processes this sentence (all vectors and numbers below are illustrative stand-ins, not exact computed values):
Time Step 1: Processing "The"
- Input x₁: "The" (encoded as a vector, e.g., [0.2, 0.1, 0.8, ...])
- Previous hidden state h₀: [0, 0, 0, ...] (initialized to zeros - no memory yet)
- Compute h₁:
- W_hh × h₀ = [0, 0, 0, ...] (no previous memory)
- W_xh × x₁ = [0.1, 0.3, 0.2, ...] (processed "The")
- h₁ = tanh([0, 0, 0, ...] + [0.1, 0.3, 0.2, ...] + b_h)
- h₁ = [0.1, 0.3, 0.2, ...] (memory now contains information about "The")
- Output y₁: W_hy × h₁ + b_y (might predict next word or classify)
Time Step 2: Processing "cat"
- Input x₂: "cat" (encoded as [0.5, 0.7, 0.1, ...])
- Previous hidden state h₁: [0.1, 0.3, 0.2, ...] (remembers "The")
- Compute h₂:
- W_hh × h₁ = [0.05, 0.15, 0.1, ...] (filtered memory of "The")
- W_xh × x₂ = [0.3, 0.4, 0.2, ...] (processed "cat")
- h₂ = tanh([0.05, 0.15, 0.1, ...] + [0.3, 0.4, 0.2, ...] + b_h)
- h₂ = [0.35, 0.55, 0.3, ...] (memory now contains "The cat")
- Key insight: h₂ combines information from both "The" and "cat"!
Time Step 3: Processing "sat"
- Input x₃: "sat" (encoded as [0.2, 0.3, 0.6, ...])
- Previous hidden state h₂: [0.35, 0.55, 0.3, ...] (remembers "The cat")
- Compute h₃:
- W_hh × h₂ = [0.18, 0.28, 0.15, ...] (filtered memory of "The cat")
- W_xh × x₃ = [0.1, 0.15, 0.3, ...] (processed "sat")
- h₃ = tanh([0.18, 0.28, 0.15, ...] + [0.1, 0.15, 0.3, ...] + b_h)
- h₃ = [0.28, 0.43, 0.45, ...] (memory now contains "The cat sat")
- Result: The hidden state has accumulated information from the entire sequence!
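The same three-step trace can be run in NumPy. The weights and word vectors below are made-up stand-ins for the numbers in the text:

```python
import numpy as np

# Illustrative weights and word encodings (all values are assumptions).
rng = np.random.default_rng(1)
W_xh = rng.standard_normal((3, 3)) * 0.5
W_hh = rng.standard_normal((3, 3)) * 0.5
b_h = np.zeros((3, 1))

inputs = {
    "The": np.array([[0.2], [0.1], [0.8]]),
    "cat": np.array([[0.5], [0.7], [0.1]]),
    "sat": np.array([[0.2], [0.3], [0.6]]),
}

h = np.zeros((3, 1))  # h0: no memory yet
for word, x in inputs.items():
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)  # recurrence relation
    print(word, h.ravel().round(2))          # hidden state after each word
```

After the loop, h depends on all three words: changing the first input vector would change the final state even though it was processed two steps earlier.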
🎯 Key Takeaways
- Information accumulates: Each time step adds to the memory
- Context builds: Later predictions can use information from much earlier
- Same weights: W_hh, W_xh, W_hy are used at every time step
- Variable length: Can process sequences of any length using the same process
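The "variable length" point is easy to verify: a single set of weights handles sequences of any length. A minimal sketch, with assumed sizes:

```python
import numpy as np

def rnn_hidden(sequence, W_xh, W_hh, b_h):
    """Run the recurrence over a sequence of any length; return the final hidden state."""
    h = np.zeros((W_hh.shape[0], 1))
    for x in sequence:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    return h

rng = np.random.default_rng(2)
W_xh = rng.standard_normal((4, 2)) * 0.1
W_hh = rng.standard_normal((4, 4)) * 0.1
b_h = np.zeros((4, 1))

short_seq = [rng.standard_normal((2, 1)) for _ in range(3)]
long_seq = [rng.standard_normal((2, 1)) for _ in range(10)]

# The same weights handle both lengths - no architectural change needed.
print(rnn_hidden(short_seq, W_xh, W_hh, b_h).shape)  # (4, 1)
print(rnn_hidden(long_seq, W_xh, W_hh, b_h).shape)   # (4, 1)
```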
Simple RNN Implementation
import numpy as np

class SimpleRNN:
    """Simple Recurrent Neural Network"""

    def __init__(self, input_size, hidden_size, output_size):
        # Weight matrices
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.1
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.1
        # Biases
        self.b_h = np.zeros((hidden_size, 1))
        self.b_y = np.zeros((output_size, 1))
        # Hidden state (initialized to zero)
        self.h = np.zeros((hidden_size, 1))

    def tanh(self, x):
        return np.tanh(x)

    def forward_step(self, x):
        """Process one time step"""
        # Update hidden state
        self.h = self.tanh(np.dot(self.W_xh, x) +
                           np.dot(self.W_hh, self.h) +
                           self.b_h)
        # Compute output
        y = np.dot(self.W_hy, self.h) + self.b_y
        return y, self.h

    def forward(self, sequence):
        """Process entire sequence"""
        outputs = []
        self.h = np.zeros((self.W_hh.shape[0], 1))  # Reset state
        for x in sequence:
            y, h = self.forward_step(x)
            outputs.append(y)
        return outputs

# Example: Process sequence
rnn = SimpleRNN(input_size=3, hidden_size=4, output_size=2)
sequence = [np.array([[1], [2], [3]]),
            np.array([[4], [5], [6]]),
            np.array([[7], [8], [9]])]
outputs = rnn.forward(sequence)
print(f"Processed {len(outputs)} time steps")
Unfolding RNNs Through Time
📖 Visualizing Recurrence
Unfolding (or unrolling) an RNN means visualizing it as a feedforward network where each time step becomes a separate layer. This helps understand how information flows through time.
Unfolding Example
For a sequence of length 3:
- t=0: h₀ = tanh(W_xh × x₀ + W_hh × h₋₁ + b_h)   (h₋₁ is the zero-initialized starting state)
- t=1: h₁ = tanh(W_xh × x₁ + W_hh × h₀ + b_h)
- t=2: h₂ = tanh(W_xh × x₂ + W_hh × h₁ + b_h)
Unfolded view:
  x₀           x₁           x₂
  ↓            ↓            ↓
[RNN] ──h₀──→ [RNN] ──h₁──→ [RNN] ──→ h₂
        (the same weights W_xh, W_hh, b_h are shared by every cell)
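Unrolling is not just a picture: computing h₂ by the explicit nested formulas gives exactly the same result as running the recurrence in a loop. A quick check with arbitrary small weights (all sizes here are assumptions):

```python
import numpy as np

# Check that the unrolled equations and the step-by-step loop agree.
rng = np.random.default_rng(3)
W_xh = rng.standard_normal((3, 2)) * 0.5
W_hh = rng.standard_normal((3, 3)) * 0.5
b_h = np.zeros((3, 1))
xs = [rng.standard_normal((2, 1)) for _ in range(3)]

# Loop form of the recurrence.
h = np.zeros((3, 1))
for x in xs:
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)

# Fully unrolled form: one line per time step, same shared weights.
h_init = np.zeros((3, 1))
h0 = np.tanh(W_xh @ xs[0] + W_hh @ h_init + b_h)
h1 = np.tanh(W_xh @ xs[1] + W_hh @ h0 + b_h)
h2 = np.tanh(W_xh @ xs[2] + W_hh @ h1 + b_h)

print(np.allclose(h, h2))  # True
```

This equivalence is exactly what lets us treat the RNN as a deep feedforward network during training, which is where BPTT comes from.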
Backpropagation Through Time (BPTT)
Training RNNs
Backpropagation Through Time (BPTT) is the algorithm used to train RNNs. It's essentially backpropagation applied to the unfolded network, computing gradients across all time steps.
BPTT Gradient Computation
Gradient flows backward through time. Since h_t = tanh(W_hh × h_{t-1} + W_xh × x_t + b_h), one backward step through the recurrence multiplies the gradient by the Jacobian:
∂h_t/∂h_{t-1} = diag(1 − h_t²) × W_hh
Key Insight:
- Gradient at time t depends on gradient at time t+1
- Must compute gradients backward through entire sequence
- Gradients are multiplied by W_hh at each step
- This can cause vanishing or exploding gradients
The Vanishing Gradient Problem
⚠️ RNN's Main Limitation
The vanishing gradient problem occurs when gradients become exponentially small as they propagate backward through time. This makes it difficult for RNNs to learn long-term dependencies.
Why Gradients Vanish
Gradient at time t−k (chaining the one-step Jacobians k times):
∂h_t/∂h_{t−k} = ∏_{i = t−k+1}^{t} diag(1 − h_i²) × W_hh
Problem:
- The gradient is multiplied by W_hh k times
- If the largest eigenvalue of W_hh has magnitude < 1, the gradient shrinks exponentially
- tanh′(z) ≤ 1, further reducing the gradient
- Result: Early time steps receive almost no gradient
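This decay is easy to see numerically: repeatedly multiplying a gradient by W_hhᵀ with small weights shrinks its norm by orders of magnitude. The sizes and scale below are illustrative assumptions, and including tanh′ ≤ 1 would only shrink the gradient further:

```python
import numpy as np

# During BPTT the backward signal is multiplied by W_hh (transposed) once per
# step. With small random weights the norm collapses after a few steps.
rng = np.random.default_rng(4)
W_hh = rng.standard_normal((8, 8)) * 0.1   # small-weight regime
grad = np.ones((8, 1))                     # gradient arriving at the last step

norms = []
for _ in range(20):                        # propagate 20 steps back in time
    grad = W_hh.T @ grad                   # ignoring tanh' is generous to the RNN
    norms.append(float(np.linalg.norm(grad)))

print(f"after 1 step back:   {norms[0]:.2e}")
print(f"after 20 steps back: {norms[-1]:.2e}")  # vastly smaller
```

With large weights the same loop explodes instead of vanishing, which is the mirror-image problem BPTT faces.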
📚 Real-World Impact
Example: Language Modeling
Sentence: "The cat, which was very fluffy, sat on the mat."
To predict "sat", the network needs to remember "cat" from much earlier. With vanishing gradients, the network can't learn this long-term dependency!
Solution: LSTMs and GRUs (covered in Chapter 7) solve this problem.
Complete RNN Implementation
Full RNN with BPTT
import numpy as np

class RNN:
    """Recurrent Neural Network with Backpropagation Through Time"""

    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # Initialize weights
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.b_h = np.zeros((hidden_size, 1))
        self.b_y = np.zeros((output_size, 1))

    def forward(self, x_sequence):
        """Forward pass through sequence"""
        seq_len = len(x_sequence)
        h_states = [np.zeros((self.hidden_size, 1))]  # h_states[0] is the initial state
        y_outputs = []
        for t in range(seq_len):
            # Compute hidden state
            h_t = np.tanh(np.dot(self.W_xh, x_sequence[t]) +
                          np.dot(self.W_hh, h_states[-1]) +
                          self.b_h)
            h_states.append(h_t)
            # Compute output
            y_t = np.dot(self.W_hy, h_t) + self.b_y
            y_outputs.append(y_t)
        return y_outputs, h_states

    def backward(self, x_sequence, y_outputs, h_states, targets, learning_rate=0.01):
        """Backpropagation Through Time"""
        seq_len = len(x_sequence)
        # Initialize gradients
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h = np.zeros_like(self.b_h)
        db_y = np.zeros_like(self.b_y)
        dh_next = np.zeros((self.hidden_size, 1))
        # Backward through time
        for t in reversed(range(seq_len)):
            # Output layer gradient (assumes a squared-error loss on y_t)
            dy = y_outputs[t] - targets[t]
            dW_hy += np.dot(dy, h_states[t + 1].T)
            db_y += dy
            # Hidden layer gradient: contribution from the output plus
            # the gradient flowing back from the next time step
            dh = np.dot(self.W_hy.T, dy) + dh_next
            dh_raw = (1 - h_states[t + 1] ** 2) * dh  # tanh derivative
            dW_xh += np.dot(dh_raw, x_sequence[t].T)
            dW_hh += np.dot(dh_raw, h_states[t].T)
            db_h += dh_raw
            dh_next = np.dot(self.W_hh.T, dh_raw)
        # Update weights
        self.W_xh -= learning_rate * dW_xh
        self.W_hh -= learning_rate * dW_hh
        self.W_hy -= learning_rate * dW_hy
        self.b_h -= learning_rate * db_h
        self.b_y -= learning_rate * db_y

# Usage
rnn = RNN(input_size=10, hidden_size=20, output_size=5)
# Training loop would go here
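One practical note: plain BPTT can also suffer from exploding gradients, and in practice it is commonly paired with gradient clipping before the weight update. This is not part of the class above; a minimal sketch:

```python
import numpy as np

# Gradient clipping: element-wise clip every gradient array to a fixed range
# before applying the weight update. clip_value = 5.0 is a common but
# arbitrary choice.
def clip_gradients(grads, clip_value=5.0):
    """Clip each gradient array in-place to [-clip_value, clip_value]."""
    for g in grads:
        np.clip(g, -clip_value, clip_value, out=g)
    return grads

g = np.array([[12.0, -0.5],
              [-7.0,  3.0]])
clip_gradients([g])
print(g)  # [[ 5.  -0.5] [-5.   3. ]]
```

In the backward pass above, this would be applied to dW_xh, dW_hh, dW_hy, db_h, and db_y just before the `-= learning_rate * ...` updates.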