Chapter 6: Recurrent Neural Networks (RNNs)

Networks with memory for sequential data processing

Learning Objectives

  • Understand why RNNs are needed for sequences
  • Master the recurrence relation in RNNs
  • Learn unrolling/unfolding RNNs through time
  • Understand Backpropagation Through Time (BPTT)
  • Recognize the vanishing gradient problem
  • Implement a simple RNN from scratch

What are Recurrent Neural Networks?

Networks with Memory

Recurrent Neural Networks (RNNs) are designed to process sequential data by maintaining a hidden state that captures information about previous inputs. Unlike feedforward networks, RNNs have connections that form cycles, allowing information to persist.

Think of RNNs like reading a book:

  • Feedforward Network: Like reading random sentences from different pages - no context, no understanding of the story
  • RNN: Like reading page by page, remembering what happened before - you understand the plot because you remember previous chapters
  • Hidden State: Like your memory of the story so far - it influences how you understand the current page

The Fundamental Problem with Feedforward Networks

Why can't we just use regular neural networks for sequences?

Problem 1: Fixed Input Size

Feedforward networks require fixed-size inputs:

  • If your network expects 10 inputs, you MUST provide exactly 10
  • But sequences vary in length: "Hello" (5 characters) vs "Hello, how are you?" (19 characters)
  • Solution needed: Process one element at a time, regardless of sequence length
Problem 2: No Memory

Feedforward networks have no memory of previous inputs:

Example sentence: "The cat sat on the mat because it was tired."

  • To understand "it", you need to remember "cat" from earlier
  • Feedforward network: "What is 'it'?" → No idea, no memory of "cat"
  • RNN: "What is 'it'?" → Remembers "cat" from hidden state → "it" refers to "cat"
Problem 3: Order Matters

In sequences, order is critical:

  • "Dog bites man" vs "Man bites dog" - completely different meanings!
  • Feedforward network: Treats input as unordered set → can't distinguish
  • RNN: Processes in order → understands the difference

The RNN Solution: Three Key Innovations

1. Sequential Processing

RNNs process sequences one element at a time:

  • Input: "The cat sat"
  • Step 1: Process "The" → update hidden state
  • Step 2: Process "cat" → update hidden state (remembers "The")
  • Step 3: Process "sat" → update hidden state (remembers "The cat")
  • Key: Each step builds on previous context
2. Hidden State (Memory)

The hidden state is the RNN's memory:

  • It's a vector that encodes information from all previous inputs
  • Like a summary of "what has happened so far"
  • Updated at each time step to incorporate new information
  • Analogy: Like a running summary of a conversation - you remember key points, not every word
3. Shared Weights

The same weights are used at every time step:

  • Same transformation applied to each input
  • Like using the same "reading comprehension" skill for every word
  • Benefit: Far fewer parameters than having separate weights for each position
  • Analogy: Like using the same grammar rules for every sentence, regardless of position
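The parameter-count benefit is easy to check: the total number of weights depends only on the layer sizes, never on the sequence length. A small sketch (the sizes below are made up for illustration):

```python
def rnn_param_count(input_size, hidden_size, output_size):
    """Total parameters of a simple RNN: W_xh, W_hh, W_hy, b_h, b_y."""
    return (hidden_size * input_size      # W_xh: input-to-hidden
            + hidden_size * hidden_size   # W_hh: hidden-to-hidden
            + output_size * hidden_size   # W_hy: hidden-to-output
            + hidden_size + output_size)  # b_h, b_y: biases

# The count is identical whether the sequence has 3 steps or 3000:
print(rnn_param_count(input_size=50, hidden_size=100, output_size=10))  # 16110
```

A feedforward network with separate weights per position would instead grow linearly with sequence length.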

📚 Real-World Analogy: The RNN as a Conversation

Imagine having a conversation:

  • Feedforward Network: Like talking to someone with amnesia - they respond to each sentence independently, forgetting everything you said before
  • RNN: Like a normal conversation - the person remembers what you discussed earlier and uses that context to understand your current statement

Example conversation:

  • You: "I have a pet."
  • RNN hidden state: [remembers: "pet mentioned"]
  • You: "It's very fluffy."
  • RNN: Uses hidden state → "It" must refer to the pet → understands correctly
  • You: "It likes to sleep."
  • RNN: Still remembers the pet → continues to understand "it"

🔄 RNN Sequential Flow Visualization

h₀ = [0, 0, 0] (no memory yet)

h₀ + "The" → h₁ [remembers "The"]

h₁ + "cat" → h₂ [remembers "The cat"]

h₂ + "sat" → h₃ [remembers "The cat sat"]

💡 Key: Each hidden state accumulates information from all previous inputs. The arrows (→) show information flowing forward through time!

The Recurrence Relation

The Heart of RNNs

The recurrence relation is what makes RNNs special - it connects the current state to previous states, creating memory.

Think of it like updating your understanding:

  • Previous understanding (h_{t-1}): What you knew before
  • New information (x_t): What you're learning now
  • Updated understanding (h_t): Combines old and new information
  • Key insight: Your current understanding depends on both what you knew before AND what you're learning now

RNN Forward Pass - The Complete Formula

At each time step t, the RNN performs two operations:

Step 1: Update Hidden State

\[h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h)\]

Step 2: Compute Output

\[y_t = W_{hy}\, h_t + b_y\]
Breaking Down Step 1: Hidden State Update

h_t = tanh(W_hh × h_{t-1} + W_xh × x_t + b_h)

This formula has three parts:

  • W_hh × h_{t-1}: "What to remember from the past"
    • Transforms previous hidden state
    • Decides which parts of memory to keep
    • Like filtering your memory: "What's still relevant?"
  • W_xh × x_t: "What to learn from the current input"
    • Transforms current input
    • Extracts relevant information
    • Like taking notes: "What's important in this new information?"
  • + b_h: Bias term (allows shifting the activation)
  • tanh(...): Activation function (squashes values to [-1, 1])

In plain English: "My new understanding = combine (filtered memory from before + processed new information), then normalize it"

Breaking Down Step 2: Output Computation

y_t = W_hy × h_t + b_y

This is simpler:

  • Take the updated hidden state
  • Transform it to produce the desired output
  • Like translating your understanding into an answer
Notation Reference:
  • h_t: Hidden state at time t (the memory/understanding)
  • h_{t-1}: Hidden state at previous time step (previous memory)
  • x_t: Input at time t (new information)
  • y_t: Output at time t (prediction/response)
  • W_hh: Hidden-to-hidden weight matrix (how to update memory)
  • W_xh: Input-to-hidden weight matrix (how to process new input)
  • W_hy: Hidden-to-output weight matrix (how to generate output)
  • b_h, b_y: Bias terms (allow shifting activations)
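The two formulas translate directly into NumPy. Below is a minimal single-step sketch; the sizes and random values are arbitrary, chosen only to show the shapes implied by the notation above:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# Parameters (small random values; shapes follow the notation above)
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

h_prev = rng.standard_normal((hidden_size, 1)) * 0.1  # h_{t-1}: previous memory
x_t = rng.standard_normal((input_size, 1))            # x_t: new information

# Step 1: update hidden state
h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
# Step 2: compute output
y_t = W_hy @ h_t + b_y

print(h_t.shape, y_t.shape)  # (4, 1) (2, 1)
```

Note that tanh guarantees every entry of h_t lands in [-1, 1], regardless of the input magnitudes.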

Step-by-Step Example: Processing "The cat sat"

Let's trace through how an RNN processes this sentence:

Time Step 1: Processing "The"
  • Input x₁: "The" (encoded as a vector, e.g., [0.2, 0.1, 0.8, ...])
  • Previous hidden state h₀: [0, 0, 0, ...] (initialized to zeros - no memory yet)
  • Compute h₁:
    • W_hh × h₀ = [0, 0, 0, ...] (no previous memory)
    • W_xh × x₁ = [0.1, 0.3, 0.2, ...] (processed "The")
    • h₁ = tanh([0, 0, 0, ...] + [0.1, 0.3, 0.2, ...] + b_h)
    • h₁ = [0.1, 0.3, 0.2, ...] (memory now contains information about "The")
  • Output y₁: W_hy × h₁ + b_y (might predict next word or classify)
Time Step 2: Processing "cat"
  • Input x₂: "cat" (encoded as [0.5, 0.7, 0.1, ...])
  • Previous hidden state h₁: [0.1, 0.3, 0.2, ...] (remembers "The")
  • Compute h₂:
    • W_hh × h₁ = [0.05, 0.15, 0.1, ...] (filtered memory of "The")
    • W_xh × x₂ = [0.3, 0.4, 0.2, ...] (processed "cat")
    • h₂ = tanh([0.05, 0.15, 0.1, ...] + [0.3, 0.4, 0.2, ...] + b_h)
    • h₂ = [0.35, 0.55, 0.3, ...] (memory now contains "The cat")
  • Key insight: h₂ combines information from both "The" and "cat"!
Time Step 3: Processing "sat"
  • Input x₃: "sat" (encoded as [0.2, 0.3, 0.6, ...])
  • Previous hidden state h₂: [0.35, 0.55, 0.3, ...] (remembers "The cat")
  • Compute h₃:
    • W_hh × h₂ = [0.18, 0.28, 0.15, ...] (filtered memory of "The cat")
    • W_xh × x₃ = [0.1, 0.15, 0.3, ...] (processed "sat")
    • h₃ = tanh([0.18, 0.28, 0.15, ...] + [0.1, 0.15, 0.3, ...] + b_h)
    • h₃ = [0.28, 0.43, 0.45, ...] (memory now contains "The cat sat")
  • Result: The hidden state has accumulated information from the entire sequence!
🎯 Key Takeaways
  • Information accumulates: Each time step adds to the memory
  • Context builds: Later predictions can use information from much earlier
  • Same weights: W_hh, W_xh, W_hy are used at every time step
  • Variable length: Can process sequences of any length using the same process
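The numbers in the trace above are made up, but the accumulation effect is easy to reproduce. This sketch feeds three one-hot "word" vectors through one set of shared weights (weights and seed are arbitrary) and shows the hidden state changing at every step:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["The", "cat", "sat"]
hidden_size = 4

# Shared weights, used at every time step
W_xh = rng.standard_normal((hidden_size, len(vocab))) * 0.5
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.5
b_h = np.zeros((hidden_size, 1))

h = np.zeros((hidden_size, 1))  # h0: empty memory
for t, word in enumerate(vocab, start=1):
    x = np.zeros((len(vocab), 1))
    x[vocab.index(word)] = 1.0               # one-hot encoding of the word
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # recurrence relation
    print(f"h{t} after '{word}':", h.ravel().round(2))
```

Each printed state depends on every word seen so far; reordering the inputs would produce a different final state, which is exactly why RNNs can distinguish "Dog bites man" from "Man bites dog".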

Simple RNN Implementation

import numpy as np

class SimpleRNN:
    """Simple Recurrent Neural Network"""
    
    def __init__(self, input_size, hidden_size, output_size):
        # Weight matrices
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.1
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.1
        
        # Biases
        self.b_h = np.zeros((hidden_size, 1))
        self.b_y = np.zeros((output_size, 1))
        
        # Hidden state (initialized to zero)
        self.h = np.zeros((hidden_size, 1))
    
    def tanh(self, x):
        return np.tanh(x)
    
    def forward_step(self, x):
        """Process one time step"""
        # Update hidden state
        self.h = self.tanh(np.dot(self.W_xh, x) + 
                          np.dot(self.W_hh, self.h) + 
                          self.b_h)
        
        # Compute output
        y = np.dot(self.W_hy, self.h) + self.b_y
        
        return y, self.h
    
    def forward(self, sequence):
        """Process entire sequence"""
        outputs = []
        self.h = np.zeros((self.W_hh.shape[0], 1))  # Reset state
        
        for x in sequence:
            y, h = self.forward_step(x)
            outputs.append(y)
        
        return outputs

# Example: Process sequence
rnn = SimpleRNN(input_size=3, hidden_size=4, output_size=2)
sequence = [np.array([[1], [2], [3]]), 
            np.array([[4], [5], [6]]),
            np.array([[7], [8], [9]])]

outputs = rnn.forward(sequence)
print(f"Processed {len(outputs)} time steps")

Unfolding RNNs Through Time

📖 Visualizing Recurrence

Unfolding (or unrolling) an RNN means visualizing it as a feedforward network where each time step becomes a separate layer. This helps understand how information flows through time.

Unfolding Example

For a sequence of length 3:

  • t=0: h₀ = tanh(W_xh × x₀ + W_hh × h_{-1} + b_h), where h_{-1} is the initial hidden state (all zeros)
  • t=1: h₁ = tanh(W_xh × x₁ + W_hh × h₀ + b_h)
  • t=2: h₂ = tanh(W_xh × x₂ + W_hh × h₁ + b_h)

Unfolded view:

  x₀           x₁           x₂
   ↓            ↓            ↓
[RNN] → h₀ → [RNN] → h₁ → [RNN] → h₂

(every cell reuses the same shared weights W_xh, W_hh, b_h)
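Unrolling can be verified directly in code: writing the three steps out by hand gives exactly the same states as the recurrent loop, because every step reuses the same weights. A minimal sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 3, 4  # arbitrary input and hidden sizes

W_xh = rng.standard_normal((H, D)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
b_h = np.zeros((H, 1))
xs = [rng.standard_normal((D, 1)) for _ in range(3)]  # x0, x1, x2

# Unfolded view: each time step written out as its own "layer"
h_init = np.zeros((H, 1))  # h_{-1}, the initial state
h0 = np.tanh(W_xh @ xs[0] + W_hh @ h_init + b_h)
h1 = np.tanh(W_xh @ xs[1] + W_hh @ h0 + b_h)
h2 = np.tanh(W_xh @ xs[2] + W_hh @ h1 + b_h)

# Recurrent view: the same computation as a loop over time
h = np.zeros((H, 1))
loop_states = []
for x in xs:
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    loop_states.append(h)

# Both views produce identical states because the weights are shared
print(np.allclose(h0, loop_states[0]),
      np.allclose(h1, loop_states[1]),
      np.allclose(h2, loop_states[2]))  # True True True
```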

Backpropagation Through Time (BPTT)

Training RNNs

Backpropagation Through Time (BPTT) is the algorithm used to train RNNs. It's essentially backpropagation applied to the unfolded network, computing gradients across all time steps.

BPTT Gradient Computation

Gradient flows backward through time:

\[\frac{\partial L}{\partial h_t} = W_{hy}^{\top}\,\frac{\partial L}{\partial y_t} + W_{hh}^{\top}\left(\frac{\partial L}{\partial h_{t+1}} \odot \tanh'(z_{t+1})\right)\]
Key Insight:
  • Gradient at time t depends on gradient at time t+1
  • Must compute gradients backward through entire sequence
  • Gradients are multiplied by W_hh at each step
  • This can cause vanishing or exploding gradients
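One backward step of this recursion is a pair of vector-Jacobian products, which is why the transposed weight matrices appear. A toy sketch with arbitrary sizes and values:

```python
import numpy as np

rng = np.random.default_rng(0)
H, O = 4, 2  # hidden and output sizes (arbitrary)
W_hy = rng.standard_normal((O, H))
W_hh = rng.standard_normal((H, H))

dy = rng.standard_normal((O, 1))               # dL/dy_t from the loss
dh_next = rng.standard_normal((H, 1))          # dL/dh_{t+1}, already computed
h_next = np.tanh(rng.standard_normal((H, 1)))  # h_{t+1}; tanh'(z) = 1 - h^2

# dL/dh_t: the output-path term plus the through-time term.
# Shapes: (H,O)@(O,1) + (H,H)@(H,1) -> (H,1)
dh = W_hy.T @ dy + W_hh.T @ (dh_next * (1 - h_next ** 2))
print(dh.shape)  # (4, 1)
```

The same two lines appear inside the backward loop of the full implementation later in this chapter.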

The Vanishing Gradient Problem

⚠️ RNN's Main Limitation

The vanishing gradient problem occurs when gradients become exponentially small as they propagate backward through time. This makes it difficult for RNNs to learn long-term dependencies.

Why Gradients Vanish

Gradient at time t-k:

\[\frac{\partial L}{\partial h_{t-k}} \approx \frac{\partial L}{\partial h_t}\,(W_{hh})^{k}\,\tanh'(z_{t-k+1}) \cdots \tanh'(z_{t})\]
Problem:
  • Gradient is multiplied by W_hh k times
  • If |W_hh| < 1, gradient shrinks exponentially
  • tanh'(z) ≤ 1, further reducing gradient
  • Result: Early time steps receive almost no gradient
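The exponential shrinkage is easy to observe numerically: repeatedly multiplying a gradient vector by the same W_hh (here scaled small enough that its largest singular value is below 1) drives its norm toward zero. A toy sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 20
W_hh = rng.standard_normal((H, H)) * 0.05  # small weights -> spectral norm < 1

grad = np.ones((H, 1))   # pretend gradient arriving at time t
norms = []
for k in range(30):      # propagate 30 steps backward in time
    grad = W_hh.T @ grad            # one backward step through W_hh
    norms.append(np.linalg.norm(grad))
    # (the tanh' factors, each <= 1, are omitted here;
    #  including them would only shrink the gradient further)

print(f"after 1 step: {norms[0]:.4f}, after 30 steps: {norms[-1]:.2e}")
```

After a few dozen steps the gradient is numerically negligible, which is why early time steps learn almost nothing in a plain RNN.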

📚 Real-World Impact

Example: Language Modeling

Sentence: "The cat, which was very fluffy, sat on the mat."

To predict "sat", the network needs to remember "cat" from much earlier. With vanishing gradients, the network can't learn this long-term dependency!

Solution: LSTMs and GRUs (covered in Chapter 7) solve this problem.

Complete RNN Implementation

Full RNN with BPTT

import numpy as np

class RNN:
    """Recurrent Neural Network with Backpropagation Through Time"""
    
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Initialize weights
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        
        self.b_h = np.zeros((hidden_size, 1))
        self.b_y = np.zeros((output_size, 1))
    
    def forward(self, x_sequence):
        """Forward pass through sequence"""
        seq_len = len(x_sequence)
        h_states = [np.zeros((self.hidden_size, 1))]
        y_outputs = []
        
        for t in range(seq_len):
            # Compute hidden state
            h_t = np.tanh(np.dot(self.W_xh, x_sequence[t]) + 
                         np.dot(self.W_hh, h_states[-1]) + 
                         self.b_h)
            h_states.append(h_t)
            
            # Compute output
            y_t = np.dot(self.W_hy, h_t) + self.b_y
            y_outputs.append(y_t)
        
        return y_outputs, h_states
    
    def backward(self, x_sequence, y_outputs, h_states, targets, learning_rate=0.01):
        """Backpropagation Through Time"""
        seq_len = len(x_sequence)
        
        # Initialize gradients
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h = np.zeros_like(self.b_h)
        db_y = np.zeros_like(self.b_y)
        
        dh_next = np.zeros((self.hidden_size, 1))
        
        # Backward through time
        for t in reversed(range(seq_len)):
            # Output layer gradient
            dy = y_outputs[t] - targets[t]
            dW_hy += np.dot(dy, h_states[t+1].T)
            db_y += dy
            
            # Hidden layer gradient
            dh = np.dot(self.W_hy.T, dy) + dh_next
            dh_raw = (1 - h_states[t+1]**2) * dh  # tanh derivative
            
            dW_xh += np.dot(dh_raw, x_sequence[t].T)
            dW_hh += np.dot(dh_raw, h_states[t].T)
            db_h += dh_raw
            
            dh_next = np.dot(self.W_hh.T, dh_raw)
        
        # Clip gradients to mitigate exploding gradients (standard practice)
        for grad in (dW_xh, dW_hh, dW_hy, db_h, db_y):
            np.clip(grad, -5, 5, out=grad)
        
        # Update weights (gradient descent on the shared parameters)
        self.W_xh -= learning_rate * dW_xh
        self.W_hh -= learning_rate * dW_hh
        self.W_hy -= learning_rate * dW_hy
        self.b_h -= learning_rate * db_h
        self.b_y -= learning_rate * db_y

# Usage
rnn = RNN(input_size=10, hidden_size=20, output_size=5)
# Training loop would go here
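One way the training loop could be filled in is sketched below. The task (predict the next vector in a short random sequence) and all sizes, seeds, and the learning rate are made up purely to exercise forward and BPTT; the sketch repeats the class's logic inline so it runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
D = H = 4  # toy input/output and hidden sizes

# Parameters (same roles as in the RNN class above)
W_xh = rng.standard_normal((H, D)) * 0.01
W_hh = rng.standard_normal((H, H)) * 0.01
W_hy = rng.standard_normal((D, H)) * 0.01
b_h, b_y = np.zeros((H, 1)), np.zeros((D, 1))

# Toy data: the target at each step is the next input vector
xs = [rng.standard_normal((D, 1)) * 0.5 for _ in range(5)]
inputs, targets = xs[:-1], xs[1:]

losses = []
for epoch in range(300):
    # Forward pass through the sequence
    hs, ys = [np.zeros((H, 1))], []
    for x in inputs:
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1] + b_h))
        ys.append(W_hy @ hs[-1] + b_y)
    losses.append(sum(float(np.sum((y - t) ** 2))
                      for y, t in zip(ys, targets)))

    # BPTT (mirrors RNN.backward)
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W_xh), np.zeros_like(W_hh),
                           np.zeros_like(W_hy))
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros((H, 1))
    for t in reversed(range(len(inputs))):
        dy = ys[t] - targets[t]
        dW_hy += dy @ hs[t + 1].T; db_y += dy
        dh = W_hy.T @ dy + dh_next
        dh_raw = (1 - hs[t + 1] ** 2) * dh      # tanh derivative
        dW_xh += dh_raw @ inputs[t].T
        dW_hh += dh_raw @ hs[t].T
        db_h += dh_raw
        dh_next = W_hh.T @ dh_raw

    lr = 0.05
    W_xh -= lr * dW_xh; W_hh -= lr * dW_hh; W_hy -= lr * dW_hy
    b_h -= lr * db_h; b_y -= lr * db_y

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The loss drops steadily on this toy task; a real application would swap in proper data, a task-appropriate loss, and mini-batching.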

Test Your Understanding

Question 1: What is the main advantage of RNNs over feedforward networks?

A) They are faster to train
B) They can process sequences and maintain memory of previous inputs
C) They require fewer parameters
D) They always give better accuracy

Question 2: What is the vanishing gradient problem in RNNs?

A) Gradients become exponentially small when propagating backward through time
B) RNNs forget previous inputs too quickly
C) RNNs require too much memory
D) RNNs can't process sequences

Question 3: What does BPTT stand for?

A) Backpropagation Through Training
B) Backpropagation Through Time
C) Backward Propagation Through Time
D) Best Performance Through Training

Question 4: How does an RNN maintain memory of previous inputs?

A) RNNs use a hidden state that gets updated at each time step, carrying information from previous steps forward through the sequence
B) By storing all previous inputs
C) By using separate networks
D) They don't maintain memory

Question 5: What is the mathematical formula for RNN hidden state update?

A) \(h_t = f(W_h h_{t-1} + W_x x_t + b)\) where f is activation, W_h and W_x are weight matrices, h is hidden state, x is input
B) \(h_t = x_t\)
C) \(h_t = W \times x_t\)
D) \(h_t = h_{t-1}\)

Question 6: Why do RNNs struggle with long sequences?

A) Vanishing and exploding gradients make it hard to learn long-term dependencies, and the simple hidden state update struggles to retain information over many steps
B) They're too fast
C) They use too many parameters
D) They can't process sequences

Question 7: What are the different RNN architectures for sequence tasks?

A) Many-to-one (sequence to single output like classification), one-to-many (single input to sequence like generation), many-to-many (sequence to sequence like translation), with different input/output patterns
B) Only many-to-one
C) Only one-to-many
D) All are the same

Question 8: How does backpropagation through time work in RNNs?

A) Gradients are computed at each time step and propagated backward through the unrolled network, with gradients accumulating across time steps, requiring careful handling of the shared weights
B) Only backward through one step
C) No backpropagation needed
D) Only forward pass

Question 9: What is the difference between unidirectional and bidirectional RNNs?

A) Unidirectional processes sequence left-to-right only, while bidirectional processes both directions simultaneously, allowing the network to use both past and future context
B) They're the same
C) Unidirectional uses both directions
D) Bidirectional is one-way only

Question 10: When would you use an RNN versus a feedforward network?

A) Use RNNs for sequential data where order matters (text, time series, speech), and feedforward networks for fixed-size inputs where order doesn't matter (images, tabular data)
B) Always use RNNs
C) Always use feedforward
D) They're interchangeable

Question 11: What causes exploding gradients in RNNs and how do you fix it?

A) When gradients grow exponentially through time steps, often due to large weights. Fix with gradient clipping (cap gradient magnitude), careful weight initialization, or use LSTM/GRU architectures
B) Can't be fixed
C) Increase learning rate
D) Use more layers

Question 12: How would you implement a simple RNN from scratch?

A) Initialize weight matrices for input-to-hidden, hidden-to-hidden, and hidden-to-output. For each time step, compute hidden state using previous hidden state and current input, apply activation, compute output. During training, unroll the network and use backpropagation through time
B) Just use a feedforward network
C) Only forward pass needed
D) No implementation needed