Chapter 1: The Attention Mechanism
Foundation of Modern NLP - Understanding how attention revolutionized sequence modeling
Learning Objectives
- Understand why attention was needed in sequence-to-sequence models
- Master the Query, Key, Value (QKV) framework
- Learn the mathematical formulation of attention
- Understand how attention solves the information bottleneck
- Implement attention mechanism from scratch
- Recognize the limitations of RNNs/LSTMs that attention addresses
What is Attention?
The Revolutionary Idea
Attention is a mechanism that allows models to focus on relevant parts of the input when making predictions. It was introduced to solve a critical limitation in sequence-to-sequence models: the information bottleneck.
Think of attention like reading comprehension:
- Without attention: Like trying to summarize a book after reading it once - you remember only the main points and lose the details
- With attention: Like having the book open while writing - you can look back at any part when needed
- Key insight: When generating each word, the model can "attend to" (look at) any part of the input sequence
Real-World Example: Translation
Translating: "The cat sat on the mat" ā "Le chat s'est assis sur le tapis"
Without Attention (Old RNN Approach):
- Encoder processes entire sentence into one fixed-size vector
- Decoder must generate translation from this single vector
- Problem: Long sentences get compressed into the same-size vector → information loss
- Like trying to squeeze a 1000-page book into a one-sentence summary!
With Attention (Modern Approach):
- Encoder keeps all hidden states (one per input word)
- When generating "chat", decoder attends to "cat" in input
- When generating "tapis", decoder attends to "mat" in input
- Result: Each output word can access any input word directly!
The Problem Attention Solves
The Information Bottleneck
Sequence-to-sequence models (like RNNs) had a fundamental limitation:
The Bottleneck Problem
Old architecture (RNN encoder-decoder):
- Encoder: Processes input sequence word by word
- Final hidden state: Single vector representing entire input
- Decoder: Must generate output from this single vector
Why this is a problem:
- Short sentence: "Hello" → Easy to compress into one vector
- Long sentence: "The quick brown fox jumps over the lazy dog and then runs through the forest..." → Same-size vector!
- Result: Information loss for long sequences
- Like trying to summarize War and Peace in one sentence
Detailed Analysis of the Bottleneck
The fundamental issue: Traditional encoder-decoder architectures compress all information from the source sequence into a fixed-size context vector, regardless of sequence length. This creates several critical problems:
1. Information Loss in Long Sequences
Consider translating a 50-word sentence:
- Encoder processes: Word 1 → Word 2 → ... → Word 50
- Each step: Hidden state updates, but earlier information gets diluted
- Final state: Must encode all 50 words in a vector of the same size used for a 5-word sentence
- Result: Early words and fine details are lost or compressed beyond recognition
2. Fixed Context Window
The context vector has a fixed dimension (e.g., 512 or 1024), which means:
- Short sequences: Most of the vector space is unused or redundant
- Long sequences: Critical information must be discarded or averaged out
- No adaptive allocation of representation capacity based on sequence complexity
3. Sequential Processing Bottleneck
RNNs process sequences sequentially, which means:
- Cannot parallelize computation across sequence positions
- Long-range dependencies require many steps, leading to vanishing gradients
- Each word's representation depends on all previous words, creating a dependency chain
4. Real-World Impact
Translation example: Translating "The cat that the dog that the bird saw chased sat on the mat"
- Complex nested structure with multiple relative clauses
- RNN encoder: Must compress all relationships into one vector
- Decoder: Tries to reconstruct meaning from compressed representation
- Problem: Which "that" refers to which noun? This information is lost!
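To make this bottleneck concrete, here is a minimal sketch of an untrained, randomly initialized RNN encoder; the sizes and weights are illustrative assumptions, not values from any real model. Whether the input has 5 words or 50, the decoder would only ever see the single fixed-size vector produced at the end.

import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hidden = 16, 32                             # toy sizes (assumptions)
W_xh = 0.1 * rng.standard_normal((d_emb, d_hidden))
W_hh = 0.1 * rng.standard_normal((d_hidden, d_hidden))

def rnn_encode(embeddings):
    """Run a toy RNN over the sequence and return only the final hidden state."""
    h = np.zeros(d_hidden)
    for x in embeddings:                             # strictly sequential, no parallelism
        h = np.tanh(x @ W_xh + h @ W_hh)             # earlier words get progressively diluted
    return h                                         # one fixed-size vector, whatever the length

short_sentence = rng.standard_normal((5, d_emb))     # 5 "word embeddings"
long_sentence  = rng.standard_normal((50, d_emb))    # 50 "word embeddings"
print(rnn_encode(short_sentence).shape)              # (32,)
print(rnn_encode(long_sentence).shape)               # (32,) -- same size either way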
Why This Matters for NLP Tasks
Different tasks suffer differently from the bottleneck:
Machine Translation
- Long sentences lose word alignment information
- Complex grammatical structures get flattened
- Idiomatic expressions and context-dependent meanings are lost
Text Summarization
- Important details from early in the document are forgotten
- Key relationships between distant sentences are lost
- Summary quality degrades significantly for long documents
Question Answering
- Questions about specific details require accessing exact source positions
- Single vector cannot preserve all positional information
- Answer quality drops for questions requiring long-range context
The Attention Concept
How Attention Works
Attention allows the decoder to look at ALL encoder hidden states, not just the final one:
Attention Mechanism
For each decoder step:
- Compute attention scores for all encoder positions
- Convert scores to attention weights (probabilities)
- Create weighted combination of encoder hidden states
- Use this combination (context vector) for decoding
Breaking Down the Attention Process
Attention fundamentally changes how information flows from encoder to decoder:
Traditional Approach (Without Attention)
- Encoder: Input sequence → Final hidden state (single vector)
- Decoder: Final hidden state → Output sequence
- Information flow: All input → One vector → All output
- Problem: Information bottleneck at the single vector
Attention-Based Approach
- Encoder: Input sequence → All hidden states (one per input position)
- Decoder: For each output position, dynamically selects relevant encoder states
- Information flow: All input states → Dynamic selection → Each output position
- Advantage: Each output can access any input position directly
Step-by-Step Attention Process
Step 1: Compute Attention Scores
For each decoder position, compute how relevant each encoder position is:
- Compare decoder's current state with all encoder hidden states
- Generate a score for each encoder position
- Higher score = more relevant for current decoding step
- Result: A vector of scores (one per encoder position)
Step 2: Normalize to Attention Weights
Convert raw scores into probabilities using softmax:
- Apply softmax to attention scores
- Ensures all weights sum to 1 (probability distribution)
- Higher scores get exponentially higher weights
- Result: Attention weights that sum to 1.0
Step 3: Create Context Vector
Weighted combination of encoder hidden states:
- Multiply each encoder hidden state by its attention weight
- Sum all weighted encoder states
- This creates a context vector specific to current decoder position
- Result: A context vector that emphasizes relevant input positions
Step 4: Use Context for Decoding
Combine context vector with decoder's current state:
- Context vector provides relevant source information
- Decoder state provides target-side context
- Combined information guides next token generation
- Result: Better translation/decoding quality
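As a rough sketch of these four steps (assuming simple dot-product scoring between the decoder state and the raw encoder states, and made-up toy dimensions), a single decoder step might look like this:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((4, 8))   # 4 input positions, hidden size 8 (toy values)
decoder_state  = rng.standard_normal(8)        # decoder's current hidden state

scores  = encoder_states @ decoder_state       # Step 1: one relevance score per input position
weights = softmax(scores)                      # Step 2: normalize scores into attention weights
context = weights @ encoder_states             # Step 3: weighted sum of encoder states
# Step 4: combine `context` with decoder_state to predict the next token (model-specific)
print(weights.round(2), weights.sum())         # weights sum to 1.0
print(context.shape)                           # (8,)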
Why Attention Solves the Bottleneck
Key advantages of attention over fixed context vector:
1. Dynamic Information Access
- Each decoder position can focus on different encoder positions
- No fixed compression - information remains accessible
- Model learns which positions to attend to for each output
2. Preserves All Information
- All encoder hidden states are available, not just the final one
- No information is lost through compression
- Early and late positions in input are equally accessible
3. Interpretable Alignment
- Attention weights show which input words align with which output words
- Provides interpretability - we can visualize what the model focuses on
- Helps debug and understand model behavior
4. Handles Variable Lengths
- Works equally well for short and long sequences
- No fixed-size bottleneck
- Scales naturally with sequence length
Query, Key, Value (QKV) Framework
The Three Components
Attention uses three vectors: Query, Key, and Value
Library Analogy
Think of attention like searching a library:
- Query (Q): Your question - "I need information about cats"
- Key (K): Book titles/index - Each book has a title describing its content
- Value (V): Book content - The actual information in each book
The process:
- Compare your Query to all Keys (book titles)
- Find which Keys match your Query best
- Retrieve the Values (content) from matching books
- Combine the relevant Values to answer your question
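A tiny numerical version of this analogy, with made-up 2-dimensional keys and values standing in for book descriptions and book contents (all numbers are hypothetical):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

keys = np.array([[1.0, 0.0],     # a book about cats
                 [0.0, 1.0],     # a book about finance
                 [0.9, 0.1]])    # a book mostly about cats
values = np.array([[10.0, 0.0],  # the content of each book
                   [0.0, 10.0],
                   [8.0, 2.0]])
query = np.array([1.0, 0.0])     # "I need information about cats"

match = softmax(keys @ query)    # how well each book's key matches the question
answer = match @ values          # blend of the matching books' content
print(match.round(2))            # most weight on the cat books
print(answer.round(2))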
Understanding Query, Key, and Value
Query (Q) - "What am I looking for?"
In sequence-to-sequence models:
- Query comes from the decoder's current hidden state
- Represents: "What information do I need right now to generate the next word?"
- Example: When generating "chat" (French for "cat"), the query asks "What in the source sentence relates to what I'm translating?"
- Shape: (1, d_k) for one query, or (n, d_k) for multiple queries
Key (K) - "What information is available?"
In sequence-to-sequence models:
- Keys come from encoder hidden states
- Represents: "What information does each source position contain?"
- Example: Each source word has a key that describes its content and relevance
- Shape: (m, d_k) where m is source sequence length
Value (V) - "What is the actual information?"
In sequence-to-sequence models:
- Values also come from encoder hidden states
- Represents: "The actual content/representation at each source position"
- Example: The full semantic representation of each source word
- Shape: (m, d_v) where m is source sequence length
Why Separate Q, K, and V?
Key insight: Query and Key determine relevance, Value provides the information
1. Separation of Concerns
- Q and K: Used together to compute "how relevant is this?"
- V: Used to provide "what is the actual information?"
- This separation allows the model to learn different representations for matching vs. content
2. Flexibility in Representation
- Q, K, V can have different learned projections
- Model can learn: "How to describe what I'm looking for" (Q) vs "How to describe what I have" (K) vs "What information to extract" (V)
- This flexibility improves model capacity and performance
3. Real-World Example
Translating: "The cat sat on the mat" ā "Le chat s'est assis sur le tapis"
- When generating "chat":
- Query: "I'm generating a noun, what should it be?"
- Keys: ["The" ā article, "cat" ā noun, "sat" ā verb, ...]
- Attention: High score for "cat" key (matches query for noun)
- Value: The full representation of "cat" (semantic, grammatical info)
- Result: Context vector emphasizes "cat" information
Computing Q, K, V from Input
In practice, Q, K, V are computed from input embeddings using learned linear transformations:
Linear Projections
- Input: Embeddings X (e.g., from encoder or decoder)
- Query: Q = X × W_Q (learned weight matrix)
- Key: K = X × W_K (learned weight matrix)
- Value: V = X × W_V (learned weight matrix)
- Why different matrices? Allows model to learn different "views" of the same input
Dimension Considerations
- Q and K typically have same dimension (d_k) for dot product compatibility
- V can have different dimension (d_v), though often d_v = d_k
- Common: d_k = d_v = 64 (for each head in multi-head attention)
- Original Transformer: d_model = 512, d_k = d_v = 64
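A minimal sketch of these projections using the dimensions quoted above; the weight matrices are random here purely for illustration, whereas in a real model they are learned parameters:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_v = 512, 64, 64

X = rng.standard_normal((6, d_model))         # 6 token embeddings
W_Q = rng.standard_normal((d_model, d_k))     # stand-ins for learned weight matrices
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # three different "views" of the same input
print(Q.shape, K.shape, V.shape)              # (6, 64) (6, 64) (6, 64)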
Attention Formula
Scaled Dot-Product Attention
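Written out in one line, this is the standard scaled dot-product attention formula (the steps below unpack each part):

Attention(Q, K, V) = softmax(QK^T / √d_k) × V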
Step-by-Step Breakdown:
- QK^T: Compute similarity between queries and keys (dot product)
- / √d_k: Scale by square root of key dimension (prevents large values)
- softmax(...): Convert scores to probabilities (attention weights)
- × V: Weighted sum of values based on attention weights
Deep Dive into Each Component
1. Computing QK^T (Dot Product Similarity)
What it does: Measures how similar each query is to each key
- Mathematical operation: Matrix multiplication of Q and K^T (transpose of K)
- Result shape: (num_queries, num_keys)
- Each element: Dot product between one query and one key
- Higher values: Query and key are more similar/relevant
Example: If Q has 1 query and K has 5 keys:
- QK^T produces 5 scores: [score_1, score_2, score_3, score_4, score_5]
- score_1 = how similar query is to key_1
- score_2 = how similar query is to key_2
- And so on...
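For example, with one random query and five random keys of the shapes described above:

import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((1, 64))   # 1 query, 64 dimensions
K = rng.standard_normal((5, 64))   # 5 keys, 64 dimensions

scores = Q @ K.T                   # one dot product per (query, key) pair
print(scores.shape)                # (1, 5): five similarity scores for the single query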
2. Scaling by √d_k (Why It's Critical)
The problem without scaling:
- Dot products grow with dimension d_k
- For large d_k, dot products can be very large (e.g., 100, 200, 500+)
- Large values push softmax into saturation regions
- Result: Very small gradients, making training difficult
The solution - scaling:
- Divide by √d_k to normalize the scale
- Keeps dot products in a reasonable range (typically -10 to +10)
- Softmax operates in a region with good gradients
- Result: Stable training and better gradients
Mathematical intuition: If the components of Q and K are independent with mean 0 and variance 1, their dot product has variance d_k. Dividing by √d_k brings the variance back to 1.
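This claim is easy to check numerically; the sketch below assumes the components of Q and K are drawn independently from a standard normal distribution:

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

dots = (q * k).sum(axis=1)                   # 10,000 raw dot products
print(dots.var())                            # roughly d_k (about 64)
print((dots / np.sqrt(d_k)).var())           # roughly 1 after scaling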
3. Softmax Normalization
What softmax does: Converts raw scores into a probability distribution
- Input: Raw attention scores (can be any real numbers)
- Output: Probabilities that sum to 1.0
- Formula: softmax(x_i) = exp(x_i) / Σ exp(x_j)
- Effect: Higher scores get exponentially higher probabilities
Example: Scores [2, 1, 0.5] → Softmax → [0.63, 0.23, 0.14]
- Highest score (2) gets 63% of the attention
- Medium score (1) gets 23%
- Lowest score (0.5) gets 14%
- All sum to 1.0 (probability distribution)
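These numbers can be reproduced directly:

import numpy as np

scores = np.array([2.0, 1.0, 0.5])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(2))          # [0.63 0.23 0.14]
print(round(weights.sum(), 6))   # 1.0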
4. Weighted Sum with Values
Final step: Multiply attention weights by values and sum
- Attention weights: [0.63, 0.23, 0.14] (from softmax)
- Values: [V₁, V₂, V₃] (encoder hidden states)
- Output: 0.63×V₁ + 0.23×V₂ + 0.14×V₃
- Result: A context vector that emphasizes V₁ (most relevant)
Why this works:
- More relevant positions (higher attention weights) contribute more
- Less relevant positions contribute less
- Creates a focused representation for the current decoding step
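Continuing with the weights from the softmax example and some made-up 2-dimensional values:

import numpy as np

weights = np.array([0.63, 0.23, 0.14])   # attention weights from above
V = np.array([[1.0, 0.0],                # toy value vectors V₁, V₂, V₃
              [0.0, 1.0],
              [1.0, 1.0]])

context = weights @ V                    # 0.63×V₁ + 0.23×V₂ + 0.14×V₃
print(context.round(2))                  # [0.77 0.37]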
Complete Mathematical Flow
Let's trace through a concrete example:
Example: Translating "cat" ā "chat"
Setup:
- Source: "The cat sat on the mat" (5 words)
- Currently generating: "chat" (French for "cat")
- d_k = 64
Step 1: Compute QK^T
- Q: Query from decoder (1 query vector, 64-dim)
- K: Keys from encoder (6 key vectors, each 64-dim)
- QK^T: [0.8, 20.0, 2.4, 0.8, 1.6, 1.6] (6 scores, one per source word)
- High score (20.0) at the "cat" position → "cat" is highly relevant!
Step 2: Scale by √d_k
- √64 = 8
- Scaled scores: [0.1, 2.5, 0.3, 0.1, 0.2, 0.2]
- Scores are now in a reasonable range
Step 3: Apply Softmax
- Softmax([0.1, 2.5, 0.3, 0.1, 0.2, 0.2])
- Attention weights: [0.06, 0.67, 0.07, 0.06, 0.07, 0.07]
- About 67% of the attention lands on "cat" - exactly what we want!
Step 4: Weighted Sum
- 0.06×V("The") + 0.67×V("cat") + 0.07×V("sat") + 0.06×V("on") + 0.07×V("the") + 0.07×V("mat")
- The context vector heavily emphasizes "cat" information
- Decoder uses this to generate "chat"
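The numbers in this walkthrough can be checked with a few lines of numpy:

import numpy as np

scores = np.array([0.8, 20.0, 2.4, 0.8, 1.6, 1.6])   # QK^T, one score per source word
scaled = scores / np.sqrt(64)                         # [0.1 2.5 0.3 0.1 0.2 0.2]
weights = np.exp(scaled) / np.exp(scaled).sum()       # softmax
print(weights.round(2))                               # [0.06 0.67 0.07 0.06 0.07 0.07]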
Implementation
Attention Implementation
import numpy as np

def attention(Q, K, V):
    """
    Scaled dot-product attention.

    Parameters:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)

    Returns:
        output: context vectors, shape (seq_len_q, d_v)
        attention_weights: shape (seq_len_q, seq_len_k)
    """
    d_k = K.shape[-1]
    # Compute scaled attention scores
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    # Apply softmax so each row of weights sums to 1
    attention_weights = softmax(scores, axis=-1)
    # Weighted sum of values
    output = np.dot(attention_weights, V)
    return output, attention_weights

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Example usage
Q = np.random.randn(5, 64)   # 5 queries, 64 dimensions
K = np.random.randn(10, 64)  # 10 keys, 64 dimensions
V = np.random.randn(10, 64)  # 10 values, 64 dimensions

output, weights = attention(Q, K, V)
print(f"Output shape: {output.shape}")              # (5, 64)
print(f"Attention weights shape: {weights.shape}")  # (5, 10)