Chapter 2: Self-Attention Mechanism
Attention is All You Need - How self-attention enables parallel processing
Learning Objectives
- Understand the difference between attention and self-attention
- Master how self-attention computes relationships within a sequence
- Learn the scaled dot-product attention formula in detail
- Understand how attention weights reveal relationships
- Recognize why self-attention enables parallel processing
- Implement self-attention from scratch
What is Self-Attention?
Attention Within a Sequence
Self-attention is attention applied to the same sequence - each position attends to all positions in the same sequence, including itself.
Key difference from regular attention:
- Regular Attention: Query from one sequence, Keys/Values from another sequence (encoder-decoder)
- Self-Attention: Query, Key, and Value all come from the same sequence
Think of self-attention like a group discussion:
- Each person (word) can look at and consider what everyone else (all words) is saying
- Each person forms their understanding based on the full context
- This happens simultaneously for everyone - parallel processing!
Example: Understanding "it" in Context
Sentence: "The cat sat on the mat because it was tired."
Self-Attention Process:
- When processing "it", self-attention looks at ALL words in the sentence
- It computes: "How relevant is each word for understanding 'it'?"
- High attention weight to "cat" → "it" refers to "cat"
- Lower attention weights to other words
- Result: "it" gets a representation that includes information about "cat"
Self-Attention vs Regular Attention
The Key Distinction
Regular Attention (Encoder-Decoder)
- Source: Encoder hidden states (e.g., French sentence)
- Target: Decoder hidden state (e.g., English word being generated)
- Query: From decoder (what am I looking for?)
- Keys/Values: From encoder (what information is available?)
- Use case: Translation, summarization
Self-Attention (Within Sequence)
- Source: Same sequence (e.g., English sentence)
- Target: Same sequence positions
- Query, Key, Value: All from the same sequence
- Use case: Understanding relationships within text, encoding context
Detailed Comparison
Architectural Differences
Regular Attention (Cross-Attention):
- Two sequences: Source sequence and target sequence
- Information flow: From source to target
- Query source: Target sequence (decoder)
- Key/Value source: Source sequence (encoder)
- Purpose: Allow target to access source information
- Example: When translating, decoder queries encoder states to find relevant source words
Self-Attention:
- One sequence: Same sequence for all components
- Information flow: Within the sequence itself
- Query, Key, Value: All from the same sequence
- Purpose: Allow each position to access all other positions
- Example: In "The cat sat", "cat" can attend to "The" and "sat" to understand its role
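To make this contrast concrete, here is a minimal NumPy sketch of where the queries, keys, and values come from in each case. It is illustrative only: random vectors stand in for real hidden states, and the learned projection matrices are omitted for brevity.

import numpy as np

d_model = 8
encoder_states = np.random.randn(5, d_model)  # source sequence (e.g., the French sentence), length 5
decoder_states = np.random.randn(3, d_model)  # target sequence (e.g., the English words generated so far), length 3

# Cross-attention: queries come from the decoder, keys/values from the encoder
cross_scores = decoder_states @ encoder_states.T  # (3, 5) - rectangular in general

# Self-attention: queries, keys, and values all come from the same sequence
X = encoder_states        # one sequence plays every role
self_scores = X @ X.T     # (5, 5) - always square

print(cross_scores.shape, self_scores.shape)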
When to Use Each
Regular Attention:
- Sequence-to-sequence tasks (translation, summarization)
- When you need to align two different sequences
- Encoder-decoder architectures
- Tasks requiring cross-sequence information flow
Self-Attention:
- Understanding relationships within a single sequence
- Creating contextualized word embeddings
- Encoder-only models (BERT) or decoder-only models (GPT)
- Tasks requiring intra-sequence understanding
Computational Differences
Regular Attention:
- Q shape: (target_len, d_k)
- K, V shape: (source_len, d_k)
- Attention matrix: (target_len, source_len)
- Complexity: O(target_len × source_len)
Self-Attention:
- Q, K, V shape: (seq_len, d_k) - all same length!
- Attention matrix: (seq_len, seq_len) - square matrix
- Complexity: O(seq_len²)
- Can process all positions in parallel
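A quick sketch (arbitrary dimensions, random data) that makes the quadratic growth of the self-attention matrix visible:

import numpy as np

d_k = 64
for seq_len in (128, 512, 2048):
    Q = np.random.randn(seq_len, d_k)
    K = np.random.randn(seq_len, d_k)
    scores = Q @ K.T                            # (seq_len, seq_len) attention matrix
    print(seq_len, scores.shape, scores.size)   # number of entries grows as seq_len**2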
How Self-Attention Computes
Self-Attention Formula
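In compact form, scaled dot-product self-attention is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,  where Q = XW_Q, K = XW_K, V = XW_V

The steps below unpack this one operation at a time.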
Step-by-Step Process:
- Input: Sequence X (n Γ d_model)
- Linear Projections: Create Q, K, V from X using learned weight matrices
- Attention Scores: Compute QK^T (similarity between all positions)
- Scale: Divide by √d_k to prevent large values
- Softmax: Convert to probabilities (attention weights)
- Weighted Sum: Multiply attention weights by V
- Output: New representation for each position
Detailed Example: "The cat sat" with Attention Flow
Input sequence (3 words): ["The", "cat", "sat"]
Self-Attention Flow (summary): input X → linear projections to Q, K, V (3 queries, 3 keys, 3 values) → Query("cat") scored against every Key (high similarity with Key("cat") → strong attention) → softmax → weighted sum of the Values → contextualized Output("cat").
Step-by-Step Breakdown:
- Create Q, K, V: X = [embedding("The"), embedding("cat"), embedding("sat")] → Q, K, V (3 queries, 3 keys, 3 values)
- Compute Attention Scores: For "cat" (position 1), Query("cat") is compared to all Keys → illustrative scores: [0.1, 0.9, 0.3]
- Apply Softmax: Attention weights (illustrative, rounded): [0.15, 0.70, 0.15] - "cat" attends most to itself and a little to each neighbor
- Weighted Sum: Output("cat") = 0.15×V("The") + 0.70×V("cat") + 0.15×V("sat") - the new representation combines information from all positions
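A tiny numeric sketch of the final weighted-sum step, using the illustrative weights above and made-up 2-dimensional value vectors:

import numpy as np

# Illustrative attention weights for "cat" from the walkthrough above
weights = np.array([0.15, 0.70, 0.15])

# Made-up value vectors for "The", "cat", "sat" (hypothetical numbers)
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

output_cat = weights @ V   # 0.15*V("The") + 0.70*V("cat") + 0.15*V("sat")
print(output_cat)          # [0.3  0.85] - dominated by V("cat")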
Understanding Attention Weights
What Attention Weights Reveal
Attention weights show which positions are most relevant for understanding each position.
Example: Pronoun Resolution with Visual Attention Map
Sentence: "The cat sat on the mat because it was tired."
Attention Weight Visualization
Attention weights for the word "it":
- The: 0.05
- cat: 0.70
- sat: 0.03
- on: 0.02
- the: 0.02
- mat: 0.02
- it: (self)
- was: 0.03
- tired: 0.20
Interpretation: "it" attends most strongly to "cat" (0.70) - the model correctly identifies the referent. It also attends to "tired" (0.20) as a related concept.
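Given any implementation that returns the full (seq_len, seq_len) weight matrix (such as the one at the end of this chapter), a single row gives exactly this kind of per-word breakdown. Here is a sketch using a randomly generated stand-in matrix rather than a trained model:

import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "tired"]

# Stand-in for the (seq_len, seq_len) matrix a self-attention layer would return;
# random numbers here, normalized so each row sums to 1 like real attention weights
attention_weights = np.random.rand(len(tokens), len(tokens))
attention_weights /= attention_weights.sum(axis=-1, keepdims=True)

row = attention_weights[tokens.index("it")]   # one row = what "it" attends to
for token, w in zip(tokens, row):
    print(f"{token:>8}  {w:.2f}")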
Parallel Processing Advantage
Why Self-Attention is Fast
Self-attention can process all positions simultaneously, unlike RNNs which must process sequentially.
RNN vs Self-Attention
RNN (Sequential):
- Process word 1 → wait → process word 2 → wait → process word 3
- Time complexity: O(n) sequential steps
- Cannot parallelize across sequence
Self-Attention (Parallel):
- Process all words simultaneously
- Time complexity: O(1) sequential steps (though O(n²) total operations)
- Can use GPU parallelism effectively
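A rough sketch of the difference, using toy computations rather than real RNN or Transformer layers: the recurrent loop has a data dependency between steps, while all pairwise attention scores come out of one matrix multiplication.

import numpy as np

seq_len, d = 1000, 64
X = np.random.randn(seq_len, d)
W = np.random.randn(d, d) * 0.01

# RNN-style recurrence: step t needs the hidden state from step t-1, so the loop is sequential
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(X[t] @ W + h)      # toy update rule, for illustration only

# Self-attention-style: every pairwise score comes from one matrix multiplication
scores = (X @ X.T) / np.sqrt(d)    # (seq_len, seq_len), no dependency between positions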
Self-Attention Implementation
Complete Self-Attention Implementation
import numpy as np

def self_attention(X, W_Q, W_K, W_V, d_k):
    """
    Self-attention mechanism.

    Parameters:
        X: Input sequence (n, d_model)
        W_Q: Query weight matrix (d_model, d_k)
        W_K: Key weight matrix (d_model, d_k)
        W_V: Value weight matrix (d_model, d_v)
        d_k: Dimension of keys/queries
    """
    n = X.shape[0]  # sequence length

    # Step 1: Create Q, K, V from input
    Q = np.dot(X, W_Q)  # (n, d_k)
    K = np.dot(X, W_K)  # (n, d_k)
    V = np.dot(X, W_V)  # (n, d_v)

    # Step 2: Compute attention scores
    scores = np.dot(Q, K.T)  # (n, n)

    # Step 3: Scale
    scores = scores / np.sqrt(d_k)

    # Step 4: Apply softmax
    attention_weights = softmax(scores, axis=-1)  # (n, n)

    # Step 5: Weighted sum of values
    output = np.dot(attention_weights, V)  # (n, d_v)

    return output, attention_weights

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Example usage
n, d_model, d_k, d_v = 10, 512, 64, 64
X = np.random.randn(n, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_v) * 0.1

output, weights = self_attention(X, W_Q, W_K, W_V, d_k)
print(f"Output shape: {output.shape}")              # (10, 64)
print(f"Attention weights shape: {weights.shape}")  # (10, 10)