Chapter 2: Self-Attention Mechanism

"Attention Is All You Need" - How self-attention enables parallel processing

Learning Objectives

  • Understand the difference between attention and self-attention
  • Master how self-attention computes relationships within a sequence
  • Learn the scaled dot-product attention formula in detail
  • Understand how attention weights reveal relationships
  • Recognize why self-attention enables parallel processing
  • Implement self-attention from scratch

What is Self-Attention?

Attention Within a Sequence

Self-attention is attention applied to the same sequence - each position attends to all positions in the same sequence, including itself.

Key difference from regular attention:

  • Regular Attention: Query from one sequence, Keys/Values from another sequence (encoder-decoder)
  • Self-Attention: Query, Key, and Value all come from the same sequence

Think of self-attention like a group discussion:

  • Each person (word) can look at and consider what everyone else (all words) is saying
  • Each person forms their understanding based on the full context
  • This happens simultaneously for everyone - parallel processing!

πŸ“š Example: Understanding "it" in Context

Sentence: "The cat sat on the mat because it was tired."

Self-Attention Process:
  • When processing "it", self-attention looks at ALL words in the sentence
  • It computes: "How relevant is each word for understanding 'it'?"
  • High attention weight to "cat" β†’ "it" refers to "cat"
  • Lower attention weights to other words
  • Result: "it" gets a representation that includes information about "cat"

Self-Attention vs Regular Attention

πŸ” The Key Distinction

Regular Attention (Encoder-Decoder)

  • Source: Encoder hidden states (e.g., French sentence)
  • Target: Decoder hidden state (e.g., English word being generated)
  • Query: From decoder (what am I looking for?)
  • Keys/Values: From encoder (what information is available?)
  • Use case: Translation, summarization

Self-Attention (Within Sequence)

  • Source: Same sequence (e.g., English sentence)
  • Target: Same sequence positions
  • Query, Key, Value: All from the same sequence
  • Use case: Understanding relationships within text, encoding context

Detailed Comparison

Architectural Differences

Regular Attention (Cross-Attention):

  • Two sequences: Source sequence and target sequence
  • Information flow: From source to target
  • Query source: Target sequence (decoder)
  • Key/Value source: Source sequence (encoder)
  • Purpose: Allow target to access source information
  • Example: When translating, decoder queries encoder states to find relevant source words

Self-Attention:

  • One sequence: Same sequence for all components
  • Information flow: Within the sequence itself
  • Query, Key, Value: All from the same sequence
  • Purpose: Allow each position to access all other positions
  • Example: In "The cat sat", "cat" can attend to "The" and "sat" to understand its role
When to Use Each

Regular Attention:

  • Sequence-to-sequence tasks (translation, summarization)
  • When you need to align two different sequences
  • Encoder-decoder architectures
  • Tasks requiring cross-sequence information flow

Self-Attention:

  • Understanding relationships within a single sequence
  • Creating contextualized word embeddings
  • Encoder-only models (BERT) or decoder-only models (GPT)
  • Tasks requiring intra-sequence understanding

Computational Differences

Regular Attention:

  • Q shape: (target_len, d_k)
  • K, V shape: (source_len, d_k)
  • Attention matrix: (target_len, source_len)
  • Complexity: O(target_len Γ— source_len)

Self-Attention:

  • Q, K, V shape: (seq_len, d_k) - all same length!
  • Attention matrix: (seq_len, seq_len) - square matrix
  • Complexity: O(seq_lenΒ²)
  • Can process all positions in parallel (see the shape comparison sketch below)
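
To make the shape difference concrete, here is a small NumPy sketch comparing the attention matrices produced by cross-attention and self-attention. The sequence lengths and d_k below are arbitrary values chosen for illustration.

import numpy as np

d_k = 8
source_len, target_len, seq_len = 12, 5, 10   # arbitrary lengths for illustration

# Cross-attention: queries come from the target, keys from the source
Q_cross = np.random.randn(target_len, d_k)
K_cross = np.random.randn(source_len, d_k)
print((Q_cross @ K_cross.T).shape)  # (5, 12) - rectangular: (target_len, source_len)

# Self-attention: queries and keys come from the same sequence
Q_self = np.random.randn(seq_len, d_k)
K_self = np.random.randn(seq_len, d_k)
print((Q_self @ K_self.T).shape)    # (10, 10) - square: (seq_len, seq_len)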

How Self-Attention Computes

Self-Attention Formula

\[\text{SelfAttention}(X) = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

\[\text{where } Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Step-by-Step Process:
  1. Input: Sequence X (n Γ— d_model)
  2. Linear Projections: Create Q, K, V from X using learned weight matrices
  3. Attention Scores: Compute QK^T (similarity between all positions)
  4. Scale: Divide by √d_k to prevent large values
  5. Softmax: Convert to probabilities (attention weights)
  6. Weighted Sum: Multiply attention weights by V
  7. Output: New representation for each position (a shape walk-through sketch follows this list)
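
As a quick check on these steps, the sketch below traces the shapes at each stage for a toy input; the dimensions are made up for illustration, and a full implementation appears later in this chapter.

import numpy as np

n, d_model, d_k = 3, 4, 2                          # toy sizes, chosen for illustration
X = np.random.randn(n, d_model)                    # Step 1: input sequence (3, 4)
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # Step 2: projections, each (3, 2)
scores = Q @ K.T / np.sqrt(d_k)                    # Steps 3-4: scaled scores (3, 3)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # Step 5: softmax, rows sum to 1
output = weights @ V                               # Steps 6-7: weighted sum (3, 2)
print(Q.shape, scores.shape, weights.shape, output.shape)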

Detailed Example: "The cat sat" with Visual Attention Flow

Input sequence (3 words): ["The", "cat", "sat"]

πŸ”„ Self-Attention Flow Diagram: Input X β†’ Q, K, V β†’ attention scores for "cat" β†’ softmax β†’ weighted sum β†’ contextualized Output("cat"). The steps are broken down below.

Step-by-Step Breakdown:
  1. Create Q, K, V: X = [embedding("The"), embedding("cat"), embedding("sat")] β†’ Q, K, V (3 queries, 3 keys, 3 values)
  2. Compute Attention Scores: For "cat" (position 1), Query("cat") is compared to all Keys β†’ scores: [0.1, 0.9, 0.3] (illustrative values)
  3. Apply Softmax: Attention weights: [0.15, 0.70, 0.15] (illustrative) - "cat" attends most to itself and somewhat to its neighbors
  4. Weighted Sum: Output("cat") = 0.15Γ—V("The") + 0.70Γ—V("cat") + 0.15Γ—V("sat") - the new representation combines information from all positions (a numeric sketch follows this list)
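
To see what that weighted sum does numerically, here is a tiny sketch using made-up 2-dimensional value vectors together with the illustrative weights above (neither comes from a real model).

import numpy as np

# Hypothetical 2-dimensional value vectors for "The", "cat", "sat"
V = np.array([[1.0, 0.0],    # V("The")
              [0.0, 1.0],    # V("cat")
              [0.5, 0.5]])   # V("sat")

weights_cat = np.array([0.15, 0.70, 0.15])  # illustrative attention weights for "cat"
output_cat = weights_cat @ V                # 0.15*V("The") + 0.70*V("cat") + 0.15*V("sat")
print(output_cat)                           # [0.225 0.775] - dominated by V("cat")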

Understanding Attention Weights

What Attention Weights Reveal

Attention weights show which positions are most relevant for understanding each position.

Example: Pronoun Resolution with Visual Attention Map

Sentence: "The cat sat on the mat because it was tired."

🎯 Attention Weight Visualization

Attention weights for the word "it" (position 6):

  • The: 0.05
  • cat: 0.70
  • sat: 0.03
  • on: 0.02
  • the: 0.02
  • mat: 0.02
  • it: (self)
  • was: 0.03
  • tired: 0.20

πŸ’‘ Interpretation: "it" attends most strongly to "cat" (0.70) - the model correctly identifies the referent! It also attends to "tired" (0.20) as a related concept. (The weights above are illustrative; a plotting sketch for inspecting real attention matrices follows below.)
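
If you want to inspect weights like these for a real model, a common approach (see also Question 10 below) is a heatmap with query positions as rows and key positions as columns. A minimal matplotlib sketch, assuming you already have an (n, n) attention_weights matrix and a matching tokens list:

import matplotlib.pyplot as plt

def plot_attention(attention_weights, tokens):
    """Heatmap of an (n, n) attention matrix: rows = query positions, columns = key positions."""
    fig, ax = plt.subplots()
    im = ax.imshow(attention_weights, cmap="viridis")  # brighter cells = stronger attention
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel("Key position (attended to)")
    ax.set_ylabel("Query position (attending)")
    fig.colorbar(im, ax=ax, label="Attention weight")
    plt.show()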

Parallel Processing Advantage

Why Self-Attention is Fast

Self-attention can process all positions simultaneously, unlike RNNs which must process sequentially.

RNN vs Self-Attention

RNN (Sequential):

  • Process word 1 β†’ wait β†’ process word 2 β†’ wait β†’ process word 3
  • Time complexity: O(n) sequential steps
  • Cannot parallelize across sequence

Self-Attention (Parallel):

  • Process all words simultaneously
  • Time complexity: O(1) sequential steps - all positions are computed at once (though total work is still O(nΒ²))
  • Can use GPU parallelism effectively (see the sketch below)
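
The contrast is easy to see in code: a recurrent update is a Python-level loop in which each step needs the previous hidden state, whereas self-attention scores for every pair of positions come out of a single matrix product. The recurrence below is a simplified stand-in for an RNN cell, with made-up sizes:

import numpy as np

n, d = 6, 4                          # toy sequence length and hidden size
X = np.random.randn(n, d)
W_h = np.random.randn(d, d) * 0.1

# RNN-style: inherently sequential - step t cannot start until step t-1 has finished
h = np.zeros(d)
hidden_states = []
for t in range(n):
    h = np.tanh(X[t] + h @ W_h)
    hidden_states.append(h)

# Self-attention scores: one matrix product covers every pair of positions at once
scores = X @ X.T / np.sqrt(d)        # (n, n); here Q = K = X for simplicity
print(len(hidden_states), scores.shape)  # 6 (6, 6)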

Self-Attention Implementation

Complete Self-Attention Implementation

import numpy as np

def self_attention(X, W_Q, W_K, W_V, d_k):
    """
    Self-attention mechanism
    
    Parameters:
    X: Input sequence (n, d_model)
    W_Q: Query weight matrix (d_model, d_k)
    W_K: Key weight matrix (d_model, d_k)
    W_V: Value weight matrix (d_model, d_v)
    d_k: Dimension of keys/queries (used for scaling)

    Returns:
    output: Contextualized representations, shape (n, d_v)
    attention_weights: Attention weight matrix, shape (n, n); each row sums to 1
    """
    # Step 1: Create Q, K, V from input
    Q = np.dot(X, W_Q)  # (n, d_k)
    K = np.dot(X, W_K)  # (n, d_k)
    V = np.dot(X, W_V)  # (n, d_v)
    
    # Step 2: Compute attention scores
    scores = np.dot(Q, K.T)  # (n, n)
    
    # Step 3: Scale
    scores = scores / np.sqrt(d_k)
    
    # Step 4: Apply softmax
    attention_weights = softmax(scores, axis=-1)  # (n, n)
    
    # Step 5: Weighted sum of values
    output = np.dot(attention_weights, V)  # (n, d_v)
    
    return output, attention_weights

def softmax(x, axis=-1):
    """Softmax function"""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Example usage
n, d_model, d_k, d_v = 10, 512, 64, 64
X = np.random.randn(n, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_v) * 0.1

output, weights = self_attention(X, W_Q, W_K, W_V, d_k)
print(f"Output shape: {output.shape}")  # (10, 64)
print(f"Attention weights shape: {weights.shape}")  # (10, 10)

Test Your Understanding

Question 1: What is the key difference between self-attention and regular attention?

A) Self-attention uses Q, K, V from the same sequence
B) Self-attention is faster
C) Self-attention uses fewer parameters
D) Self-attention doesn't use softmax

Question 2: Why can self-attention process all positions in parallel?

A) Each position can attend to all positions independently
B) It uses less memory
C) It doesn't need gradients
D) It uses smaller matrices

Question 3: How does self-attention create contextualized word embeddings?

A) Each word's representation becomes a weighted combination of all words in the sequence, where weights are determined by how relevant each word is to understanding the current word's meaning
B) It just copies the input
C) It averages all words
D) It uses only the first word

Question 4: What is the formula for self-attention?

A) \(SelfAttention(X) = softmax(\frac{XW_Q (XW_K)^T}{\sqrt{d_k}}) XW_V\) where X is input, W_Q, W_K, W_V are learned weight matrices
B) \(SelfAttention = X\)
C) \(SelfAttention = W \times X\)
D) \(SelfAttention = X + W\)

Question 5: Why does the word "bank" get different representations in different contexts with self-attention?

A) Self-attention allows "bank" to attend to different surrounding words (financial terms vs river terms), creating context-specific representations that capture the word's meaning in that particular sentence
B) It always has the same representation
C) It randomly changes
D) It depends on position only

Question 6: How do you compute Q, K, V matrices in self-attention?

A) Multiply input embeddings X by learned weight matrices: Q = XW_Q, K = XW_K, V = XW_V, where all three come from the same input sequence
B) Q, K, V are the same
C) They come from different sequences
D) They're random

Question 7: What does the attention weight matrix show?

A) A square matrix where each row shows how much each position attends to all other positions, revealing which words are most relevant for understanding each word
B) Only word frequencies
C) Only positions
D) Random values

Question 8: How is self-attention different from word embeddings like Word2Vec?

A) Word2Vec gives each word a fixed embedding regardless of context, while self-attention produces context-dependent embeddings that change based on surrounding words
B) They're the same
C) Word2Vec is context-dependent
D) Self-attention is fixed

Question 9: What is the computational cost of self-attention for a sequence of length n?

A) O(nΒ²) because we compute attention scores between every pair of positions, creating an nΓ—n attention matrix
B) O(n)
C) O(1)
D) O(log n)

Question 10: How would you visualize attention weights to understand what a model learned?

A) Create a heatmap where rows are query positions, columns are key positions, and color intensity shows attention weight. Darker colors indicate stronger attention, revealing which words the model considers most relevant
B) Just look at numbers
C) Plot only the maximum
D) Can't visualize

Question 11: Why is self-attention parallelizable while RNNs are sequential?

A) Self-attention computes all attention scores simultaneously using matrix operations, while RNNs must process tokens one at a time since each step depends on the previous hidden state
B) RNNs are also parallel
C) Self-attention is sequential
D) No difference

Question 12: How would you implement self-attention from scratch in code?

A) Initialize weight matrices W_Q, W_K, W_V. Compute Q=XW_Q, K=XW_K, V=XW_V. Compute scores=QK^T, scale by sqrt(d_k), apply softmax, multiply by V. Return weighted sum as output
B) Just return input
C) Use only Q
D) Random operations