Chapter 4: Positional Encoding

Adding Order Information

Learning Objectives

  • Understand why transformers need explicit position information
  • Master the sinusoidal positional encoding formulas
  • Learn how sinusoidal and learned positional encodings are implemented
  • Apply the ideas through worked, step-by-step examples
  • Recognize how real models (BERT, GPT, T5) encode position

Positional Encoding

Why Position Matters

Transformers have no inherent sense of word order - self-attention treats all positions equally. Positional encoding adds information about the position of each token in the sequence, allowing the model to understand order-dependent relationships.

Think of positional encoding like page numbers in a book:

  • Without page numbers: You know what's written, but not where it appears
  • With page numbers: You know both content and position
  • In transformers: Positional encoding tells the model "this word is at position 3"

📚 Why Order Matters

Example sentences with same words, different order:

  • "Dog bites man" vs "Man bites dog" - completely different meanings!
  • "The cat sat on the mat" vs "The mat sat on the cat" - nonsensical without order

Problem: Self-attention without positional encoding would treat these as identical!

Solution: Add positional encoding to preserve order information

Two Types of Positional Encoding

1. Sinusoidal (Fixed) Positional Encoding

Used in original Transformer paper:

  • Mathematical functions (sine and cosine) generate position encodings
  • No learnable parameters
  • Can extrapolate to longer sequences than seen during training
  • Like a mathematical pattern that encodes position

2. Learned Positional Embeddings

Used in BERT and many modern models:

  • Learnable parameters (like word embeddings)
  • Model learns optimal position representations
  • Fixed maximum sequence length
  • Like learning a lookup table for positions

Key Concepts

Why Positional Encoding is Necessary

Transformers have no inherent sense of word order:

Unlike RNNs or CNNs, transformers process all tokens in parallel. Without positional encoding, the model cannot distinguish between "cat sat on mat" and "mat sat on cat" - both would be identical to the model!

Positional encoding adds order information:

  • Each position gets a unique encoding vector
  • This encoding is added to the token embedding
  • The model learns to use this information to understand sequence order
  • Enables understanding of word order, syntax, and temporal relationships

Sinusoidal vs Learned Positional Encoding

Original Transformer used sinusoidal (fixed) encoding:

  • Mathematical functions (sin/cos) generate encodings
  • Deterministic - same position always gets same encoding
  • Can extrapolate to longer sequences than seen during training
  • No additional parameters to learn

Modern models often use learned positional embeddings:

  • Learnable parameters that are optimized during training
  • More flexible - can learn optimal position representations
  • Limited to maximum sequence length seen during training
  • Requires additional parameters

How Positional Encoding Works

The encoding is added (not concatenated) to token embeddings:

  • Token embedding: [0.2, -0.5, 0.8, ...] (512 dimensions)
  • Position encoding: [0.1, 0.3, -0.2, ...] (512 dimensions)
  • Final embedding: [0.3, -0.2, 0.6, ...] (element-wise addition)
  • This preserves the embedding space while adding position info
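
A minimal sketch of this addition, using NumPy and a toy 8-dimensional embedding (the vectors here are made-up stand-ins, not real model weights):

import numpy as np

token_embedding = np.array([0.2, -0.5, 0.8, 0.1, -0.3, 0.6, 0.0, 0.4])    # semantic content
position_encoding = np.array([0.1, 0.3, -0.2, 0.5, 0.0, -0.1, 0.2, 0.3])  # position info

final_embedding = token_embedding + position_encoding  # element-wise addition

print(final_embedding.shape)  # (8,) - dimension unchanged
print(final_embedding[:3])    # approximately [ 0.3 -0.2  0.6]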

Why Addition Instead of Concatenation?

Mathematical and practical reasons:

  • Dimension preservation: Addition keeps d_model constant, concatenation doubles it
  • Computational efficiency: Smaller matrices in subsequent operations
  • Information mixing: Model learns to separate token and position information
  • Empirical performance: Addition works as well or better than concatenation
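
A quick back-of-the-envelope check of the efficiency argument (a toy calculation, not tied to any specific model):

d_model = 512

# With addition, a d_model x d_model projection in the next layer keeps its size
params_addition = d_model * d_model
print(params_addition)        # 262144

# With concatenation, d_model doubles and the same projection quadruples in size
params_concatenation = (2 * d_model) * (2 * d_model)
print(params_concatenation)   # 1048576 (4x as many parameters)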

How the Model Separates Information

The model learns to distinguish token vs position information:

  • During training, the model sees many tokens at different positions
  • It learns: "This part of the embedding is about the word, that part is about position"
  • The learned transformations can separate and use both types of information
  • Like learning to read both the content and page number of a book simultaneously

Relative vs Absolute Positional Encoding

Two approaches to encoding position:

Absolute Positional Encoding (Original Transformer)

  • What it encodes: Exact position in sequence (0, 1, 2, ...)
  • Example: Position 5 always gets the same encoding
  • Advantage: Simple, direct
  • Limitation: Doesn't explicitly encode relative distances

Relative Positional Encoding (Modern Variants)

  • What it encodes: Distance between positions (i - j)
  • Example: Encodes "these two words are 3 positions apart"
  • Advantage: Better generalization, captures relative relationships
  • Implementation: Modifies attention scores directly
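
As a rough sketch of the relative approach (a simplified version of the learned biases used by models such as T5; the function name, bias table, and clipping window below are illustrative assumptions, not any model's actual API):

import numpy as np

def attention_scores_with_relative_bias(q, k, bias_table, max_distance=8):
    """
    q, k: (seq_len, d_k) query and key matrices
    bias_table: (2 * max_distance + 1,) learned scalar bias per clipped relative distance
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                      # standard scaled dot-product scores

    # Relative distance (i - j) for every query/key pair, clipped to a window
    positions = np.arange(seq_len)
    rel = positions[:, None] - positions[None, :]
    rel = np.clip(rel, -max_distance, max_distance) + max_distance  # shift to valid indices

    return scores + bias_table[rel]                      # add a bias per relative distance

# Toy usage
seq_len, d_k = 5, 16
q = np.random.randn(seq_len, d_k)
k = np.random.randn(seq_len, d_k)
bias_table = np.random.randn(2 * 8 + 1) * 0.02           # would be learned in a real model
print(attention_scores_with_relative_bias(q, k, bias_table).shape)  # (5, 5)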

Mathematical Formulations

Sinusoidal Positional Encoding

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]
Notation:
  • pos: Position in the sequence (0, 1, 2, ...)
  • i: Index of the sine/cosine pair (0, 1, 2, ..., d_model/2 - 1), so 2i and 2i+1 are the actual dimension indices
  • d_model: Embedding dimension (e.g., 512)
  • 2i: Even dimensions use sine
  • 2i+1: Odd dimensions use cosine
Why Sine and Cosine?
  • Creates unique patterns for each position
  • Relative positions can be computed: PE(pos+k) can be derived from PE(pos)
  • Allows model to understand relative distances
Understanding the Formula Components:
  • pos: The absolute position (0, 1, 2, ...)
  • 10000^{2i/d_model}: Creates different frequencies for different dimensions
  • Lower i (dimensions 0 and 1): Higher frequency, captures fine-grained position differences
  • Higher i (e.g., i = 255, i.e., dimensions 510 and 511 when d_model = 512): Lower frequency, captures coarse position differences
  • Sine and cosine: Provide complementary information, enable relative position computation
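
To see the effect of the 10000^{2i/d_model} term, this short sketch prints the angular frequency used by a few dimension pairs (d_model = 512, as in the examples in this chapter):

import numpy as np

d_model = 512
pair_indices = np.array([0, 1, 64, 128, 255])           # i in the formula
frequencies = 1.0 / (10000 ** (2 * pair_indices / d_model))
wavelengths = 2 * np.pi / frequencies                    # positions per full sine cycle

for i, freq, wl in zip(pair_indices, frequencies, wavelengths):
    print(f"i={i:3d}  frequency={freq:.6f}  wavelength≈{wl:,.0f} positions")

# i=0 completes a cycle roughly every 6 positions (fine-grained), while i=255
# takes tens of thousands of positions per cycle (coarse position information)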

Final Embedding

\[\text{Embedding}(token, pos) = \text{TokenEmbedding}(token) + PE(pos)\]
How It Works:
  • Token embedding: Semantic meaning of the word
  • Positional encoding: Position information
  • Addition: Combines both types of information
  • Result: Each token has both semantic and positional information
Mathematical Properties:
  • Element-wise addition: Each dimension adds independently
  • Preserves norms: Roughly maintains embedding magnitude
  • Learnable separation: Model learns to use both components
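
A quick numerical check of these properties (the token embedding below is a random unit-normal stand-in; real embeddings are learned):

import numpy as np

d_model = 512
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=d_model)               # stand-in for a learned embedding
position = 7
i = np.arange(0, d_model, 2)                             # even dimension indices
angles = position / (10000 ** (i / d_model))
pe = np.empty(d_model)
pe[0::2] = np.sin(angles)                                # even dims: sine
pe[1::2] = np.cos(angles)                                # odd dims: cosine

combined = token_embedding + pe                          # element-wise addition

print(np.linalg.norm(token_embedding))  # roughly sqrt(512) ≈ 22.6 for a unit-normal vector
print(np.linalg.norm(pe))               # exactly 16.0: each sin/cos pair contributes 1
print(np.linalg.norm(combined))         # same order of magnitude as the token embedding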

Relative Position Property

\[PE_{pos+k} = M_k \cdot PE_{pos}\]
Key Insight:

The sinusoidal encoding has a special property: for any fixed offset k, the encoding for position (pos + k) is a linear transformation M_k of the encoding for position pos, where M_k is built from sines and cosines of the offset via trigonometric identities. This allows the model to understand relative positions even if it hasn't seen that exact absolute position during training.

Mathematical Derivation:

Using trigonometric addition formulas:

  • sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
  • cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
  • This enables computing relative positions from absolute positions
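
The sketch below checks this numerically for a single frequency: rotating the (sin, cos) pair at position pos by the offset angle reproduces the pair at position pos + k. This is a toy verification of the identity, not part of any model:

import numpy as np

d_model, i = 512, 4                        # pick one sine/cosine pair (dims 2i, 2i+1)
omega = 1.0 / (10000 ** (2 * i / d_model))

def pe_pair(pos):
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

pos, k = 10, 3
rotation = np.array([[ np.cos(omega * k), np.sin(omega * k)],
                     [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(rotation @ pe_pair(pos), pe_pair(pos + k)))  # True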

Detailed Examples

Example: Positional Encoding for "The cat sat"

Input sequence: ["The", "cat", "sat"] (3 tokens)

Embedding dimension: 512

Step 1: Token Embeddings

  • "The" → [0.1, -0.3, 0.5, ..., 0.2] (512-dim vector)
  • "cat" → [0.4, 0.1, -0.2, ..., 0.3] (512-dim vector)
  • "sat" → [-0.1, 0.5, 0.3, ..., -0.1] (512-dim vector)

Step 2: Positional Encodings

  • Position 0: PE(0) = [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
  • Position 1: PE(1) = [sin(1), cos(1), sin(1/10000^{2/512}), cos(1/10000^{2/512}), ...] ≈ [0.841, 0.540, 0.822, 0.570, ...]
  • Position 2: PE(2) = [sin(2), cos(2), sin(2/10000^{2/512}), cos(2/10000^{2/512}), ...] ≈ [0.909, -0.416, 0.936, -0.351, ...]

Step 3: Add Encodings

  • "The" at pos 0: TokenEmbed("The") + PE(0)
  • "cat" at pos 1: TokenEmbed("cat") + PE(1)
  • "sat" at pos 2: TokenEmbed("sat") + PE(2)

Result: Each token now has position information embedded in its representation, allowing the model to understand word order.
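
Putting the three steps together in code (the token embeddings below are random placeholders; in a real model they come from a learned embedding table):

import numpy as np

d_model, seq_len = 512, 3                                # ["The", "cat", "sat"]
rng = np.random.default_rng(42)

token_embeddings = rng.normal(size=(seq_len, d_model))   # Step 1: (3, 512) stand-in embeddings

positions = np.arange(seq_len)[:, None]                  # Step 2: sinusoidal encodings
dims = np.arange(0, d_model, 2)[None, :]
angles = positions / (10000 ** (dims / d_model))
pe = np.empty((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

final_embeddings = token_embeddings + pe                 # Step 3: add position information
print(final_embeddings.shape)                            # (3, 512)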

Example: Why Addition (Not Concatenation)?

If we concatenated:

  • Token embedding: 512 dimensions
  • Position encoding: 512 dimensions
  • Result: 1024 dimensions (doubles size!)

With addition:

  • Token embedding: 512 dimensions
  • Position encoding: 512 dimensions
  • Result: 512 dimensions (same size, information combined)

Why it works: The model learns to separate token and position information during training. The addition creates a combined representation that preserves both types of information.
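
The difference in shapes is easy to see with NumPy (toy random vectors standing in for real embeddings):

import numpy as np

token_embedding = np.random.randn(512)
position_encoding = np.random.randn(512)

concatenated = np.concatenate([token_embedding, position_encoding])
added = token_embedding + position_encoding

print(concatenated.shape)  # (1024,) - every downstream weight matrix would have to grow
print(added.shape)         # (512,)  - d_model stays constant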

Implementation

Sinusoidal Positional Encoding Implementation

import numpy as np
import math

def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding
    
    Parameters:
    seq_len: Maximum sequence length
    d_model: Embedding dimension
    
    Returns:
    Positional encoding matrix (seq_len, d_model)
    """
    pe = np.zeros((seq_len, d_model))
    
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i runs over even dimension indices, so i == 2 * (pair index)
            # and i / d_model matches the paper's 2i / d_model exponent
            angle = pos / (10000 ** (i / d_model))
            
            # Even dimensions: sine
            pe[pos, i] = math.sin(angle)
            
            # Odd dimensions: cosine (same frequency as the paired sine)
            if i + 1 < d_model:
                pe[pos, i + 1] = math.cos(angle)
    
    return pe

# Example: Generate positional encoding for sequence length 10, dimension 512
seq_len = 10
d_model = 512
positional_encoding = get_positional_encoding(seq_len, d_model)

print(f"Positional encoding shape: {positional_encoding.shape}")  # (10, 512)
print(f"First position encoding (first 10 dims): {positional_encoding[0, :10]}")
print(f"Second position encoding (first 10 dims): {positional_encoding[1, :10]}")

# Visualize: Each position has a unique encoding pattern
# Position 0: [0.0, 1.0, 0.0, 1.0, ...]
# Position 1: [0.841, 0.540, 0.822, 0.570, ...]
# Position 2: [0.909, -0.416, 0.936, -0.351, ...]
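
The double loop above is easy to read but slow for long sequences. An equivalent vectorized version (a sketch that produces the same matrix, assuming the loop-based function above has been defined) looks like this:

import numpy as np

def get_positional_encoding_vectorized(seq_len, d_model):
    """Vectorized sinusoidal positional encoding, shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model // 2) even indices
    angles = positions / (10000 ** (dims / d_model))

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return pe

# Matches the loop-based implementation above
assert np.allclose(get_positional_encoding_vectorized(10, 512),
                   get_positional_encoding(10, 512))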

Learned Positional Embeddings (BERT-style)

import numpy as np

class LearnedPositionalEmbedding:
    """Learnable positional embeddings"""
    
    def __init__(self, max_seq_len, d_model):
        self.max_seq_len = max_seq_len
        self.d_model = d_model
        
        # Initialize positional embeddings (learnable parameters)
        self.position_embeddings = np.random.randn(max_seq_len, d_model) * 0.02
    
    def get_embeddings(self, seq_len):
        """Get positional embeddings for sequence"""
        if seq_len > self.max_seq_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max {self.max_seq_len}")
        
        return self.position_embeddings[:seq_len, :]
    
    def forward(self, token_embeddings, positions=None):
        """
        Add positional encoding to token embeddings
        
        Parameters:
        token_embeddings: (batch, seq_len, d_model)
        positions: Optional position indices
        """
        batch_size, seq_len, d_model = token_embeddings.shape
        
        if positions is None:
            # Use sequential positions
            pos_emb = self.get_embeddings(seq_len)
        else:
            # Use provided positions
            pos_emb = self.position_embeddings[positions, :]
        
        # Add positional encoding
        return token_embeddings + pos_emb

# Example usage
max_seq_len, d_model = 512, 768
pos_embedding = LearnedPositionalEmbedding(max_seq_len, d_model)

# Token embeddings (batch=2, seq_len=10, d_model=768)
token_emb = np.random.randn(2, 10, d_model)

# Add positional encoding
output = pos_embedding.forward(token_emb)
print(f"Output shape: {output.shape}")  # (2, 10, 768)

Real-World Applications

Positional Encoding in Modern Models

All transformer-based models use positional encoding:

  • BERT: Uses learned absolute positional embeddings (max 512 tokens)
  • GPT: Uses learned absolute positional embeddings (maximum context length varies by model version)
  • Original Transformer: Used sinusoidal encoding (can in principle handle any length)
  • T5: Uses relative position biases added to attention scores rather than absolute position embeddings

Handling Variable Sequence Lengths

Positional encoding enables:

  • Processing sequences of different lengths
  • Understanding relative positions (near vs far)
  • Capturing temporal order in time-series data
  • Maintaining word order in translation tasks

Limitations and Solutions

Learned positional embeddings:

  • Limited to maximum sequence length seen during training
  • Cannot extrapolate to longer sequences
  • Solution: Use relative positional encoding or extend embeddings

Sinusoidal encoding:

  • Can handle any sequence length
  • But may not be optimal for specific tasks
  • Solution: Often replaced with learned embeddings for better performance
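
One common way to "extend embeddings" is to interpolate the learned position table to a longer length, usually followed by fine-tuning. The sketch below uses simple linear interpolation; the function name and shapes are illustrative assumptions, not a specific library's API:

import numpy as np

def extend_position_embeddings(pos_emb, new_max_len):
    """Linearly interpolate a (old_max_len, d_model) table to (new_max_len, d_model)."""
    old_max_len, d_model = pos_emb.shape
    old_positions = np.linspace(0, 1, old_max_len)
    new_positions = np.linspace(0, 1, new_max_len)

    # Interpolate each embedding dimension independently
    extended = np.stack(
        [np.interp(new_positions, old_positions, pos_emb[:, d]) for d in range(d_model)],
        axis=1,
    )
    return extended

# Toy usage: stretch a 512-position table to 1024 positions
extended = extend_position_embeddings(np.random.randn(512, 768) * 0.02, 1024)
print(extended.shape)  # (1024, 768)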

Test Your Understanding

Question 1: Why do transformers need positional encoding?

A) Attention mechanisms are permutation-invariant and don't naturally understand word order, so positional encoding adds information about token positions in the sequence
B) To make computation faster
C) To reduce memory
D) It's not needed

Question 2: What is the formula for sinusoidal positional encoding?

A) \(PE_{(pos, 2i)} = sin(pos / 10000^{2i/d_{model}})\) and \(PE_{(pos, 2i+1)} = cos(pos / 10000^{2i/d_{model}})\) where pos is position and i is dimension
B) \(PE = pos\)
C) \(PE = sin(pos)\)
D) \(PE = random\)

Question 3: How do you add positional encoding to token embeddings?

A) Element-wise addition: final_embedding = token_embedding + positional_encoding, where both have the same dimension
B) Concatenate them
C) Multiply them
D) Use only positional encoding

Question 4: Why use sinusoidal encoding instead of learned positional embeddings?

A) Sinusoidal encoding can extrapolate to longer sequences than seen during training, while learned embeddings are fixed to training sequence length. However, learned embeddings often work better in practice
B) Sinusoidal is always better
C) Learned is always better
D) They're the same

Question 5: What happens if you don't use positional encoding in a transformer?

A) The model treats "cat sat mat" and "mat sat cat" as identical, losing all word order information, which is crucial for understanding language
B) Nothing changes
C) It works better
D) It becomes faster

Question 6: How does positional encoding help with relative positions?

A) The sinusoidal patterns create unique encodings for each position, and the model can learn to compute relative positions from these encodings through attention mechanisms
B) It doesn't help
C) Only absolute positions
D) Random positions

Question 7: What is the difference between absolute and relative positional encoding?

A) Absolute encoding adds position information directly to embeddings, while relative encoding modifies attention scores to encode distances between positions. Relative encoding can be more flexible
B) They're the same
C) Absolute is always better
D) Relative is always better

Question 8: How would you implement positional encoding from scratch?

A) For each position pos and dimension i, compute sin and cos values using the formula, create a matrix of shape (max_len, d_model), then add this to token embeddings during forward pass
B) Just use random values
C) Use only position numbers
D) No implementation needed

Question 9: Why do different frequency components in sinusoidal encoding matter?

A) Different frequencies (via the 10000^{2i/d} term) create unique patterns for each position. Lower frequencies capture coarse position, higher frequencies capture fine-grained position differences
B) They don't matter
C) Only one frequency needed
D) Random frequencies

Question 10: What happens to positional encoding in different transformer architectures?

A) BERT uses learned positional embeddings, GPT uses learned, original Transformer used sinusoidal. Some newer models use relative positional encoding or rotary position encoding (RoPE)
B) All use the same
C) None use it
D) Only sinusoidal

Question 11: How does positional encoding scale to very long sequences?

A) Sinusoidal encoding can theoretically handle any length, but in practice models are trained on fixed max lengths. For longer sequences, you may need to extend learned embeddings or use relative encoding
B) It doesn't scale
C) Always works perfectly
D) Only for short sequences

Question 12: How would you debug issues related to positional encoding?

A) Verify encoding values are in reasonable range, check that encoding is actually added to embeddings, test if model performance degrades when positions are shuffled, visualize positional encoding patterns, ensure encoding dimension matches embedding dimension
B) Just ignore it
C) Remove it
D) Use random values