Chapter 4: Positional Encoding
Adding Order Information
Learning Objectives
- Understand positional encoding fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Positional Encoding
Why Position Matters
Transformers have no inherent sense of word order - self-attention is permutation-invariant, so it treats its input as an unordered set of tokens. Positional encoding adds information about the position of each token in the sequence, allowing the model to understand order-dependent relationships.
Think of positional encoding like page numbers in a book:
- Without page numbers: You know what's written, but not where it appears
- With page numbers: You know both content and position
- In transformers: Positional encoding tells the model "this word is at position 3"
📚 Why Order Matters
Example sentences with same words, different order:
- "Dog bites man" vs "Man bites dog" - completely different meanings!
- "The cat sat on the mat" vs "The mat sat on the cat" - nonsensical without order
Problem: Self-attention without positional encoding would treat these as identical!
Solution: Add positional encoding to preserve order information
Two Types of Positional Encoding
1. Sinusoidal (Fixed) Positional Encoding
Used in original Transformer paper:
- Mathematical functions (sine and cosine) generate position encodings
- No learnable parameters
- Can extrapolate to longer sequences than seen during training
- Like a mathematical pattern that encodes position
2. Learned Positional Embeddings
Used in BERT and many modern models:
- Learnable parameters (like word embeddings)
- Model learns optimal position representations
- Fixed maximum sequence length
- Like learning a lookup table for positions
Key Concepts
Why Positional Encoding is Necessary
Transformers have no inherent sense of word order:
Unlike RNNs or CNNs, transformers process all tokens in parallel. Without positional encoding, the model cannot distinguish between "cat sat on mat" and "mat sat on cat" - both would be identical to the model!
Positional encoding adds order information:
- Each position gets a unique encoding vector
- This encoding is added to the token embedding
- The model learns to use this information to understand sequence order
- Enables understanding of word order, syntax, and temporal relationships
Sinusoidal vs Learned Positional Encoding
Original Transformer used sinusoidal (fixed) encoding:
- Mathematical functions (sin/cos) generate encodings
- Deterministic - same position always gets same encoding
- Can extrapolate to longer sequences than seen during training
- No additional parameters to learn
Modern models often use learned positional embeddings:
- Learnable parameters that are optimized during training
- More flexible - can learn optimal position representations
- Limited to maximum sequence length seen during training
- Requires additional parameters
How Positional Encoding Works
The encoding is added (not concatenated) to token embeddings:
- Token embedding: [0.2, -0.5, 0.8, ...] (512 dimensions)
- Position encoding: [0.1, 0.3, -0.2, ...] (512 dimensions)
- Final embedding: [0.3, -0.2, 0.6, ...] (element-wise addition)
- This preserves the embedding space while adding position info (see the short sketch below)
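A minimal NumPy sketch of the numbers above (only the first three of the 512 dimensions are shown, and the specific values are illustrative, not from a real model):

import numpy as np

token_embedding = np.array([0.2, -0.5, 0.8])    # first dims of a token embedding
position_encoding = np.array([0.1, 0.3, -0.2])  # first dims of the positional encoding
final_embedding = token_embedding + position_encoding  # element-wise addition

print(final_embedding)        # approximately [0.3, -0.2, 0.6]
print(final_embedding.shape)  # (3,) - same dimensionality as the inputs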
Why Addition Instead of Concatenation?
Mathematical and practical reasons:
- Dimension preservation: Addition keeps d_model constant, concatenation doubles it
- Computational efficiency: Smaller matrices in subsequent operations
- Information mixing: Model learns to separate token and position information
- Empirical performance: Addition works as well or better than concatenation
How the Model Separates Information
The model learns to distinguish token vs position information:
- During training, the model sees many tokens at different positions
- It learns: "This part of the embedding is about the word, that part is about position"
- The learned transformations can separate and use both types of information
- Like learning to read both the content and page number of a book simultaneously
Relative vs Absolute Positional Encoding
Two approaches to encoding position:
Absolute Positional Encoding (Original Transformer)
- What it encodes: Exact position in sequence (0, 1, 2, ...)
- Example: Position 5 always gets the same encoding
- Advantage: Simple, direct
- Limitation: Doesn't explicitly encode relative distances
Relative Positional Encoding (Modern Variants)
- What it encodes: Distance between positions (i - j)
- Example: Encodes "these two words are 3 positions apart"
- Advantage: Better generalization, captures relative relationships
- Implementation: Modifies attention scores directly (a simplified sketch follows below)
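As a rough sketch of the "modifies attention scores directly" idea, the snippet below adds a bias indexed by the clipped offset i - j to the raw attention logits. The bias table rel_bias and the clipping range are assumptions made for this toy example; it illustrates the general mechanism rather than the exact scheme of any particular model:

import numpy as np

seq_len, max_rel = 5, 4
# One (notionally learnable) bias per clipped relative offset in [-max_rel, +max_rel]
rel_bias = np.random.randn(2 * max_rel + 1) * 0.1

scores = np.random.randn(seq_len, seq_len)  # raw attention logits (query x key)
for i in range(seq_len):
    for j in range(seq_len):
        offset = np.clip(i - j, -max_rel, max_rel)
        scores[i, j] += rel_bias[offset + max_rel]  # bias depends only on the distance i - j
# a softmax over the key dimension would follow, exactly as in ordinary attention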
Mathematical Formulations
Sinusoidal Positional Encoding
Notation:
- pos: Position in the sequence (0, 1, 2, ...)
- i: Dimension index (0, 1, 2, ..., d_model/2 - 1)
- d_model: Embedding dimension (e.g., 512)
- 2i: Even dimensions use sine
- 2i+1: Odd dimensions use cosine
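For reference, these are the formulas from the original Transformer paper that the notation above describes:
- PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
- PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})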
Why Sine and Cosine?
- Creates unique patterns for each position
- Relative positions can be computed: PE(pos+k) can be derived from PE(pos)
- Allows model to understand relative distances
Understanding the Formula Components:
- pos: The absolute position (0, 1, 2, ...)
- 10000^{2i/d_model}: Creates different frequencies for different dimensions
- Lower i (e.g., i = 0, dimensions 0 and 1): Higher frequency, captures fine-grained position differences
- Higher i (e.g., i = 255, dimensions 510 and 511 when d_model = 512): Lower frequency, captures coarse position differences (the short script after this list illustrates the range of wavelengths)
- Sine and cosine: Provide complementary information, enable relative position computation
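A short script showing how the wavelength of each sine/cosine pair grows with the dimension-pair index i (values are for d_model = 512; the chosen indices are just examples):

import math

d_model = 512
for i in [0, 1, 64, 255]:  # a few dimension-pair indices
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"i = {i:3d}: wavelength ~ {wavelength:,.1f} positions")
# i = 0 repeats every ~6.3 positions (fine-grained); i = 255 takes ~60,000 positions (coarse)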
Final Embedding
How It Works:
- Token embedding: Semantic meaning of the word
- Positional encoding: Position information
- Addition: Combines both types of information
- Result: Each token has both semantic and positional information (written as a formula below)
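In symbols, the input to the first transformer layer for the token at position pos is simply:

x_pos = TokenEmbedding(token at pos) + PE(pos)

where both terms are d_model-dimensional vectors.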
Mathematical Properties:
- Element-wise addition: Each dimension adds independently
- Preserves norms: Roughly maintains embedding magnitude
- Learnable separation: Model learns to use both components
Relative Position Property
Key Insight:
The sinusoidal encoding has a special property: the encoding for position (pos + k) can be computed from the encodings for position pos and position k using trigonometric identities. This allows the model to understand relative positions even if it hasn't seen that exact absolute position during training.
Mathematical Derivation:
Using trigonometric addition formulas:
- sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
- cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
- This enables computing relative positions from absolute positions (spelled out below)
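Concretely, writing ω_i = 1 / 10000^{2i/d_model} for the frequency of dimension pair i, the identities above give:
- sin(ω_i(pos + k)) = cos(ω_i k) · sin(ω_i pos) + sin(ω_i k) · cos(ω_i pos)
- cos(ω_i(pos + k)) = -sin(ω_i k) · sin(ω_i pos) + cos(ω_i k) · cos(ω_i pos)
Because the coefficients depend only on the offset k, PE(pos + k) is a fixed linear transformation (a rotation of each sine/cosine pair) of PE(pos).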
Detailed Examples
Example: Positional Encoding for "The cat sat"
Input sequence: ["The", "cat", "sat"] (3 tokens)
Embedding dimension: 512
Step 1: Token Embeddings
- "The" → [0.1, -0.3, 0.5, ..., 0.2] (512-dim vector)
- "cat" → [0.4, 0.1, -0.2, ..., 0.3] (512-dim vector)
- "sat" → [-0.1, 0.5, 0.3, ..., -0.1] (512-dim vector)
Step 2: Positional Encodings
- Position 0: PE(0) = [sin(0), cos(0), sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
- Position 1: PE(1) = [sin(1), cos(1), sin(1/10000^{2/512}), cos(1/10000^{2/512}), ...] ≈ [0.841, 0.540, 0.822, 0.569, ...]
- Position 2: PE(2) = [sin(2), cos(2), sin(2/10000^{2/512}), cos(2/10000^{2/512}), ...] ≈ [0.909, -0.416, 0.937, -0.350, ...]
Step 3: Add Encodings
- "The" at pos 0: TokenEmbed("The") + PE(0)
- "cat" at pos 1: TokenEmbed("cat") + PE(1)
- "sat" at pos 2: TokenEmbed("sat") + PE(2)
Result: Each token now has position information embedded in its representation, allowing the model to understand word order.
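A minimal NumPy sketch of these three steps. The vocabulary and embedding values are made up for illustration, and the call uses the get_positional_encoding function defined in the Implementation section below:

import numpy as np

d_model = 512
vocab = {"The": 0, "cat": 1, "sat": 2}  # toy vocabulary, illustrative only
embedding_table = np.random.randn(len(vocab), d_model) * 0.02  # stands in for a trained embedding table

tokens = ["The", "cat", "sat"]
token_embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])  # (3, 512)
pe = get_positional_encoding(seq_len=len(tokens), d_model=d_model)        # (3, 512), defined below

model_input = token_embeddings + pe  # each row now carries both content and position
print(model_input.shape)             # (3, 512)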
Example: Why Addition (Not Concatenation)?
If we concatenated:
- Token embedding: 512 dimensions
- Position encoding: 512 dimensions
- Result: 1024 dimensions (doubles size!)
With addition:
- Token embedding: 512 dimensions
- Position encoding: 512 dimensions
- Result: 512 dimensions (same size, information combined)
Why it works: The model learns to separate token and position information during training. The addition creates a combined representation that preserves both types of information.
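A quick shape check of the two options (toy tensors, purely illustrative):

import numpy as np

token_emb = np.random.randn(10, 512)  # 10 tokens, 512-dim embeddings
pos_enc = np.random.randn(10, 512)    # one 512-dim encoding per position

added = token_emb + pos_enc
concatenated = np.concatenate([token_emb, pos_enc], axis=-1)

print(added.shape)         # (10, 512)  - d_model unchanged
print(concatenated.shape)  # (10, 1024) - every downstream weight matrix would have to grow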
Implementation
Sinusoidal Positional Encoding Implementation
import numpy as np
import math

def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding.

    Parameters:
        seq_len: Maximum sequence length
        d_model: Embedding dimension

    Returns:
        Positional encoding matrix of shape (seq_len, d_model)
    """
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index (2 * pair index), so the exponent is i / d_model
            angle = pos / (10000 ** (i / d_model))
            # Even dimensions: sine
            pe[pos, i] = math.sin(angle)
            # Odd dimensions: cosine (guard in case d_model is odd)
            if i + 1 < d_model:
                pe[pos, i + 1] = math.cos(angle)
    return pe

# Example: Generate positional encoding for sequence length 10, dimension 512
seq_len = 10
d_model = 512
positional_encoding = get_positional_encoding(seq_len, d_model)

print(f"Positional encoding shape: {positional_encoding.shape}")  # (10, 512)
print(f"First position encoding (first 10 dims): {positional_encoding[0, :10]}")
print(f"Second position encoding (first 10 dims): {positional_encoding[1, :10]}")

# Each position has a unique encoding pattern (first four dims, approximate values):
# Position 0: [0.0,   1.0,    0.0,   1.0,   ...]
# Position 1: [0.841, 0.540,  0.822, 0.569, ...]
# Position 2: [0.909, -0.416, 0.937, -0.350, ...]
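As a sanity check on the relative-position property discussed earlier, the rotation identity can be verified numerically using the function and imports above (the choice of pos = 7, k = 5, i = 3 is arbitrary):

pe = get_positional_encoding(seq_len=20, d_model=512)

pos, k, i = 7, 5, 3                     # arbitrary position, offset, and dimension-pair index
omega = 1.0 / (10000 ** (2 * i / 512))  # frequency of dimension pair i

# sin(omega * (pos + k)) should equal a rotation of the (sin, cos) pair at pos
lhs = pe[pos + k, 2 * i]
rhs = math.cos(omega * k) * pe[pos, 2 * i] + math.sin(omega * k) * pe[pos, 2 * i + 1]
print(abs(lhs - rhs) < 1e-9)            # True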
Learned Positional Embeddings (BERT-style)
import numpy as np

class LearnedPositionalEmbedding:
    """Learnable positional embeddings (BERT-style lookup table)."""

    def __init__(self, max_seq_len, d_model):
        self.max_seq_len = max_seq_len
        self.d_model = d_model
        # Initialize positional embeddings (learnable parameters in a real framework)
        self.position_embeddings = np.random.randn(max_seq_len, d_model) * 0.02

    def get_embeddings(self, seq_len):
        """Get positional embeddings for a sequence of the given length."""
        if seq_len > self.max_seq_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max {self.max_seq_len}")
        return self.position_embeddings[:seq_len, :]

    def forward(self, token_embeddings, positions=None):
        """
        Add positional embeddings to token embeddings.

        Parameters:
            token_embeddings: (batch, seq_len, d_model)
            positions: Optional array of position indices
        """
        batch_size, seq_len, d_model = token_embeddings.shape
        if positions is None:
            # Use sequential positions 0, 1, ..., seq_len - 1
            pos_emb = self.get_embeddings(seq_len)
        else:
            # Use the provided position indices
            pos_emb = self.position_embeddings[positions, :]
        # Broadcast addition over the batch dimension
        return token_embeddings + pos_emb

# Example usage
max_seq_len, d_model = 512, 768
pos_embedding = LearnedPositionalEmbedding(max_seq_len, d_model)

# Token embeddings (batch=2, seq_len=10, d_model=768)
token_emb = np.random.randn(2, 10, d_model)

# Add positional encoding
output = pos_embedding.forward(token_emb)
print(f"Output shape: {output.shape}")  # (2, 10, 768)
Real-World Applications
Positional Encoding in Modern Models
All transformer-based models use positional encoding:
- BERT: Uses learned positional embeddings (max 512 tokens)
- GPT: Uses learned positional embeddings (maximum context length varies by model version)
- Original Transformer: Used sinusoidal encoding (can handle any length)
- T5: Uses relative position biases added to attention scores rather than absolute position embeddings
Handling Variable Sequence Lengths
Positional encoding enables:
- Processing sequences of different lengths
- Understanding relative positions (near vs far)
- Capturing temporal order in time-series data
- Maintaining word order in translation tasks
Limitations and Solutions
Learned positional embeddings:
- Limited to maximum sequence length seen during training
- Cannot extrapolate to longer sequences
- Solution: Use relative positional encoding or extend embeddings
Sinusoidal encoding:
- Can handle any sequence length
- But may not be optimal for specific tasks
- Solution: Often replaced with learned embeddings for better performance (a small sketch contrasting the two behaviors follows below)
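A small sketch contrasting the two behaviors, reusing get_positional_encoding and LearnedPositionalEmbedding from the Implementation section (the 600-token length is an arbitrary example beyond a 512-token limit):

# Sinusoidal: encodings can be generated for any length on the fly
long_pe = get_positional_encoding(seq_len=600, d_model=512)
print(long_pe.shape)  # (600, 512)

# Learned: positions beyond max_seq_len simply do not exist in the table
learned = LearnedPositionalEmbedding(max_seq_len=512, d_model=512)
try:
    learned.get_embeddings(600)
except ValueError as e:
    print(e)  # Sequence length 600 exceeds max 512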