Chapter 10: Transformer Variants & Optimizations
Beyond the Original
Learning Objectives
- Understand the fundamentals of transformer variants and optimizations
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of transformer variants & optimizations, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
Why This Matters
Understanding transformer variants & optimizations is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
Transformer Variants
Many architectures extend the original transformer (a short loading sketch follows this list):
- BERT: Bidirectional encoder for understanding tasks
- GPT: Autoregressive decoder for generation
- T5: Encoder-decoder for text-to-text tasks
- RoBERTa: Optimized BERT training
- ALBERT: Parameter-sharing for efficiency
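As a quick orientation, the sketch below loads small public checkpoints for three of these variants with the Hugging Face transformers library (assumed installed; the checkpoint names bert-base-uncased, gpt2, and t5-small are just convenient illustrative choices):
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only (BERT): bidirectional representations for understanding tasks
bert = AutoModel.from_pretrained("bert-base-uncased")
# Decoder-only (GPT-2): autoregressive next-token generation
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
# Encoder-decoder (T5): text-to-text tasks
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

for name, model in [("BERT", bert), ("GPT-2", gpt), ("T5", t5)]:
    print(name, sum(p.numel() for p in model.parameters()), "parameters")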
Optimization Techniques
Key optimizations improve efficiency:
- Sparse Attention: Reduce computation by attending to fewer positions
- Linear Attention: Approximate attention with linear complexity
- Quantization: Use lower precision to reduce memory
- Pruning: Remove less important weights (see the magnitude-pruning sketch after this list)
- Knowledge Distillation: Train smaller models from larger ones
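For pruning specifically, here is a minimal magnitude-pruning sketch using PyTorch's torch.nn.utils.prune utilities (the layer size and the 30% pruning amount are arbitrary values chosen for illustration):
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
# Zero out the 30% of weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(f"Sparsity: {(layer.weight == 0).float().mean().item():.0%}")
# Make the pruning permanent (folds the mask into the weight tensor)
prune.remove(layer, "weight")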
Scaling Strategies
Methods to scale transformers:
- Model Parallelism: Distribute model across devices
- Pipeline Parallelism: Split layers across GPUs
- Mixed Precision: Use float16 for speed, float32 for stability (see the training-loop sketch after this list)
- Gradient Checkpointing: Trade compute for memory
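As an illustration of mixed precision, here is a minimal training-loop sketch using PyTorch automatic mixed precision (assumes a CUDA-capable GPU; the toy linear model and squared-output loss are placeholders, not a real training objective):
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales gradients to avoid float16 underflow

for step in range(10):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs largely in float16
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then optimizer step
    scaler.update()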
Mathematical Formulations
Efficient Attention Variants
Standard self-attention compares every pair of positions, costing \(O(n^2)\) time and memory in the sequence length \(n\); sparse, local, and kernelized (linear) variants reduce this to sub-quadratic or linear cost, making transformers practical for long sequences.
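For example, kernelized linear attention replaces \(\mathrm{softmax}(QK^\top)V\) with a feature map \(\phi\) so that \(\phi(Q)\big(\phi(K)^\top V\big)\) can be computed in \(O(n)\). A minimal sketch follows, using \(\mathrm{elu}(x) + 1\) as the feature map (a common choice in linear-attention work); batch-first tensors are assumed and this is a toy approximation, not a production implementation:
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Linear-complexity attention via a positive feature map."""
    phi_q = F.elu(Q) + 1              # (batch, n, d)
    phi_k = F.elu(K) + 1              # (batch, n, d)
    # Aggregate keys and values once: O(n * d^2) instead of O(n^2 * d)
    kv = torch.einsum("bnd,bne->bde", phi_k, V)
    # Per-query normalizer (denominator of the attention weights)
    z = 1.0 / (torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", phi_q, kv, z)

out = linear_attention(torch.randn(2, 10, 64), torch.randn(2, 10, 64), torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])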
Model Parallelism
Parallelization strategies split a model's parameters, layers, or tensor operations across multiple GPUs/TPUs, enabling training and inference of models that do not fit on a single device.
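Here is a minimal sketch of naive model parallelism in PyTorch, placing half of a toy model on each of two GPUs (assumes cuda:0 and cuda:1 are available; real systems add pipeline scheduling and tensor-parallel layers on top of this basic idea):
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Toy model split across two GPUs: each half lives on its own device."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(2048, 512)).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # activations are copied between devices

model = TwoDeviceModel()
y = model(torch.randn(8, 512))
print(y.shape, y.device)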
Quantization
\(W_q = \mathrm{round}(W \cdot s), \qquad W \approx W_q / s\)
Where \(W_q\) is the quantized weight, \(W\) is the original weight, and \(s\) is the quantization scale factor (for int8, \(s = 127 / \max|W|\), matching the code example later in this chapter).
Quantization reduces model size and memory by using lower precision (e.g., int8 instead of float32) while maintaining reasonable accuracy.
Knowledge Distillation
\(\mathcal{L} = \alpha\,\mathcal{L}_{\text{task}} + (1 - \alpha)\, T^2\, \mathrm{KL}\!\left(P_{\text{teacher}} \,\|\, P_{\text{student}}\right)\)
Where \(P_{\text{teacher}}\) and \(P_{\text{student}}\) are the probability distributions from the teacher and student models (softened with temperature \(T\)), and \(\alpha\) balances the task loss and the distillation loss.
Knowledge distillation trains smaller student models to mimic larger teacher models, transferring knowledge while reducing size.
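A minimal sketch of this loss in PyTorch (the temperature \(T\) and \(\alpha\) values are illustrative defaults, not tuned settings):
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the ground truth
    task_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    return alpha * task_loss + (1 - alpha) * distill_loss

student = torch.randn(4, 10)   # toy logits: batch of 4, 10 classes
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels).item())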
Detailed Examples
Example: Sparse Attention Pattern
Instead of full attention (all-to-all), sparse attention might use:
- Local attention: Each token attends to nearby tokens (window of size w)
- Strided attention: Attend to every k-th token
- Global attention: Some tokens attend to all positions
This reduces computation from O(n²) to O(n·w) for local attention.
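A minimal NumPy sketch of these three mask patterns (the window, stride, and number of global tokens are arbitrary illustrative values):
import numpy as np

def sparse_masks(seq_len, window=4, stride=4, n_global=2):
    idx = np.arange(seq_len)
    # Local: each position attends within a window around itself
    local = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    # Strided: every position attends to every stride-th position
    strided = np.broadcast_to((idx[None, :] % stride) == 0, (seq_len, seq_len))
    # Global: the first n_global tokens attend to and are attended by everyone
    global_mask = np.zeros((seq_len, seq_len), dtype=bool)
    global_mask[:n_global, :] = True
    global_mask[:, :n_global] = True
    return local, strided, global_mask

local, strided, global_mask = sparse_masks(16)
print("local entries:", local.sum(), "vs full attention:", 16 * 16)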
Example: Model Variants Comparison
BERT vs GPT vs T5:
- BERT: "The cat sat" → [CLS] representation used for classification
- GPT: "The cat sat" → predicts "on" (next token)
- T5: "translate English to French: The cat sat" → "Le chat s'est assis"
Each architecture is optimized for different task types.
Example: Quantization Impact
Original model: 1 billion parameters × 4 bytes (float32) = 4 GB
Quantized (int8): 1 billion parameters × 1 byte = 1 GB
A 4× reduction in memory, with minimal accuracy loss when done carefully.
Implementation
Sparse Attention Implementation
import torch
import torch.nn as nn

def sparse_attention(Q, K, V, window_size=3):
    """
    Sparse attention with a local window.

    Args:
        Q, K, V: Query, Key, Value tensors (batch, seq_len, d_model)
        window_size: Number of nearby tokens each position attends to
    """
    seq_len = Q.size(1)
    # Scaled dot-product scores: (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    # Create sparse mask (only attend to nearby tokens)
    mask = torch.zeros(seq_len, seq_len, device=Q.device)
    for i in range(seq_len):
        start = max(0, i - window_size // 2)
        end = min(seq_len, i + window_size // 2 + 1)
        mask[i, start:end] = 1
    # Positions outside the window get -inf so softmax assigns them zero weight
    scores = scores.masked_fill(mask == 0, float('-inf'))
    attn_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attn_weights, V)
    return output
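A quick toy invocation of the function above with random tensors:
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = sparse_attention(Q, K, V, window_size=3)
print(out.shape)  # torch.Size([2, 10, 64])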
Quantization Example
import numpy as np

def quantize_weights(weights, bits=8):
    """
    Symmetric quantization of float weights to signed integers (int8 by default).
    """
    # Scale so the largest-magnitude weight maps to the top of the int range
    max_val = np.abs(weights).max()
    scale = (2 ** (bits - 1) - 1) / max_val
    # Scale, round, and cast to int8
    quantized = np.round(weights * scale).astype(np.int8)
    return quantized, scale

def dequantize_weights(quantized, scale):
    """
    Dequantize back to float32.
    """
    return quantized.astype(np.float32) / scale

# Example
weights = np.random.randn(100, 100).astype(np.float32)
quantized, scale = quantize_weights(weights)
reconstructed = dequantize_weights(quantized, scale)
print(f"Original size: {weights.nbytes} bytes")
print(f"Quantized size: {quantized.nbytes} bytes")
print(f"Compression: {weights.nbytes / quantized.nbytes:.1f}x")
Real-World Applications
BERT Applications
Understanding tasks:
- Sentiment analysis in customer reviews
- Named entity recognition in documents
- Question answering systems
- Text classification and tagging
GPT Applications
Generation tasks:
- Chatbots and conversational AI
- Code generation and completion
- Content creation and writing assistance
- Text summarization
T5 Applications
Text-to-text tasks (a translation sketch follows this list):
- Machine translation
- Text summarization
- Question answering
- Text classification (formatted as generation)
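As a quick demonstration of the text-to-text framing, a sketch using the Hugging Face pipeline API (assumes the transformers library is installed and the t5-small checkpoint can be downloaded):
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The cat sat on the mat."))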
Optimization Benefits
Efficiency improvements enable:
- Faster inference for real-time applications
- Deployment on edge devices
- Reduced computational costs
- Handling longer sequences