Chapter 9: Complete Transformer Architecture
Putting It All Together
Learning Objectives
- Understand complete transformer architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Complete Transformer Architecture
The Complete Picture
The Transformer combines all the components we've covered so far: embeddings, positional encoding, multi-head attention, feed-forward networks, residual connections, and layer normalization, into a single powerful architecture.
Complete Transformer Architecture Diagram (described in text):
- Input tokens: ["The", "cat", "sat"]
- Token embeddings + positional encoding
- Encoder stack (N layers); each encoder layer applies multi-head attention followed by a feed-forward network
- Output: rich, contextualized representations
Key Components: Each encoder layer has multi-head attention + FFN, both with residual connections and layer normalization. The stack processes the input through N layers, building increasingly rich representations.
Information Flow Through Transformer
Complete data flow (sketched in code after this list):
- Input: Tokenized text → ["The", "cat", "sat"]
- Embeddings: Each token → dense vector (e.g., 512 dimensions)
- Positional Encoding: Add position information
- Encoder Layer 1: Multi-head attention + FFN → refined representations
- Encoder Layers 2-N: Further refinement through each layer
- Output: Rich, contextualized representations ready for downstream tasks
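The encoder-side flow above can be made concrete with PyTorch's built-in modules. The sketch below is illustrative only: the vocabulary size, dimensions, and token IDs are invented, and the positional encoding is left as a zero placeholder where a real sinusoidal encoding would go.

import torch
import torch.nn as nn

vocab_size, d_model, n_heads, n_layers, seq_len = 10_000, 512, 8, 6, 3

embed = nn.Embedding(vocab_size, d_model)
pos_enc = torch.zeros(1, seq_len, d_model)  # placeholder for a sinusoidal positional encoding
encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

token_ids = torch.tensor([[12, 47, 305]])   # "The cat sat" -> hypothetical token IDs
x = embed(token_ids) + pos_enc              # (1, 3, 512): embeddings + positions
h = encoder(x)                              # (1, 3, 512): contextualized representations
print(h.shape)                              # torch.Size([1, 3, 512])

Each token keeps its own position in the output tensor, but after the encoder stack every vector also reflects the other tokens in the sequence.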
Key Concepts
Complete Transformer Architecture
The full transformer combines all components (a code sketch follows this list):
- Input Processing: Tokenization → Embeddings → Positional Encoding
- Encoder Stack: Multiple encoder layers (self-attention + FFN)
- Decoder Stack: Multiple decoder layers (masked self-attention + cross-attention + FFN)
- Output Generation: Linear projection → Softmax → Token prediction
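As a rough end-to-end sketch of these stages (tokenization and training omitted), the snippet below wires an embedding layer, PyTorch's nn.Transformer, and an output projection together. All sizes and the random token IDs are illustrative placeholders, not values from this chapter.

import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)          # output projection to the vocabulary

src_ids = torch.randint(0, vocab_size, (1, 5))     # source sentence (token IDs)
tgt_ids = torch.randint(0, vocab_size, (1, 4))     # decoder input generated so far
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))  # causal mask

# NOTE: nn.Transformer does not add positional encodings; they are omitted here for brevity.
out = transformer(embed(src_ids), embed(tgt_ids), tgt_mask=tgt_mask)  # (1, 4, 512)
logits = to_vocab(out)                             # (1, 4, vocab_size)
probs = logits.softmax(dim=-1)                     # distribution over next tokens per position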
Information Flow Through Layers
Early layers: Capture local patterns, syntax, word-level relationships
Middle layers: Build phrase-level understanding, semantic relationships
Deep layers: Develop high-level abstractions, task-specific features
Each layer refines and abstracts the representation from the previous layer.
Training Challenges
Key challenges in training transformers:
- Memory: Large models require significant GPU memory
- Compute: Training takes weeks to months on many GPUs
- Data: Need massive, high-quality datasets
- Hyperparameters: Learning rate, warmup, batch size all critical
- Stability: Deep networks prone to gradient issues
Mathematical Formulations
Complete Transformer Forward Pass
Encoder Stack:
- \(X_{\text{enc}} = \text{Embed}(X_{\text{enc}}) + PE\)
- \(H = \text{EncoderLayer}_N(\ldots\text{EncoderLayer}_1(X_{\text{enc}}))\)
Decoder Stack:
- \(X_{\text{dec}} = \text{Embed}(X_{\text{dec}}) + PE\)
- \(O = \text{DecoderLayer}_N(\ldots\text{DecoderLayer}_1(X_{\text{dec}}, H))\), where every decoder layer cross-attends to the encoder output \(H\)
Output:
- \(\text{Output} = \text{Softmax}(O \cdot W_{\text{out}})\)
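Read as code, the layer composition is simply a loop. The sketch below is schematic only: embed, pos_enc, the layer lists, and W_out are hypothetical callables and tensors standing in for the symbols above, not concrete PyTorch modules.

def transformer_forward(src_ids, tgt_ids, embed, pos_enc, encoder_layers, decoder_layers, W_out):
    # Encoder stack: X_enc = Embed(src) + PE, then EncoderLayer_1 ... EncoderLayer_N
    h = embed(src_ids) + pos_enc(src_ids)
    for enc_layer in encoder_layers:
        h = enc_layer(h)
    # Decoder stack: each layer uses the previous decoder state and the encoder memory H
    o = embed(tgt_ids) + pos_enc(tgt_ids)
    for dec_layer in decoder_layers:
        o = dec_layer(o, h)
    # Output: project to the vocabulary and apply softmax
    return (o @ W_out).softmax(dim=-1)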
Training Loss
The model is trained with the standard next-token cross-entropy objective:
\(\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta)\)
Where:
- \(N\): Number of training examples
- \(y_i\): Target token at position i
- \(x_i\): Input context up to position i
- \(\theta\): Model parameters
- \(P(y_i | x_i, \theta)\): Model's predicted probability
This is the standard cross-entropy loss for language modeling.
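A small numeric check of this loss, using toy logits over a 5-token vocabulary for N = 3 positions (the values are arbitrary), shows that the explicit formula and PyTorch's built-in cross-entropy agree:

import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)          # model scores for 3 positions, vocabulary of 5
targets = torch.tensor([1, 4, 2])   # y_i: correct next token at each position

# Explicit form: L = -(1/N) * sum_i log P(y_i | x_i, theta)
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(3), targets].mean()

# Library form used in the training loop below
library_loss = F.cross_entropy(logits, targets)
assert torch.allclose(manual_loss, library_loss)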
Learning Rate Schedule
Phases (formalized in the schedule below):
- Warmup: Gradually increase from 0 to max learning rate
- Constant: Maintain max learning rate
- Decay: Gradually decrease learning rate
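One common concrete instance, a linear warmup followed by linear decay (skipping the constant phase), can be written as a multiplier on the peak learning rate; this is the same schedule implemented by lr_lambda in the training loop later in this chapter:

\[
\text{lr}(t) = \text{lr}_{\max} \cdot
\begin{cases}
t / T_{\text{warmup}}, & t < T_{\text{warmup}} \\[4pt]
\dfrac{T_{\text{total}} - t}{T_{\text{total}} - T_{\text{warmup}}}, & t \ge T_{\text{warmup}}
\end{cases}
\]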
Detailed Examples
Example: Complete Transformer Processing
Task: Translate "Hello world" to French
Step 1: Input Processing
- Encoder input: "Hello world" → Token IDs → Embeddings → + Positional Encoding
- Decoder input: [START] → Token ID → Embedding → + Positional Encoding
Step 2: Encoder Processing
- Layer 1: Self-attention captures relationships between "Hello" and "world"
- Layers 2-6: Further refinement of representations
- Output: Rich contextualized representation of input
Step 3: Decoder Processing
- Masked self-attention: [START] attends to itself
- Cross-attention: [START] attends to encoder output
- FFN: Processes the combined information
- Output: Probability distribution over French vocabulary
Step 4: Generation (sketched in code after this list)
- Select "Bonjour" (the highest-probability token under greedy decoding)
- Add to decoder input: [START, "Bonjour"]
- Repeat until [END] token or max length
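The generation loop in Step 4 can be sketched as greedy decoding. The interface assumed here, model(src_ids, tgt_ids) returning per-position logits, and the start_id/end_id values are illustrative assumptions, not a specific tokenizer's conventions.

import torch

def greedy_decode(model, src_ids, start_id, end_id, max_len=50):
    model.eval()
    tgt_ids = torch.tensor([[start_id]])                 # decoder starts with [START]
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src_ids, tgt_ids)             # (1, cur_len, vocab_size)
            next_id = logits[0, -1].argmax().item()      # pick the highest-probability token
            tgt_ids = torch.cat([tgt_ids, torch.tensor([[next_id]])], dim=1)
            if next_id == end_id:                        # stop at [END]
                break
    return tgt_ids

Beam search or sampling can replace the argmax step, but the overall loop, append the chosen token and feed the sequence back in, stays the same.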
Example: Training Setup
Typical configuration for a large transformer (a gradient-accumulation sketch follows this list):
- Model size: 12 layers, 768 dimensions, 12 heads
- Batch size: 256 (with gradient accumulation)
- Learning rate: 1e-4 with warmup to 1e-3
- Optimizer: AdamW with weight decay 0.01
- Training time: 1-2 weeks on 8 GPUs
- Data: Millions of sentence pairs
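The quoted effective batch size of 256 is usually reached via gradient accumulation. The sketch below assumes the same model, optimizer, criterion, and train_loader names as the training loop in the Implementation section; the per-step batch size and accumulation factor are invented examples.

accum_steps = 8                      # e.g. 32 sequences per step * 8 = 256 effective batch
optimizer.zero_grad()
for i, (src, tgt) in enumerate(train_loader):
    output = model(src, tgt[:, :-1])
    loss = criterion(output.reshape(-1, output.size(-1)), tgt[:, 1:].reshape(-1))
    (loss / accum_steps).backward()  # scale so gradients average over the large effective batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # update only after accumulating accum_steps mini-batches
        optimizer.zero_grad()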
Implementation
Complete Transformer Training Loop
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
def train_transformer(model, train_loader, num_epochs, warmup_steps, total_steps):
    """
    Training loop for a transformer model: warmup + linear-decay learning rate,
    teacher forcing, gradient clipping.
    """
    optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

    # Learning rate schedule with warmup followed by linear decay
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps  # Warmup: 0 -> peak
        # Linear decay; clamp at 0 in case training runs past total_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padding positions

    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, (src, tgt) in enumerate(train_loader):
            # Forward pass (teacher forcing): decoder sees the target shifted right
            output = model(src, tgt[:, :-1])  # Exclude last token from decoder input
            target = tgt[:, 1:]               # Shift by one for next-token prediction

            # Compute loss over all positions
            loss = criterion(output.reshape(-1, output.size(-1)), target.reshape(-1))

            # Backward pass
            optimizer.zero_grad()
            loss.backward()

            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()
            scheduler.step()

            total_loss += loss.item()
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')

        print(f'Epoch {epoch} average loss: {total_loss / len(train_loader):.4f}')

# Example usage (TransformerModel is assumed to be defined elsewhere)
# model = TransformerModel(vocab_size=50000, d_model=512, nhead=8, num_layers=6)
# train_transformer(model, train_loader, num_epochs=10, warmup_steps=4000, total_steps=100000)
Real-World Applications
Complete Transformer Applications
Encoder-decoder transformers are used for:
- Machine Translation: Google Translate, DeepL use transformer architectures
- Text Summarization: Generating concise summaries from long documents
- Question Answering: Systems that answer questions based on context
- Dialogue Systems: Conversational AI that maintains context
Training Infrastructure
Large-scale training requires:
- Distributed Training: Multiple GPUs/TPUs working together
- Data Pipeline: Efficient data loading and preprocessing
- Monitoring: Track loss, learning rate, gradient norms
- Checkpointing: Save model state regularly
- Mixed Precision: Use float16/bfloat16 arithmetic for speed and memory savings (see the sketch after this list)
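As a sketch of the mixed-precision point using torch.cuda.amp (one common approach, CUDA only), the snippet below shows a single training step and assumes the same model, optimizer, criterion, src, and tgt names as the training loop above.

import torch

scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():                  # run the forward pass in reduced precision
    output = model(src, tgt[:, :-1])
    loss = criterion(output.reshape(-1, output.size(-1)), tgt[:, 1:].reshape(-1))

scaler.scale(loss).backward()                    # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)                           # unscale gradients, then take the optimizer step
scaler.update()                                  # adjust the scale factor for the next step
optimizer.zero_grad()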
Production Deployment
Deploying trained transformers:
- Model Optimization: Quantization and pruning for efficiency (a quantization sketch follows this list)
- Inference Optimization: Batch processing, caching
- Scalability: Handle multiple concurrent requests
- Monitoring: Track latency, throughput, accuracy
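As one example of the model-optimization step, PyTorch's dynamic quantization converts the linear layers of a trained model to 8-bit integer weights. This is a minimal sketch: model stands for any trained transformer whose nn.Linear layers dominate compute, and dynamic quantization mainly benefits CPU inference.

import torch

quantized = torch.quantization.quantize_dynamic(
    model,                      # trained float32 model
    {torch.nn.Linear},          # quantize the linear (projection/FFN) layers
    dtype=torch.qint8,          # 8-bit integer weights
)
# `quantized` is used exactly like `model` at inference time, typically with a
# smaller memory footprint and lower CPU latency; accuracy should be re-checked.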