Chapter 4: GPT Architecture
Understanding Decoder-Only Models
Learning Objectives
- Understand GPT architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of the GPT architecture, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
📚 Why This Matters
Understanding the GPT architecture is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
GPT Architecture Overview
GPT (Generative Pre-trained Transformer) is a decoder-only model (a minimal decoder-block sketch follows this list):
- Uses only decoder layers from transformer architecture
- Autoregressive: generates text one token at a time
- Causal masking: can only attend to previous tokens
- Pre-trained on next token prediction
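To make the structure concrete, here is a minimal sketch of a single decoder block in PyTorch, using a pre-norm layout as in GPT-2. The dimensions and the DecoderBlock class name are illustrative assumptions, not the exact Hugging Face implementation.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style decoder block: masked self-attention followed by an MLP."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True above the diagonal = "may not attend" (future positions)
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.ln2(x))    # residual connection around the MLP
        return x

# A full GPT stacks many such blocks on top of token + position embeddings
block = DecoderBlock()
hidden = torch.randn(1, 5, 768)   # (batch, seq_len, d_model)
out = block(hidden)               # same shape: (1, 5, 768)

On top of the final block, an output projection maps each hidden state back to vocabulary logits; this is where the next-token distribution in the Mathematical Formulations section comes from.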
Key difference from BERT:
- BERT: Bidirectional, understands context from both directions
- GPT: Unidirectional, generates left-to-right
- BERT: Better for understanding tasks
- GPT: Better for generation tasks
Scaling in GPT Models
GPT models have scaled dramatically:
- GPT-1 (2018): 117M parameters
- GPT-2 (2019): 1.5B parameters
- GPT-3 (2020): 175B parameters
- GPT-4 (2023): parameter count not officially disclosed (commonly estimated at over 1T)
With scale comes:
- Better language understanding
- Emergent capabilities (reasoning, few-shot learning)
- More coherent and contextually appropriate generation
- Ability to follow complex instructions
Generation Process
How GPT generates text (a minimal sketch of this loop follows the list):
- Start with prompt/context
- Process through transformer layers
- Output probability distribution over vocabulary
- Select the next token (greedy argmax, or sampling with temperature, top-k, or top-p)
- Append to sequence, repeat until stop condition
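A minimal, runnable version of this loop with GPT-2 and greedy decoding might look like the sketch below (sampling strategies are covered in the Implementation section; the 10-token budget is an arbitrary choice for illustration):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):                                # generate at most 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits           # (1, seq_len, vocab_size)
    next_token = logits[0, -1].argmax()            # greedy: pick the most probable token
    if next_token.item() == tokenizer.eos_token_id:
        break                                      # stop condition
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))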
Mathematical Formulations
Autoregressive Generation
At each step, the model predicts a distribution over the next token given everything generated so far:
\[
P(x_{t+1} \mid x_1, \ldots, x_t) = \text{softmax}(W h_t)
\]
Where:
- \(x_1, \ldots, x_t\): Previous tokens
- \(h_t\): Hidden state at position t (from transformer)
- \(W\): Output projection matrix
- Output: Probability distribution over vocabulary
Complete Sequence Probability
The probability of a complete sequence is the product of conditional probabilities. Each token's probability depends on all previous tokens.
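Spelled out with the chain rule of probability:
\[
P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})
\]
Training maximizes the log of this product, which is exactly the next-token prediction objective described above.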
Causal Attention Mask
Position i can only attend to positions j where j ≤ i. This ensures the model cannot see future tokens during training or generation.
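Written as an additive mask \(M\) applied to the attention scores before the softmax (equivalent to the masked_fill approach in the implementation later in this chapter):
\[
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}
\qquad
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V
\]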
Detailed Examples
Example: GPT Generation Process
Prompt: "The capital of France is"
Step 1: Tokenization
- Input → ["The", "capital", "of", "France", "is"] (simplified; GPT-2's BPE tokens actually include leading spaces)
- Each token converted to embedding
Step 2: Forward Pass
- Process through 12-96 transformer layers (depending on model)
- Each layer refines the representation
- Final hidden state for "is" position
Step 3: Prediction
- Output projection → vocabulary probabilities
- P("Paris") = 0.85, P("London") = 0.05, ...
Step 4: Sampling
- Sample "Paris"
- New sequence: "The capital of France is Paris"
- Continue generating if needed
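Steps 2-3 can be reproduced with the Hugging Face transformers library; the probabilities shown above are illustrative, and the actual values depend on the checkpoint:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # (1, seq_len, vocab_size)

# Step 3: distribution over the vocabulary at the last position ("is")
probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = probs.topk(5)
for p, tok_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(tok_id))!r}: {p.item():.3f}")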
Example: Causal Masking
Sequence: "The cat sat"
Attention patterns:
- "The" can only attend to itself
- "cat" can attend to "The" and itself
- "sat" can attend to "The", "cat", and itself
- Cannot see future tokens
This matches inference: during generation the model only sees previous tokens, so training with the same causal mask keeps training and inference consistent.
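The same pattern as a 3×3 mask (rows are query positions, columns are key positions, 1 means the position may be attended to):

import torch

tokens = ["The", "cat", "sat"]
mask = torch.tril(torch.ones(len(tokens), len(tokens)))
print(mask)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])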
Implementation
GPT Text Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def generate_text(prompt, max_length=50, temperature=0.7):
    """
    Generate text using a GPT model.
    """
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate autoregressively
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode generated token IDs back to text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example
prompt = "The capital of France is"
result = generate_text(prompt)
print(result)  # e.g. "The capital of France is Paris ..." (sampled output varies)
Implementing Causal Mask
import torch
import torch.nn.functional as F

def create_causal_mask(seq_len):
    """
    Create a causal (lower-triangular) attention mask.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

def masked_attention(Q, K, V, mask):
    """
    Scaled dot-product attention with causal masking.
    """
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)

    # Apply mask (set masked positions to -inf)
    mask = mask.unsqueeze(0).unsqueeze(0)  # Add batch and head dimensions
    scores = scores.masked_fill(mask == 0, float('-inf'))

    # Softmax over the key dimension
    attention_weights = F.softmax(scores, dim=-1)

    # Apply to values
    output = torch.matmul(attention_weights, V)
    return output

# Example
seq_len = 5
Q = torch.randn(1, 1, seq_len, 64)  # (batch, heads, seq, dim)
K = torch.randn(1, 1, seq_len, 64)
V = torch.randn(1, 1, seq_len, 64)
mask = create_causal_mask(seq_len)
output = masked_attention(Q, K, V, mask)
Real-World Applications
GPT Applications
Text Generation:
- Chatbots and conversational AI (ChatGPT)
- Content creation (articles, stories, marketing)
- Creative writing assistance
- Email and document drafting
Code Generation:
- GitHub Copilot (code completion)
- Code generation from natural language
- Code explanation and documentation
- Bug fixing and refactoring suggestions
Specialized Tasks:
- Text summarization
- Translation
- Question answering
- Data extraction and formatting
GPT vs BERT Use Cases
Use GPT when:
- You need to generate new text
- Task requires creativity or continuation
- You want conversational responses
- Task benefits from autoregressive generation
Use BERT when:
- You need to understand/classify existing text
- Task requires bidirectional context
- You're extracting information, not generating
- Task is classification or understanding