Chapter 2: Pre-training Strategies
Learning from Unlabeled Data
Learning Objectives
- Understand the fundamentals of pre-training strategies
- Master the mathematical foundations of the core objectives
- Learn practical implementation techniques
- Apply the concepts through worked examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of pre-training strategies, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
📚 Why This Matters
Understanding pre-training strategies is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
Pre-training Objectives
Autoregressive Language Modeling (GPT-style):
- Predict next token given previous tokens
- Unidirectional (left-to-right)
- Enables text generation
- Training: "The cat sat" → predict "on"
Masked Language Modeling (BERT-style):
- Predict masked tokens from the surrounding context
- Bidirectional (uses both left and right context)
- Better suited to understanding tasks
- Training: "The [MASK] sat" → predict "cat"
Data Requirements
Scale matters:
- GPT-3: Trained on ~300B tokens (sampled from a roughly 500B-token dataset)
- LLaMA (65B): Trained on ~1.4T tokens
- More data generally leads to better performance
- Quality is as important as quantity
Data sources:
- Web text (Common Crawl)
- Books and literature
- Wikipedia and encyclopedias
- Code repositories
- Scientific papers
Training Challenges
Computational requirements:
- GPT-3: Weeks to months of training on thousands of GPUs
- Memory: Models require hundreds of GB for weights and optimizer state (see the estimate after this list)
- Cost: Millions of dollars in compute
- Infrastructure: Distributed training across data centers
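To make the memory figure concrete, here is a rough back-of-the-envelope estimate of parameter storage alone for a 175B-parameter model; training needs several times more once gradients, optimizer state, and activations are included (illustrative arithmetic, not a measurement of any specific system).

# Approximate memory for model parameters alone; training adds gradients,
# optimizer state, and activations on top of this figure.
def param_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2)]:
    print(f"{precision}: {param_memory_gb(175e9, nbytes):.0f} GB")
# fp32: 700 GB, fp16/bf16: 350 GB -- far more than a single GPU can hold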
Mathematical Formulations
Autoregressive Language Modeling
\[
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta)
\]
Where:
- \(x_i\): Token at position \(i\)
- \(N\): Length of the training sequence
- \(\theta\): Model parameters
- The model is trained to maximize the probability of each token given all previous tokens (equivalently, to minimize the negative log-likelihood)
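A minimal sketch of how this loss is typically computed in PyTorch, assuming the model returns raw logits of shape (batch, seq_len, vocab_size) and that the target at each position is simply the next token of the same sequence:

import torch.nn.functional as F

def autoregressive_lm_loss(logits, input_ids):
    """Negative log-likelihood of each token given its left context.

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids, used both as input and as targets
    """
    # Position t predicts token t+1: drop the last logit and the first label
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))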
Masked Language Modeling
\[
\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{i \in M} \log P(x_i \mid x_{\backslash M}; \theta)
\]
Where:
- \(M\): Set of masked token positions
- \(x_{\backslash M}\): All tokens except the masked ones
- \(\theta\): Model parameters
- The model predicts each masked token using bidirectional context
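A corresponding sketch for the masked objective, assuming the common convention that labels at unmasked positions are set to -100 so they contribute nothing to the loss (the ignore_index mechanism of PyTorch's cross_entropy):

import torch.nn.functional as F

def masked_lm_loss(logits, labels):
    """Cross-entropy computed over masked positions only.

    logits: (batch, seq_len, vocab_size) model outputs
    labels: (batch, seq_len) original token ids at masked positions, -100 elsewhere
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1),
                           ignore_index=-100)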
Next Sentence Prediction (BERT)
Binary classification task: given a pair of sentences (A, B), predict whether Sentence B actually follows Sentence A in the source text or was sampled at random from the corpus. This auxiliary objective encourages the model to learn sentence-level relationships.
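A hedged sketch of how such training pairs might be built (the helper name make_nsp_pair is hypothetical, not part of any library): half the examples use the true next sentence, half use a random one.

import random

def make_nsp_pair(doc_sentences, corpus_sentences):
    """Build one (sentence_a, sentence_b, is_next) example for NSP.

    doc_sentences:    consecutive sentences from a single document
    corpus_sentences: pool of sentences from the whole corpus (for negatives)
    """
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        sentence_b, is_next = doc_sentences[i + 1], 1              # true next sentence
    else:
        sentence_b, is_next = random.choice(corpus_sentences), 0   # random sentence
    return sentence_a, sentence_b, is_next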
Detailed Examples
Example: Autoregressive Pre-training
Training sequence: "The cat sat on the mat"
Training examples created:
- Context: "The" → Target: "cat"
- Context: "The cat" → Target: "sat"
- Context: "The cat sat" → Target: "on"
- Context: "The cat sat on" → Target: "the"
- Context: "The cat sat on the" → Target: "mat"
Model learns: Given any context, predict the most likely next token. This builds understanding of language patterns, grammar, and semantics.
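The training pairs listed above can be reproduced in a few lines; whitespace splitting stands in for a real subword tokenizer, purely to keep the illustration readable.

# Enumerate (context, target) pairs for one training sequence.
# Whitespace "tokens" are used for illustration only.
words = "The cat sat on the mat".split()
for i in range(1, len(words)):
    context, target = " ".join(words[:i]), words[i]
    print(f'Context: "{context}" -> Target: "{target}"')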
Example: Masked Language Modeling
Original: "The cat sat on the mat"
Masked version: "The [MASK] sat on the mat"
Model task: Predict what [MASK] should be
Model sees: All tokens except the masked one (bidirectional context)
Prediction: P("cat") = 0.9, P("dog") = 0.05, P("bird") = 0.03, ...
Learning: Model learns to use context from both directions to understand word meaning and relationships.
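The same bidirectional prediction can be observed directly with a pretrained BERT model via the Hugging Face fill-mask pipeline; the exact probabilities will differ from the illustrative numbers above.

from transformers import pipeline

# Downloads bert-base-uncased on first use
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The [MASK] sat on the mat."):
    print(f'{prediction["token_str"]}: {prediction["score"]:.3f}')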
Implementation
Pre-training Data Preparation
import torch
from torch.utils.data import Dataset, DataLoader

class LanguageModelingDataset(Dataset):
    """Dataset for autoregressive language modeling."""

    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        # Tokenize to a fixed length (truncate long texts, pad short ones)
        tokens = self.tokenizer.encode(text, max_length=self.max_length,
                                       truncation=True, padding='max_length')
        # Create input and target: the target is the input shifted left by one token
        input_ids = torch.tensor(tokens[:-1])
        labels = torch.tensor(tokens[1:])
        return input_ids, labels

# Example usage
texts = [
    "The cat sat on the mat.",
    "Machine learning is fascinating.",
    "Transformers revolutionized NLP."
]
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
# dataset = LanguageModelingDataset(texts, tokenizer)
# dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
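A minimal sketch of one optimization step over this data, assuming model is a causal language model that maps input_ids of shape (batch, seq_len) directly to logits of shape (batch, seq_len, vocab_size); Hugging Face models instead return an output object whose logits live in .logits.

import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_ids, labels):
    """One gradient step of next-token prediction on a pre-shifted batch."""
    logits = model(input_ids)                      # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (commented; requires a model and the dataloader defined above):
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# for input_ids, labels in dataloader:
#     loss = train_step(model, optimizer, input_ids, labels)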
Masked Language Modeling Data Preparation
import random
import torch

def create_masked_lm_example(tokens, tokenizer, mask_prob=0.15):
    """Create a masked language modeling example from a list of token ids."""
    tokens = list(tokens)       # work on a copy so the caller's list is not modified
    labels = tokens.copy()      # original ids are the prediction targets
    masked_indices = []
    for i in range(len(tokens)):
        # Never mask special tokens such as [CLS] and [SEP]
        if tokens[i] in [tokenizer.cls_token_id, tokenizer.sep_token_id]:
            continue
        prob = random.random()
        if prob < mask_prob:
            masked_indices.append(i)
            if prob < mask_prob * 0.8:
                # 80% of the time, replace with [MASK]
                tokens[i] = tokenizer.mask_token_id
            elif prob < mask_prob * 0.9:
                # 10% of the time, replace with a random token
                tokens[i] = random.randint(0, tokenizer.vocab_size - 1)
            # Remaining 10% of the time, keep the original token (but still predict it)
    return torch.tensor(tokens), torch.tensor(labels), masked_indices

# Example
# tokens = tokenizer.encode("The cat sat on the mat")
# input_ids, labels, masked = create_masked_lm_example(tokens, tokenizer)
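To feed a BERT-style masked-LM loss, labels at positions that were not selected for masking are conventionally replaced with -100 so the loss is computed only where predictions are required (the same ignore_index convention as in the loss sketch earlier). A small helper for that, with a hypothetical name, might look as follows.

import torch

def keep_masked_labels(labels, masked_indices, ignore_index=-100):
    """Keep labels only at masked positions; set all other positions to ignore_index."""
    mlm_labels = torch.full_like(labels, ignore_index)
    for i in masked_indices:
        mlm_labels[i] = labels[i]
    return mlm_labels

# Usage (commented; requires a BERT-style tokenizer and model):
# input_ids, labels, masked = create_masked_lm_example(tokens, tokenizer)
# mlm_labels = keep_masked_labels(labels, masked)
# loss = model(input_ids.unsqueeze(0), labels=mlm_labels.unsqueeze(0)).loss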
Real-World Applications
Pre-training Enables Transfer Learning
Pre-trained models serve as a foundation for:
- Fine-tuning: Adapt to specific tasks such as classification, QA, and NER (see the sketch after this list)
- Few-shot learning: Perform tasks with minimal examples
- Zero-shot learning: Perform tasks without training examples
- Domain adaptation: Transfer to new domains with less data
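A minimal fine-tuning sketch using the Hugging Face transformers API; the model name is a common default, the datasets are placeholders, and the hyperparameters are illustrative rather than tuned.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a pre-trained encoder and attach a fresh classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# train_dataset / eval_dataset are placeholders for tokenized datasets
# args = TrainingArguments(output_dir="out", num_train_epochs=3,
#                          per_device_train_batch_size=16, learning_rate=2e-5)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()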
Pre-training Strategies in Practice
Different objectives for different goals:
- Autoregressive (GPT): Best for generation tasks
- Masked (BERT): Best for understanding tasks
- Hybrid approaches: Combine multiple objectives
- Instruction tuning: Fine-tune on instruction-following data
Scaling Laws
Research shows predictable relationships:
- Performance improves with model size (parameters)
- Performance improves with training data size
- Performance improves with compute budget
- Optimal ratios exist between these factors
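As a rough illustration of those ratios, the sketch below applies two widely used approximations: training compute of about 6 FLOPs per parameter per token, and the Chinchilla heuristic of roughly 20 training tokens per parameter for compute-optimal models. Both are rules of thumb, not exact laws.

# Rough scaling-law arithmetic (rules of thumb, not exact laws):
#   training compute  C ~ 6 * N * D  FLOPs
#   compute-optimal   D ~ 20 * N     tokens (Chinchilla heuristic)
def training_flops(num_params: float, num_tokens: float) -> float:
    return 6 * num_params * num_tokens

n_params = 70e9              # a 70B-parameter model
n_tokens = 20 * n_params     # ~1.4e12 tokens for compute-optimal training
print(f"tokens:  {n_tokens:.2e}")
print(f"compute: {training_flops(n_params, n_tokens):.2e} FLOPs")   # ~5.9e23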