Chapter 2: Pre-training Strategies

Learning from Unlabeled Data

Learning Objectives

  • Understand the main pre-training objectives: autoregressive and masked language modeling
  • Master the mathematical formulations behind these objectives
  • Learn how pre-training data and training examples are prepared in practice
  • Apply the concepts through step-by-step worked examples
  • Recognize real-world applications such as transfer learning and few- and zero-shot use

Introduction

This chapter provides comprehensive coverage of pre-training strategies, including detailed explanations, mathematical formulations, code implementations, and real-world examples.

📚 Why This Matters

Pre-training on unlabeled text is where large language models acquire most of their knowledge of language and the world, and it is the foundation for the fine-tuning, few-shot, and zero-shot applications covered later in this chapter. The sections below break these concepts into digestible explanations with step-by-step examples.

Key Concepts

Pre-training Objectives

Autoregressive Language Modeling (GPT-style):

  • Predict next token given previous tokens
  • Unidirectional (left-to-right)
  • Enables text generation
  • Training: "The cat sat" → predict "on"
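
As a quick illustration, the short sketch below asks a pre-trained GPT-2 model for its most likely next token. It assumes the Hugging Face transformers library is installed and is meant purely as an illustration of next-token prediction, not a training recipe.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative sketch: query a pre-trained GPT-2 for the most likely next token.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat sat", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, seq_len, vocab_size)

next_token_id = logits[0, -1].argmax().item()    # greedy choice for the next token
print(tokenizer.decode([next_token_id]))         # likely something like " on"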

Masked Language Modeling (BERT-style):

  • Predict masked tokens using bidirectional context
  • Bidirectional understanding
  • Better for understanding tasks
  • Training: "The [MASK] sat" → predict "cat"

Data Requirements

Scale matters:

  • GPT-3: Trained on ~300B tokens, drawn from a corpus of roughly 500B tokens
  • LLaMA: Trained on ~1.4T tokens
  • More data generally leads to better performance
  • Quality is as important as quantity

Data sources:

  • Web text (Common Crawl)
  • Books and literature
  • Wikipedia and encyclopedias
  • Code repositories
  • Scientific papers

Training Challenges

Computational requirements:

  • GPT-3: Months on thousands of GPUs
  • Memory: Models require 100s of GB
  • Cost: Millions of dollars in compute
  • Infrastructure: Distributed training across data centers
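
To make the memory figures above concrete, the back-of-envelope sketch below estimates the memory needed for a 175B-parameter model; the bytes-per-parameter values are common rules of thumb for mixed-precision Adam training, not figures from any specific system.

# Back-of-envelope memory estimate for a GPT-3-sized model (175B parameters).
num_params = 175e9

weights_fp16 = num_params * 2                    # fp16 weights: 2 bytes per parameter
grads_fp16 = num_params * 2                      # fp16 gradients: 2 bytes per parameter
adam_states_fp32 = num_params * (4 + 4 + 4)      # fp32 master weights + two Adam moments

total_bytes = weights_fp16 + grads_fp16 + adam_states_fp32
print(f"Weights alone:  {weights_fp16 / 1e9:,.0f} GB")     # ~350 GB
print(f"Training state: {total_bytes / 1e9:,.0f} GB")      # ~2,800 GB before activations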

Mathematical Formulations

Autoregressive Language Modeling

\[P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i | x_1, \ldots, x_{i-1}, \theta)\]
Where:
  • \(x_i\): Token at position i
  • \(\theta\): Model parameters
  • Model predicts probability of each token given previous context
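
The sketch below makes this factorization concrete with a toy vocabulary and random logits standing in for a real model: the sequence probability is a product of per-token conditionals, which in practice is computed as a sum of log-probabilities.

import torch
import torch.nn.functional as F

# Toy illustration of the autoregressive factorization. The random logits stand in
# for a model's outputs; row i plays the role of P(x_i | x_1, ..., x_{i-1}).
vocab_size, seq_len = 10, 5
token_ids = torch.tensor([3, 1, 4, 1, 5])        # placeholder token ids
logits = torch.randn(seq_len, vocab_size)        # stand-in for model outputs

log_probs = F.log_softmax(logits, dim=-1)        # per-position log-distributions
seq_log_prob = log_probs[torch.arange(seq_len), token_ids].sum()
print(f"log P(sequence) = {seq_log_prob.item():.3f}")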

Masked Language Modeling

\[L = -\sum_{i \in M} \log P(x_i | x_{\backslash M}, \theta)\]
Where:
  • \(M\): Set of masked token positions
  • \(x_{\backslash M}\): All tokens except masked ones
  • Model predicts masked tokens using bidirectional context
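
In PyTorch, this sum over masked positions is usually implemented by labeling every non-masked position with -100, which cross-entropy ignores. A minimal sketch with placeholder tensors:

import torch
import torch.nn.functional as F

# Minimal sketch of the MLM loss: only positions whose label is not -100
# (the masked positions) contribute to the cross-entropy.
vocab_size, seq_len = 10, 6
logits = torch.randn(seq_len, vocab_size)                 # stand-in for model outputs
labels = torch.tensor([-100, 4, -100, -100, 7, -100])     # positions 1 and 4 are masked

loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(f"MLM loss over masked positions: {loss.item():.3f}")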

Next Sentence Prediction (BERT)

\[P(\text{IsNext} | \text{Sentence}_A, \text{Sentence}_B)\]

Binary classification task: predict if Sentence_B follows Sentence_A. Helps model understand sentence relationships.
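
The Hugging Face transformers library exposes BERT's NSP head directly; the sketch below scores one sentence pair and is an illustration rather than part of a pre-training pipeline. In this implementation, class index 0 means "Sentence_B follows Sentence_A".

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# Illustrative sketch: score a sentence pair with BERT's next-sentence-prediction head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

sentence_a = "The cat sat on the mat."
sentence_b = "It fell asleep in the sun."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, 2)
probs = torch.softmax(logits, dim=-1)
print(f"P(IsNext) = {probs[0, 0].item():.3f}")   # index 0 = "B follows A"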

Detailed Examples

Example: Autoregressive Pre-training

Training sequence: "The cat sat on the mat"

Training examples created:

  • Context: "The" → Target: "cat"
  • Context: "The cat" → Target: "sat"
  • Context: "The cat sat" → Target: "on"
  • Context: "The cat sat on" → Target: "the"
  • Context: "The cat sat on the" → Target: "mat"

Model learns: Given any context, predict the most likely next token. This builds understanding of language patterns, grammar, and semantics.
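
This sliding construction is easy to reproduce; the short sketch below builds the same (context, target) pairs from a whitespace tokenization of the sentence (a real pipeline would operate on subword tokens rather than words).

# Build (context, target) next-token pairs from a toy whitespace tokenization.
tokens = "The cat sat on the mat".split()

pairs = [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(f"Context: {context!r} -> Target: {target!r}")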

Example: Masked Language Modeling

Original: "The cat sat on the mat"

Masked version: "The [MASK] sat on the mat"

Model task: Predict what [MASK] should be

Model sees: All tokens except the masked one (bidirectional context)

Prediction: P("cat") = 0.9, P("dog") = 0.05, P("bird") = 0.03, ...

Learning: Model learns to use context from both directions to understand word meaning and relationships.
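
A pre-trained BERT model produces exactly this kind of distribution. The sketch below uses the Hugging Face fill-mask pipeline to inspect its top predictions; the exact scores depend on the model and will differ from the illustrative numbers above.

from transformers import pipeline

# Illustrative sketch: inspect a pre-trained BERT model's top predictions for [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The [MASK] sat on the mat."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")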

Implementation

Pre-training Data Preparation

import torch
from torch.utils.data import Dataset, DataLoader

class LanguageModelingDataset(Dataset):
    """Dataset for autoregressive language modeling"""
    
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        
        # Tokenize to a fixed length (truncate long texts, pad short ones)
        tokens = self.tokenizer.encode(text, max_length=self.max_length,
                                       truncation=True, padding='max_length')
        
        # Create input and target (shifted by one position for next-token prediction)
        input_ids = torch.tensor(tokens[:-1])
        labels = torch.tensor(tokens[1:])
        
        # Mask padding positions so the loss ignores them (ignore_index=-100)
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100
        
        return input_ids, labels

# Example usage
texts = [
    "The cat sat on the mat.",
    "Machine learning is fascinating.",
    "Transformers revolutionized NLP."
]

# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
# dataset = LanguageModelingDataset(texts, tokenizer)
# dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
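
Once batches of (input_ids, labels) are available, one optimization step follows the standard causal language modeling pattern. The sketch below reuses the texts list and dataset class defined above; it loads a pre-trained GPT-2 purely for illustration, whereas a real pre-training run would start from a randomly initialized model and add a learning-rate schedule, gradient clipping, and distributed training.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative single training step with the dataset defined above.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default

dataset = LanguageModelingDataset(texts, tokenizer, max_length=32)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in; pre-training starts from scratch
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

input_ids, labels = next(iter(dataloader))
logits = model(input_ids=input_ids).logits       # shape: (batch, seq_len, vocab_size)

# Labels were already shifted by the dataset, so the logits at position t predict
# labels[t]; padding positions carry the label -100 and are ignored by the loss.
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       labels.reshape(-1), ignore_index=-100)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"Next-token loss: {loss.item():.3f}")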

Masked Language Modeling Data Preparation

import random
import torch

def create_masked_lm_example(tokens, tokenizer, mask_prob=0.15):
    """
    Create a masked language modeling example using BERT-style 80/10/10 masking.
    Unmasked positions get the label -100 so that the loss ignores them.
    """
    tokens = list(tokens)            # work on a copy; do not modify the caller's list
    labels = [-100] * len(tokens)    # -100 is ignored by cross-entropy (ignore_index=-100)
    masked_indices = []
    
    for i in range(len(tokens)):
        # Never mask special tokens such as [CLS] and [SEP]
        if tokens[i] in [tokenizer.cls_token_id, tokenizer.sep_token_id]:
            continue
        
        prob = random.random()
        if prob < mask_prob:                      # select ~15% of positions for prediction
            masked_indices.append(i)
            labels[i] = tokens[i]                 # the model must recover the original token
            if prob < mask_prob * 0.8:            # 80% of selected: replace with [MASK]
                tokens[i] = tokenizer.mask_token_id
            elif prob < mask_prob * 0.9:          # 10% of selected: replace with a random token
                tokens[i] = random.randint(0, tokenizer.vocab_size - 1)
            # remaining 10% of selected: keep the original token (but still predict it)
    
    return torch.tensor(tokens), torch.tensor(labels), masked_indices

# Example (assumes a BERT-style tokenizer that defines cls/sep/mask token ids)
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# tokens = tokenizer.encode("The cat sat on the mat")
# input_ids, labels, masked = create_masked_lm_example(tokens, tokenizer)

Real-World Applications

Pre-training Enables Transfer Learning

Pre-trained models serve as foundation for:

  • Fine-tuning: Adapt to specific tasks (classification, QA, NER)
  • Few-shot learning: Perform tasks with minimal examples
  • Zero-shot learning: Perform tasks without training examples
  • Domain adaptation: Transfer to new domains with less data
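
A typical fine-tuning setup, sketched below with the Hugging Face transformers API, loads the pre-trained encoder and attaches a freshly initialized classification head; the two-label sentiment task and example text are assumptions for illustration.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative fine-tuning step: pre-trained encoder + new two-class head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["A wonderful, heartfelt film."], return_tensors="pt")
labels = torch.tensor([1])                       # assumed label scheme: 1 = positive

outputs = model(**batch, labels=labels)          # the head computes cross-entropy internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"Fine-tuning loss: {outputs.loss.item():.3f}")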

Pre-training Strategies in Practice

Different objectives for different goals:

  • Autoregressive (GPT): Best for generation tasks
  • Masked (BERT): Best for understanding tasks
  • Hybrid approaches: Combine multiple objectives
  • Instruction tuning: Fine-tune on instruction-following data

Scaling Laws

Research shows predictable relationships:

  • Performance improves with model size (parameters)
  • Performance improves with training data size
  • Performance improves with compute budget
  • Optimal ratios exist between these factors
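
One widely cited heuristic from this line of work (the Chinchilla study, Hoffmann et al. 2022) is that training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and that compute-optimal training uses roughly 20 tokens per parameter. The sketch below applies these approximations to an assumed compute budget; treat the constants as rules of thumb rather than exact values.

# Rough compute-optimal sizing from two common approximations:
#   training FLOPs:   C ≈ 6 * N * D   (N = parameters, D = training tokens)
#   compute-optimal:  D ≈ 20 * N      (Chinchilla-style rule of thumb)
compute_budget = 1e23                            # FLOPs; an assumed example budget

optimal_params = (compute_budget / (6 * 20)) ** 0.5   # solve C = 6 * N * (20 * N) for N
optimal_tokens = 20 * optimal_params

print(f"Parameters: ~{optimal_params / 1e9:.0f}B")    # ~29B parameters
print(f"Tokens:     ~{optimal_tokens / 1e9:.0f}B")    # ~577B tokens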

Test Your Understanding

Question 1: What is the main difference between Autoregressive Language Modeling (GPT-style) and Masked Language Modeling (BERT-style)?

A) Autoregressive is unidirectional (left-to-right) and enables generation, while Masked is bidirectional and better for understanding tasks
B) Autoregressive is bidirectional while Masked is unidirectional
C) They are identical approaches with different names
D) Autoregressive is for classification while Masked is for generation

Question 2: In Masked Language Modeling, what percentage of tokens are typically masked during training?

A) Approximately 15% of tokens
B) 50% of tokens
C) All tokens
D) Only 5% of tokens

Question 3: What is Next Sentence Prediction (NSP) used for in BERT pre-training?

A) To help the model understand sentence relationships and improve performance on tasks requiring sentence pair understanding
B) To generate new sentences
C) To classify individual sentences
D) To translate between languages

Question 4: What is the mathematical formulation for Autoregressive Language Modeling?

A) \(P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i | x_1, \ldots, x_{i-1}, \theta)\)
B) \(P(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} P(x_i | x_1, \ldots, x_{i-1})\)
C) \(P(x_1, x_2, \ldots, x_n) = \max_{i} P(x_i | x_1, \ldots, x_{i-1})\)
D) \(P(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} P(x_i)\)

Question 5: Approximately how many tokens was GPT-3 trained on?

A) ~300 billion tokens
B) ~50 billion tokens
C) ~5 trillion tokens
D) ~1 million tokens

Question 6: What are the main data sources used for pre-training large language models?

A) Web text (Common Crawl), books, Wikipedia, code repositories, and scientific papers
B) Only Wikipedia
C) Only social media posts
D) Only news articles

Question 7: In the Masked Language Modeling loss function, what does \(x_{\backslash M}\) represent?

A) All tokens except the masked ones
B) Only the masked tokens
C) The model parameters
D) The vocabulary size

Question 8: What is the primary computational challenge in pre-training large language models?

A) Massive memory requirements (100s of GB), months of training on thousands of GPUs, and millions of dollars in compute costs
B) Finding enough training data
C) Choosing the right learning rate
D) Implementing the loss function

Question 9: What does the scaling law research show about LLM performance?

A) Performance improves predictably with model size, training data size, and compute budget, with optimal ratios existing between these factors
B) Performance is random regardless of scale
C) Only model size matters, not data or compute
D) Smaller models always perform better

Question 10: In autoregressive pre-training, how are training examples created from a sequence like "The cat sat on the mat"?

A) Multiple examples are created where each token is predicted given all previous tokens (e.g., "The" → "cat", "The cat" → "sat", etc.)
B) Only the last token is predicted
C) All tokens are predicted simultaneously
D) Only the first token is used

Question 11: What is the primary advantage of pre-training for downstream tasks?

A) It enables transfer learning, allowing models to be fine-tuned on specific tasks with less data, and supports few-shot and zero-shot learning
B) It eliminates the need for any fine-tuning
C) It makes models smaller
D) It reduces training time for all tasks

Question 12: When creating masked language modeling examples, what happens to masked tokens 80% of the time?

A) They are replaced with the [MASK] token
B) They are left unchanged
C) They are deleted from the sequence
D) They are replaced with random tokens