Chapter 4: GPT Architecture

Understanding Decoder-Only Models

Learning Objectives

  • Understand the fundamentals of the GPT (decoder-only) architecture
  • Master the mathematical foundations of autoregressive generation and causal masking
  • Learn how to implement text generation and causal attention in practice
  • Apply the concepts through worked examples
  • Recognize real-world applications of GPT models

Introduction

This chapter provides comprehensive coverage of the GPT architecture, including detailed explanations, mathematical formulations, code implementations, and real-world examples.

📚 Why This Matters

Understanding the GPT architecture is crucial for working with modern AI systems: most widely used generative language models are decoder-only transformers. This chapter breaks the key concepts down into digestible, step-by-step explanations and examples.

Key Concepts

GPT Architecture Overview

GPT (Generative Pre-trained Transformer) is a decoder-only model:

  • Uses only decoder layers from transformer architecture
  • Autoregressive: generates text one token at a time
  • Causal masking: can only attend to previous tokens
  • Pre-trained on next token prediction

Key differences from BERT:

  • BERT: Bidirectional, understands context from both directions
  • GPT: Unidirectional, generates left-to-right
  • BERT: Better for understanding tasks
  • GPT: Better for generation tasks
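
To make the training-objective difference concrete, here is a minimal PyTorch sketch of the causal language-modeling (next-token prediction) loss. The tensor sizes and random logits are placeholders, not values from any real model; in practice the logits would come from a decoder-only transformer.

import torch
import torch.nn.functional as F

# Toy sizes; the logits stand in for the output of a decoder-only transformer
vocab_size, seq_len = 100, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # input sequence
logits = torch.randn(1, seq_len, vocab_size)              # placeholder model outputs

# Shift by one: position t is trained to predict token t+1
shift_logits = logits[:, :-1, :]
shift_labels = token_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1)
)
print(loss.item())  # average next-token prediction loss

BERT's masked-token objective would instead compute the loss only at randomly masked positions, using context from both directions.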

Scaling in GPT Models

GPT models have scaled dramatically:

  • GPT-1 (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • GPT-4 (2023): Parameter count not publicly disclosed (commonly estimated at 1T+)

With scale comes:

  • Better language understanding
  • Emergent capabilities (reasoning, few-shot learning)
  • More coherent and contextually appropriate generation
  • Ability to follow complex instructions

Generation Process

How GPT generates text:

  • Start with prompt/context
  • Process through transformer layers
  • Output probability distribution over vocabulary
  • Select the next token (greedy argmax, or sampling with temperature/top-k/top-p)
  • Append it to the sequence and repeat until a stop condition is reached (a minimal code sketch of this loop follows)
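
The sketch below walks through this loop by hand, using the Hugging Face GPT-2 model (the same model used in the implementation section) with greedy decoding; the higher-level model.generate call shown later performs a more featureful version of the same loop.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

# Greedy autoregressive loop: predict, append, repeat
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits                      # (1, seq_len, vocab_size)
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)       # append and continue
    if next_id.item() == tokenizer.eos_token_id:              # stop condition
        break

print(tokenizer.decode(input_ids[0]))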

Mathematical Formulations

Autoregressive Generation

\[P(x_{t+1} | x_1, \ldots, x_t) = \text{softmax}(W \cdot h_t)\]
Where:
  • \(x_1, \ldots, x_t\): Previous tokens
  • \(h_t\): Hidden state at position t (from transformer)
  • \(W\): Output projection matrix
  • Output: Probability distribution over vocabulary
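
As a tensor-level sketch of this projection (dimensions chosen to match GPT-2, purely for illustration):

import torch

hidden_dim, vocab_size = 768, 50257        # GPT-2-sized, for illustration only
h_t = torch.randn(hidden_dim)               # hidden state at the current position
W = torch.randn(vocab_size, hidden_dim)     # output projection matrix

logits = W @ h_t                            # one score per vocabulary entry
probs = torch.softmax(logits, dim=-1)       # probability distribution over the vocabulary
print(probs.sum())                          # ≈ 1.0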

Complete Sequence Probability

\[P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t | x_1, \ldots, x_{t-1})\]

The probability of a complete sequence is the product of conditional probabilities. Each token's probability depends on all previous tokens.
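
Because these conditional probabilities are typically small, implementations work in log space, summing log-probabilities rather than multiplying probabilities. A quick sketch with made-up conditional probabilities:

import math

# Hypothetical conditional probabilities P(x_t | x_1, ..., x_{t-1}) for a 4-token sequence
cond_probs = [0.20, 0.65, 0.90, 0.85]

log_prob = sum(math.log(p) for p in cond_probs)   # sum of log conditionals
seq_prob = math.exp(log_prob)                      # equals the product of the probabilities
print(seq_prob)                                    # ≈ 0.099, the product 0.20 * 0.65 * 0.90 * 0.85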

Causal Attention Mask

\[\text{Mask}[i, j] = \begin{cases} 1 & \text{if } j \leq i \\ 0 & \text{if } j > i \end{cases}\]

Position i can only attend to positions j where j ≤ i. This ensures the model cannot see future tokens during training or generation.

Detailed Examples

Example: GPT Generation Process

Prompt: "The capital of France is"

Step 1: Tokenization

  • Input → ["The", "capital", "of", "France", "is"] (simplified; GPT actually uses subword/BPE tokens)
  • Each token is converted to an embedding

Step 2: Forward Pass

  • Process through 12-96 transformer layers (depending on model)
  • Each layer refines the representation
  • Final hidden state for "is" position

Step 3: Prediction

  • Output projection → vocabulary probabilities
  • P("Paris") = 0.85, P("London") = 0.05, ...

Step 4: Sampling

  • Sample "Paris"
  • New sequence: "The capital of France is Paris"
  • Continue generating if needed
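
The probabilities above are illustrative. One way to inspect the actual next-token distribution is to query GPT-2 directly, as in the sketch below (the real probabilities and tokenization will differ somewhat from the worked numbers):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # (1, seq_len, vocab_size)

# The next-token distribution comes from the last position ("is")
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.3f}")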

Example: Causal Masking

Sequence: "The cat sat"

Attention patterns:

  • "The" can only attend to itself
  • "cat" can attend to "The" and itself
  • "sat" can attend to "The", "cat", and itself
  • Cannot see future tokens

This matches inference: during generation the model only sees previous tokens, so training with causal masking keeps training and generation consistent.
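
The mask for this three-token sequence is simply a 3×3 lower-triangular matrix, which can be checked in a couple of lines (the full masked-attention implementation appears in the next section):

import torch

# Causal mask for the 3-token sequence "The cat sat"
mask = torch.tril(torch.ones(3, 3))
print(mask)
# tensor([[1., 0., 0.],    "The" attends to itself only
#         [1., 1., 0.],    "cat" attends to "The" and itself
#         [1., 1., 1.]])   "sat" attends to all three tokens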

Implementation

GPT Text Generation

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def generate_text(prompt, max_length=50, temperature=0.7):
    """
    Generate text using GPT model
    """
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt")
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example
prompt = "The capital of France is"
result = generate_text(prompt)
print(result)  # e.g., "The capital of France is Paris ..." (sampled output varies per run)
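
Because do_sample=True, the output varies from run to run; lowering the temperature (or setting do_sample=False for greedy decoding) makes it more deterministic, while top_k and top_p restrict sampling to the most probable tokens.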

Implementing Causal Mask

import torch
import torch.nn.functional as F

def create_causal_mask(seq_len):
    """
    Create a causal attention mask of shape (seq_len, seq_len):
    1 where j <= i (allowed), 0 where j > i (masked).
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

def masked_attention(Q, K, V, mask):
    """
    Scaled dot-product attention with a causal mask.
    Q, K, V: (batch, heads, seq_len, head_dim); mask: (seq_len, seq_len).
    """
    # Compute scaled attention scores: (batch, heads, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    
    # Apply mask (set future positions to -inf so softmax gives them zero weight)
    mask = mask.unsqueeze(0).unsqueeze(0)  # Add batch and head dimensions
    scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Normalize over the key dimension
    attention_weights = F.softmax(scores, dim=-1)
    
    # Weighted sum of values: (batch, heads, seq_len, head_dim)
    output = torch.matmul(attention_weights, V)
    return output

# Example
seq_len = 5
Q = torch.randn(1, 1, seq_len, 64)  # (batch, heads, seq, dim)
K = torch.randn(1, 1, seq_len, 64)
V = torch.randn(1, 1, seq_len, 64)
mask = create_causal_mask(seq_len)
output = masked_attention(Q, K, V, mask)
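
In this sketch, the softmax turns every -inf score into an attention weight of exactly zero, so position i places no weight on any position j > i, while each row of attention_weights still sums to 1 over the allowed (past and current) positions.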

Real-World Applications

GPT Applications

Text Generation:

  • Chatbots and conversational AI (ChatGPT)
  • Content creation (articles, stories, marketing)
  • Creative writing assistance
  • Email and document drafting

Code Generation:

  • GitHub Copilot (code completion)
  • Code generation from natural language
  • Code explanation and documentation
  • Bug fixing and refactoring suggestions

Specialized Tasks:

  • Text summarization
  • Translation
  • Question answering
  • Data extraction and formatting

GPT vs BERT Use Cases

Use GPT when:

  • You need to generate new text
  • Task requires creativity or continuation
  • You want conversational responses
  • Task benefits from autoregressive generation

Use BERT when:

  • You need to understand/classify existing text
  • Task requires bidirectional context
  • You're extracting information, not generating
  • Task is classification or understanding

Test Your Understanding

Question 1: What is the key architectural characteristic of GPT models?

A) GPT uses decoder-only transformer layers with causal (unidirectional) attention, enabling autoregressive text generation
B) GPT uses encoder-only layers with bidirectional attention
C) GPT uses both encoder and decoder layers
D) GPT uses only feedforward layers

Question 2: What is causal masking in GPT?

A) A mechanism that prevents tokens from attending to future tokens, ensuring each position can only see previous positions (j ≤ i)
B) A method to hide model parameters
C) A technique to mask input tokens randomly
D) A way to reduce model size

Question 3: How does GPT generate text?

A) Autoregressively, one token at a time, where each new token is predicted based on all previous tokens, and the process continues until a stop condition
B) All tokens are generated simultaneously
C) Tokens are generated from right to left
D) Tokens are generated randomly

Question 4: What is the mathematical formulation for autoregressive generation in GPT?

A) \(P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t | x_1, \ldots, x_{t-1})\)
B) \(P(x_1, \ldots, x_n) = \sum_{t=1}^{n} P(x_t)\)
C) \(P(x_1, \ldots, x_n) = \max_t P(x_t)\)
D) \(P(x_1, \ldots, x_n) = \frac{1}{n} \sum_{t=1}^{n} P(x_t)\)

Question 5: How has GPT scaled from GPT-1 to GPT-4?

A) From 117M parameters (GPT-1) to an estimated 1T+ parameters (GPT-4), with dramatic improvements in capabilities and emergent abilities
B) The model size has remained constant
C) GPT models have gotten smaller over time
D) Only the training data has changed, not the model size

Question 6: What is the primary advantage of GPT's autoregressive architecture?

A) It enables natural text generation where each token is coherently built upon previous context, making it ideal for creative writing, code generation, and conversational AI
B) It allows bidirectional understanding
C) It processes all tokens simultaneously
D) It requires less memory

Question 7: What happens during GPT's forward pass for text generation?

A) The prompt is processed through transformer layers, producing a probability distribution over the vocabulary for the next token, which is then sampled and appended to the sequence
B) All possible sequences are generated simultaneously
C) Only the first token is generated
D) Tokens are generated in reverse order

Question 8: What is the key difference between GPT and BERT in terms of their training objectives?

A) GPT is trained on next token prediction (autoregressive), while BERT is trained on masked token prediction (bidirectional)
B) They use identical training objectives
C) GPT is trained on classification while BERT is trained on generation
D) GPT uses masked language modeling while BERT uses next token prediction

Question 9: What are some key applications of GPT models?

A) Chatbots and conversational AI, content creation, code generation (like GitHub Copilot), text summarization, and creative writing
B) Only image classification
C) Only speech recognition
D) Only data analysis

Question 10: What does the causal attention mask ensure during GPT training?

A) That the model cannot see future tokens during training, which matches the inference scenario where only previous tokens are available
B) That all tokens can see all other tokens
C) That tokens are randomly masked
D) That only the first token is visible

Question 11: What is the mathematical expression for the probability of the next token in GPT?

A) \(P(x_{t+1} | x_1, \ldots, x_t) = \text{softmax}(W \cdot h_t)\) where \(h_t\) is the hidden state from the transformer
B) \(P(x_{t+1}) = \text{random()}\)
C) \(P(x_{t+1} | x_1, \ldots, x_t) = \text{constant}\)
D) \(P(x_{t+1}) = \frac{1}{\text{vocab\_size}}\)

Question 12: What are some emergent capabilities that appear in larger GPT models like GPT-3 and GPT-4?

A) Few-shot learning, chain-of-thought reasoning, instruction following, code understanding, and improved generalization to new tasks
B) Only text generation
C) Only classification tasks
D) No new capabilities emerge with scale