Chapter 5: Fine-tuning Strategies

Adapting Pre-trained Models

Learning Objectives

  • Understand fine-tuning strategies fundamentals
  • Master the mathematical foundations
  • Learn practical implementation
  • Apply knowledge through examples
  • Recognize real-world applications

Fine-tuning Strategies

Introduction

This chapter covers the main strategies for adapting a pre-trained model to new tasks: full, partial, and parameter-efficient fine-tuning, instruction tuning, and multi-task fine-tuning, with mathematical formulations, code implementations, and real-world examples.

📚 Why This Matters

Most practical AI systems are built by adapting a pre-trained model rather than training one from scratch, so understanding fine-tuning strategies is crucial for working with modern AI systems. This chapter breaks the key concepts down into digestible explanations with step-by-step examples.

Key Concepts

Fine-tuning Strategies

Full fine-tuning: Update all model parameters. Most powerful but requires most memory and compute.

Partial fine-tuning: Freeze early layers, only train later layers. Reduces memory requirements while maintaining most performance.

Parameter-efficient fine-tuning: Train only a small subset of parameters (LoRA, adapters, prompt tuning). Very efficient; can often run on a single GPU.

Instruction Tuning

What it is: Fine-tuning on diverse tasks formatted as instructions. Teaches model to follow instructions and generalize to new tasks.

Example format:

  • Input: "Translate to French: Hello"
  • Output: "Bonjour"

Benefits: Model becomes better at following prompts, can handle diverse tasks, shows improved few-shot performance.

Multi-task Fine-tuning

Training on multiple tasks simultaneously:

  • Combines data from different tasks
  • Model learns to handle diverse scenarios
  • Better generalization than single-task fine-tuning
  • Requires careful task balancing

Mathematical Formulations

Fine-tuning Objective

\[L_{\text{ft}} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i, \theta_0 + \Delta\theta)\]
Where:
  • \(\theta_0\): Pre-trained parameters (frozen or partially frozen)
  • \(\Delta\theta\): Parameter updates (small compared to \(\theta_0\))
  • \(x_i, y_i\): Task-specific input-output pairs
  • Much smaller dataset than pre-training
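As a minimal, concrete sketch of this objective, the snippet below performs a single gradient step of the causal language-modeling loss on one task-specific example, starting from pre-trained GPT-2 weights. The example text and the learning rate of 5e-5 are illustrative choices, not prescriptions.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Start from the pre-trained parameters theta_0
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # small lr keeps Delta-theta small

# One task-specific (x_i, y_i) pair, formatted as a single text sequence
batch = tokenizer("Classify sentiment: Great movie! -> Positive", return_tensors="pt")

# Passing labels=input_ids makes the model compute -log P(y_i | x_i, theta) internally
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"Fine-tuning loss: {outputs.loss.item():.4f}")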

LoRA Decomposition

\[W' = W_0 + \Delta W = W_0 + BA\]
Where:
  • \(W_0\): Original weight matrix (frozen, d×d)
  • \(B\): Trainable matrix (d×r, rank r)
  • \(A\): Trainable matrix (r×d)
  • Only \(2dr\) parameters trained instead of \(d^2\)
  • Typical: r = 4-16, much smaller than d
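A minimal PyTorch sketch of this decomposition is shown below. The class name LoRALinear and the initialization choices (A small random, B zero, so that \(\Delta W = 0\) at the start of training) are illustrative, not a specific library's API.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes W'x = W_0 x + B A x, with W_0 frozen and only A, B trained."""
    def __init__(self, d: int, r: int = 8):
        super().__init__()
        self.W0 = nn.Linear(d, d, bias=False)
        self.W0.weight.requires_grad = False              # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)   # r x d
        self.B = nn.Parameter(torch.zeros(d, r))          # d x r, zero init => Delta W = 0 initially

    def forward(self, x):
        return self.W0(x) + x @ self.A.T @ self.B.T       # W_0 x + B(Ax)

layer = LoRALinear(d=768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2*d*r = 12,288 trainable parameters vs d^2 = 589,824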

Multi-task Loss

\[L = \sum_{t=1}^{T} \alpha_t L_t\]
Where:
  • \(T\): Number of tasks
  • \(L_t\): Loss for task t
  • \(\alpha_t\): Task weight (balances importance)
  • Allows training on multiple tasks simultaneously
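The sketch below shows one way this weighted sum might be computed in a training step. The task names and weights are illustrative, and the per-task losses are assumed to have been computed already on each task's sub-batch.

import torch

# Illustrative per-task weights alpha_t (often tuned to balance task importance or dataset size)
task_weights = {"translation": 1.0, "summarization": 0.5, "sentiment": 0.5}

def multi_task_loss(per_task_losses):
    """L = sum_t alpha_t * L_t over the tasks present in this batch."""
    return sum(task_weights[t] * loss for t, loss in per_task_losses.items())

# Example: per-task losses for the tasks in the current batch
losses = {"translation": torch.tensor(2.3), "sentiment": torch.tensor(0.7)}
print(multi_task_loss(losses))  # 1.0*2.3 + 0.5*0.7 = 2.65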

Detailed Examples

Example: Full vs LoRA Fine-tuning

Task: Fine-tune GPT-2 for sentiment analysis

Full fine-tuning:

  • Trainable parameters: 124M (all GPT-2 parameters)
  • Memory: ~2GB per batch
  • Training time: ~2 hours on GPU
  • Performance: Best possible

LoRA fine-tuning (r=8):

  • Trainable parameters: ~1M (LoRA matrices only)
  • Memory: ~500MB per batch
  • Training time: ~30 minutes on GPU
  • Performance: ~95% of full fine-tuning
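A comparison along these lines can be set up with the Hugging Face peft library; the sketch below assumes peft is installed, and the choices of target_modules ("c_attn", GPT-2's fused attention projection) and r=8 are illustrative settings rather than the only reasonable ones.

from transformers import GPT2ForSequenceClassification
from peft import LoraConfig, get_peft_model

base = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# LoRA with rank r=8 applied to the attention projection matrices only
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="SEQ_CLS")
model = get_peft_model(base, config)

# Reports trainable vs. total parameters: a small fraction of GPT-2's 124M
model.print_trainable_parameters()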

Example: Instruction Tuning

Training examples:

  • "Translate to French: Hello" → "Bonjour"
  • "Summarize: [long text]" → "[summary]"
  • "Classify sentiment: Great movie!" → "Positive"
  • "Answer: What is AI?" → "AI is..."

Result: Model learns to follow instructions and can generalize to new instruction-formatted tasks.

Implementation

Partial Fine-tuning (Freeze Early Layers)

from transformers import GPT2ForSequenceClassification

# Load model
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# Freeze early layers (first 6 out of 12)
for i in range(6):
    for param in model.transformer.h[i].parameters():
        param.requires_grad = False

# Later layers remain trainable
# Only train: layers 6-11 + classification head

# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable}, Total: {total}, Ratio: {trainable/total:.2%}")

Instruction Tuning Setup

# Instruction tuning data format
instruction_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello",
        "output": "Bonjour"
    },
    {
        "instruction": "Summarize",
        "input": "Long article text...",
        "output": "Summary text..."
    },
    {
        "instruction": "Classify sentiment",
        "input": "Great movie!",
        "output": "Positive"
    }
]

def format_instruction(example):
    """Format instruction for training"""
    prompt = f"{example['instruction']}: {example['input']}"
    target = example['output']
    return prompt, target

# Use with standard language modeling loss
# Model learns to generate target given instruction+input
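
One common way to wire this format into a standard causal language-modeling loss is to concatenate prompt and target and mask the prompt tokens with the label value -100 (PyTorch's ignore index), so the loss is computed only on the target. This is a sketch building on the data and helper above; the newline separator and the trailing EOS token are illustrative choices.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_training_example(example):
    """Tokenize prompt + target; mask prompt tokens so the loss covers only the target."""
    prompt, target = format_instruction(example)
    prompt_ids = tokenizer(prompt + "\n", add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids     # -100 tokens are ignored by the loss
    return {"input_ids": input_ids, "labels": labels}

print(build_training_example(instruction_data[0]))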

Real-World Applications

Fine-tuning Applications

Domain-specific models:

  • Medical LLMs: Fine-tuned on medical literature
  • Legal LLMs: Fine-tuned on legal documents
  • Code models: Fine-tuned on code repositories
  • Customer service: Fine-tuned on support tickets

Task-specific models:

  • Sentiment analysis for product reviews
  • Named entity recognition for information extraction
  • Question answering for knowledge bases
  • Text classification for content moderation

Instruction-tuned Models

Models such as ChatGPT and Claude rely on instruction tuning:

  • Better at following user instructions
  • More helpful and aligned with human intent
  • Can handle diverse tasks without task-specific fine-tuning
  • Show improved safety and reduced harmful outputs

Test Your Understanding

Question 1: What is the main purpose of fine-tuning a pre-trained LLM?

A) To adapt the pre-trained model to a specific task or domain by updating its parameters on task-specific data, leveraging the general knowledge learned during pre-training
B) To retrain the model from scratch
C) To reduce the model size
D) To change the model architecture

Question 2: What is the difference between full fine-tuning and parameter-efficient fine-tuning?

A) Full fine-tuning updates all model parameters and requires more resources, while parameter-efficient methods (like LoRA) only train a small subset of parameters, making it more memory and compute efficient
B) They are identical approaches
C) Full fine-tuning is faster than parameter-efficient methods
D) Parameter-efficient methods require more memory

Question 3: Why are lower learning rates typically used for fine-tuning compared to training from scratch?

A) Pre-trained weights are already good, and high learning rates can destroy this pre-trained knowledge, so smaller updates (1e-5 to 1e-3) preserve the learned representations while adapting to the new task
B) Lower learning rates make training faster
C) Higher learning rates are always better
D) Learning rate doesn't matter for fine-tuning

Question 4: What is catastrophic forgetting in the context of fine-tuning?

A) The phenomenon where fine-tuning on a new task causes the model to forget what it learned during pre-training, which can be mitigated with lower learning rates, freezing early layers, and regularization
B) A technique to improve fine-tuning performance
C) A method to reduce model size
D) A type of optimization algorithm

Question 5: What is instruction tuning?

A) Fine-tuning on diverse tasks formatted as instructions, which teaches the model to follow instructions and generalize to new instruction-formatted tasks
B) A method to reduce model parameters
C) A technique for generating instructions
D) A way to speed up training

Question 6: What is the mathematical formulation for fine-tuning loss?

A) \(L_{\text{ft}} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i, \theta_0 + \Delta\theta)\) where \(\theta_0\) are pre-trained parameters and \(\Delta\theta\) are small updates
B) \(L = \sum_{i=1}^{N} y_i\)
C) \(L = \max_i P(y_i | x_i)\)
D) \(L = \frac{1}{N} \sum_{i=1}^{N} x_i\)

Question 7: What is multi-task fine-tuning?

A) Training on multiple tasks simultaneously by combining data from different tasks, which helps the model learn to handle diverse scenarios and improves generalization
B) Training on one task at a time sequentially
C) Using multiple models for one task
D) A method to reduce training time

Question 8: When should you use fine-tuning versus prompting?

A) Use fine-tuning when you have task-specific labeled data and need high performance on a specific domain, while use prompting when you have limited data and need quick iteration
B) Always use fine-tuning, never prompting
C) Always use prompting, never fine-tuning
D) They are interchangeable

Question 9: What is partial fine-tuning?

A) Freezing early layers and only training later layers plus the task-specific head, which reduces memory requirements while maintaining most performance
B) Training only the first layer
C) Training only the output layer
D) Training with partial data

Question 10: What are some common fine-tuning use cases?

A) Domain-specific applications (medical, legal, code), task-specific models (sentiment analysis, NER, QA), and instruction-tuned models for better instruction following
B) Only image classification
C) Only speech recognition
D) Only data preprocessing

Question 11: What is the learning rate schedule formula used in fine-tuning?

A) A warmup phase that gradually increases the learning rate, \(\text{lr}(t) = \text{lr}_{\text{max}} \times \frac{t}{T_{\text{warmup}}}\) for \(t \le T_{\text{warmup}}\), followed by linear decay
B) Constant learning rate throughout
C) Exponential increase
D) Random learning rate

Question 12: What is the multi-task loss formulation?

A) \(L = \sum_{t=1}^{T} \alpha_t L_t\) where \(T\) is the number of tasks, \(L_t\) is the loss for task t, and \(\alpha_t\) is the task weight
B) \(L = \max_t L_t\)
C) \(L = \frac{1}{T} \sum_{t=1}^{T} L_t\) (equal weights only)
D) \(L = \prod_{t=1}^{T} L_t\)