Chapter 5: Fine-tuning Strategies
Adapting Pre-trained Models
Learning Objectives
- Understand the fundamentals of fine-tuning strategies
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of fine-tuning strategies, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
📚 Why This Matters
Understanding fine-tuning strategies is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
Fine-tuning Strategies
Full fine-tuning: Update all model parameters. The most powerful option, but it requires the most memory and compute.
Partial fine-tuning: Freeze the early layers and train only the later layers. Reduces memory requirements while retaining most of the performance.
Parameter-efficient fine-tuning: Train only a small subset of parameters (LoRA, adapters, prompt tuning). Very efficient; can run on a single GPU.
Instruction Tuning
What it is: Fine-tuning on diverse tasks formatted as instructions. This teaches the model to follow instructions and generalize to new tasks.
Example format:
- Input: "Translate to French: Hello"
- Output: "Bonjour"
Benefits: The model becomes better at following prompts, can handle diverse tasks, and shows improved few-shot performance.
Multi-task Fine-tuning
Training on multiple tasks simultaneously:
- Combines data from different tasks
- Model learns to handle diverse scenarios
- Better generalization than single-task fine-tuning
- Requires careful task balancing
Mathematical Formulations
Fine-tuning Objective
\[
\min_{\Delta\theta} \sum_{i=1}^{N} L\big(f(x_i;\, \theta_0 + \Delta\theta),\, y_i\big)
\]
Where:
- \(\theta_0\): Pre-trained parameters (frozen or partially frozen)
- \(\Delta\theta\): Parameter updates (small compared to \(\theta_0\))
- \(x_i, y_i\): Task-specific input-output pairs
- \(N\): Number of task-specific examples; the fine-tuning dataset is much smaller than the pre-training corpus
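To make the notation concrete, here is a toy PyTorch sketch in which \(\theta_0\) stays frozen and only \(\Delta\theta\) receives gradients. The dimensions and the MSE loss are placeholders for illustration; \(L\) can be any task loss:

```python
import torch
import torch.nn as nn

# Toy illustration: theta_0 is frozen, only delta_theta is optimized
theta_0 = torch.randn(16, 16)                     # pre-trained weights (frozen)
delta_theta = nn.Parameter(torch.zeros(16, 16))   # update, initialized at zero

def f(x):
    # effective weights are theta_0 + delta_theta
    return x @ (theta_0 + delta_theta).T

optimizer = torch.optim.AdamW([delta_theta], lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 16)     # task-specific pairs (x_i, y_i)
loss = nn.functional.mse_loss(f(x), y)            # L(f(x; theta_0 + delta), y)
loss.backward()                                   # gradients flow only to delta_theta
optimizer.step()
```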
LoRA Decomposition
\[
W = W_0 + \Delta W = W_0 + BA
\]
Where:
- \(W_0\): Original weight matrix (frozen, d×d)
- \(B\): Trainable matrix (d×r, rank r)
- \(A\): Trainable matrix (r×d)
- Only \(2dr\) parameters trained instead of \(d^2\)
- Typical: \(r = 4\) to \(16\), much smaller than \(d\)
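A minimal PyTorch sketch of the decomposition; the class name and initialization scale are ours, and libraries such as PEFT additionally scale \(BA\) by a factor \(\alpha/r\):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W_0 plus a trainable low-rank update B @ A."""
    def __init__(self, d: int, r: int = 8):
        super().__init__()
        self.W0 = nn.Linear(d, d, bias=False)
        self.W0.weight.requires_grad = False             # W_0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # r x d
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, zero-init:
        # BA = 0 at the start, so training begins at the pre-trained weights

    def forward(self, x):
        return self.W0(x) + x @ (self.B @ self.A).T

layer = LoRALinear(d=768, r=8)  # 2*768*8 = 12,288 trainable vs 768^2 = 589,824
```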
Multi-task Loss
\[
L = \sum_{t=1}^{T} \alpha_t L_t
\]
Where:
- \(T\): Number of tasks
- \(L_t\): Loss for task t
- \(\alpha_t\): Task weight (balances importance)
- Allows training on multiple tasks simultaneously
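A minimal sketch of the weighted sum; the task names, loss values, and weights below are placeholders:

```python
import torch

# Per-task losses from one training step (placeholder values)
task_losses = {"translation": torch.tensor(2.1),
               "summarization": torch.tensor(1.4),
               "sentiment": torch.tensor(0.6)}
# Task weights alpha_t, chosen to balance task importance
alpha = {"translation": 0.5, "summarization": 0.3, "sentiment": 0.2}

total_loss = sum(alpha[t] * task_losses[t] for t in task_losses)
# total_loss.backward() would then update the shared parameters
```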
Detailed Examples
Example: Full vs LoRA Fine-tuning
Task: Fine-tune GPT-2 for sentiment analysis
Full fine-tuning:
- Trainable parameters: 124M (all GPT-2 parameters)
- Memory: ~2GB per batch
- Training time: ~2 hours on GPU
- Performance: the strongest of the options; serves as the reference baseline
LoRA fine-tuning (r=8):
- Trainable parameters: ~1M (LoRA matrices only)
- Memory: ~500MB per batch
- Training time: ~30 minutes on GPU
- Performance: ~95% of full fine-tuning
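In practice, a LoRA run like this is commonly configured with Hugging Face's PEFT library. A sketch, assuming peft is installed; the hyperparameter values are illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
config = LoraConfig(
    r=8,                        # rank of the update matrices
    lora_alpha=16,              # scaling factor applied to BA
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs total counts
```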
Example: Instruction Tuning
Training examples:
- "Translate to French: Hello" → "Bonjour"
- "Summarize: [long text]" → "[summary]"
- "Classify sentiment: Great movie!" → "Positive"
- "Answer: What is AI?" → "AI is..."
Result: Model learns to follow instructions and can generalize to new instruction-formatted tasks.
Implementation
Partial Fine-tuning (Freeze Early Layers)
```python
from transformers import GPT2ForSequenceClassification

# Load GPT-2 with a two-class classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# Freeze the early layers (first 6 of GPT-2's 12 transformer blocks)
for i in range(6):
    for param in model.transformer.h[i].parameters():
        param.requires_grad = False

# Layers 6-11 and the classification head remain trainable

# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,}, Total: {total:,}, Ratio: {trainable/total:.2%}")
```
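When training this partially frozen model, pass only the trainable parameters to the optimizer so no optimizer state is allocated for frozen weights. A minimal sketch; the learning rate is illustrative:

```python
import torch

# Only parameters with requires_grad=True are updated
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
```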
Instruction Tuning Setup
```python
# Instruction tuning data format
instruction_data = [
    {
        "instruction": "Translate to French",
        "input": "Hello",
        "output": "Bonjour",
    },
    {
        "instruction": "Summarize",
        "input": "Long article text...",
        "output": "Summary text...",
    },
    {
        "instruction": "Classify sentiment",
        "input": "Great movie!",
        "output": "Positive",
    },
]

def format_instruction(example):
    """Format an instruction example into a (prompt, target) pair."""
    prompt = f"{example['instruction']}: {example['input']}"
    target = example["output"]
    return prompt, target

# Train with the standard language modeling loss: the model learns to
# generate the target given instruction + input.
```
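One detail the comment above glosses over: with a decoder-only model, the loss is usually masked so that only the target tokens contribute. A sketch building on format_instruction; the helper name is ours, and -100 is the label index PyTorch's cross-entropy ignores:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_training_example(example, tokenizer):
    """Tokenize prompt + target, masking prompt tokens out of the loss."""
    prompt, target = format_instruction(example)
    prompt_ids = tokenizer(prompt + " ").input_ids
    target_ids = tokenizer(target + tokenizer.eos_token).input_ids
    input_ids = prompt_ids + target_ids
    # -100 is ignored by the cross-entropy loss, so only target tokens
    # contribute to the gradient
    labels = [-100] * len(prompt_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}

batch = build_training_example(instruction_data[0], tokenizer)
```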
Real-World Applications
Fine-tuning Applications
Domain-specific models:
- Medical LLMs: Fine-tuned on medical literature
- Legal LLMs: Fine-tuned on legal documents
- Code models: Fine-tuned on code repositories
- Customer service: Fine-tuned on support tickets
Task-specific models:
- Sentiment analysis for product reviews
- Named entity recognition for information extraction
- Question answering for knowledge bases
- Text classification for content moderation
Instruction-tuned Models
Models such as ChatGPT and Claude use instruction tuning:
- Better at following user instructions
- More helpful and aligned with human intent
- Can handle diverse tasks without task-specific fine-tuning
- Show improved safety and reduced harmful outputs