Chapter 6: LoRA & Parameter-Efficient Fine-tuning
Efficient Adaptation
Learning Objectives
- Understand the fundamentals of LoRA and parameter-efficient fine-tuning
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Introduction
This chapter provides comprehensive coverage of LoRA and parameter-efficient fine-tuning, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
📚 Why This Matters
Understanding LoRA and parameter-efficient fine-tuning is crucial for adapting modern large language models without the memory and compute cost of full fine-tuning. This chapter breaks the key concepts down into digestible explanations with step-by-step examples.
Key Concepts
Why Parameter-Efficient Fine-tuning?
Problem with full fine-tuning:
- Large models (billions of parameters) require massive memory
- Training all parameters is expensive
- Storing multiple fine-tuned models requires huge storage
- Not feasible for many users or edge devices
Solution: Parameter-efficient methods
- Train only a small subset of parameters
- Dramatically reduce memory and compute
- Can fine-tune on single GPU
- Store only small adapter weights
LoRA (Low-Rank Adaptation)
Key insight: Weight updates during fine-tuning have low intrinsic rank. We can approximate updates with low-rank matrices.
How it works:
- Original weights W are frozen
- Add trainable low-rank matrices B and A
- W' = W + BA (where BA is low-rank)
- Only B and A are trained (far fewer parameters)
Benefits: often a 100x or greater reduction in trainable parameters, with minimal loss in task performance.
Other Parameter-Efficient Methods
Adapter layers: Insert small trainable bottleneck layers between transformer sub-layers; only the adapters are trained (a minimal sketch follows this list).
Prompt tuning: Learn soft prompts (continuous embeddings) instead of model weights.
Prefix tuning: Similar to prompt tuning, but trainable prefix vectors are prepended at every layer rather than only at the input.
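As referenced above, here is a minimal PyTorch sketch of a bottleneck adapter. The hidden and bottleneck dimensions are illustrative, and exactly where adapters are inserted varies between methods:
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable bottleneck with a residual connection."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-project
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-project

    def forward(self, x):
        # Residual connection preserves the frozen model's representation
        return x + self.up(self.act(self.down(x)))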
Mathematical Formulations
LoRA Weight Update
\(W' = W_0 + BA\)
Where:
- \(W_0 \in \mathbb{R}^{d \times d}\): Original weight matrix (frozen)
- \(B \in \mathbb{R}^{d \times r}\): Trainable matrix (rank r)
- \(A \in \mathbb{R}^{r \times d}\): Trainable matrix
- \(r \ll d\): Rank is much smaller than dimension
- Parameters: \(2dr\) instead of \(d^2\)
Example:
- d = 768, r = 8
- Full: 768² = 589,824 parameters
- LoRA: 2×768×8 = 12,288 parameters (2% of original!)
Forward Pass with LoRA
\(h = W_0 x + BAx\)
During the forward pass, both terms are computed and summed. \(W_0\) is frozen, so only \(B\) and \(A\) receive gradient updates during training. At inference time, the product \(BA\) can be merged into \(W_0\), so LoRA adds no extra latency.
Rank Selection
For r=8 and d=768, compression ratio is ~2%. Lower rank = more compression but potentially lower performance. Typical ranks: 4-16.
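A quick calculation shows how the trainable-parameter count and compression ratio grow with the rank (here for d = 768, matching the example above):
d = 768
for r in (4, 8, 16):
    lora_params = 2 * d * r
    full_params = d * d
    print(f"r={r:2d}: {lora_params:,} params "
          f"({100 * lora_params / full_params:.2f}% of the full matrix)")
# r= 4:  6,144 params (1.04% of the full matrix)
# r= 8: 12,288 params (2.08% of the full matrix)
# r=16: 24,576 params (4.17% of the full matrix)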
Detailed Examples
Example: LoRA Parameter Reduction
GPT-2 base model: 124M parameters
Full fine-tuning:
- Trainable: 124M parameters
- Memory: ~2GB of GPU memory for weights, gradients, and optimizer states
- Storage: 500MB per fine-tuned model
LoRA fine-tuning (r=8):
- Trainable: ~1M parameters (LoRA matrices)
- Memory: ~500MB of GPU memory (frozen weights plus small LoRA states)
- Storage: 4MB per fine-tuned model (125x smaller!)
Result: You can store roughly 125 LoRA adapters in the space of one full model and fine-tune on a single GPU.
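The storage figures follow directly from the parameter counts, assuming 4-byte float32 weights and the approximate ~1M LoRA parameter count quoted above (the small difference from the 125x figure comes from rounding 496 MB up to 500 MB):
full_params = 124_000_000   # GPT-2 base
lora_params = 1_000_000     # approximate LoRA parameter count (r=8)
bytes_per_param = 4         # float32

print(f"Full checkpoint: {full_params * bytes_per_param / 1e6:.0f} MB")  # ~496 MB
print(f"LoRA adapter:    {lora_params * bytes_per_param / 1e6:.0f} MB")  # ~4 MB
print(f"Adapters per full checkpoint: {full_params // lora_params}")     # ~124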
Example: LoRA Application
Scenario: Fine-tune GPT-2 for code generation
Step 1: Identify target layers
- Apply LoRA to attention layers (c_attn, c_proj)
- These layers capture most task-specific patterns
Step 2: Initialize LoRA matrices
- B initialized to zeros (so W' = W initially)
- A initialized with small random values
- Ensures training starts from pre-trained weights
Step 3: Train only LoRA parameters
- Freeze all original weights
- Only update B and A matrices
- Much faster and memory-efficient
Implementation
LoRA with PEFT Library
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8, # Rank (low-rank dimension)
lora_alpha=16, # Scaling factor
lora_dropout=0.1, # Dropout for LoRA layers
target_modules=["c_attn", "c_proj"], # Which layers to apply LoRA to
bias="none" # Don't train bias
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Print parameter counts
trainable = model.num_parameters(only_trainable=True)
total = model.num_parameters()
print(f"Trainable: {trainable:,}")
print(f"Total: {total:,}")
print(f"Trainable %: {100 * trainable / total:.2f}%")
# Now fine-tune as usual (but only LoRA params update)
# model.train()
# ... training loop ...
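Continuing from the snippet above, here is a minimal sketch of the fine-tuning setup plus saving and re-loading the adapter. The optimizer only receives the trainable (LoRA) parameters, and save_pretrained stores just the adapter weights; the directory name is illustrative:
import torch
from peft import PeftModel

# Optimizer over the trainable (LoRA) parameters only
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ... run the usual training loop here ...

# Save only the adapter weights (a few MB), not the full model
model.save_pretrained("gpt2-lora-adapter")

# Later: reload the frozen base model and attach the adapter
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base, "gpt2-lora-adapter")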
Manual LoRA Implementation
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
    """Wraps a frozen nn.Linear layer with trainable low-rank matrices A and B."""

    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        if not isinstance(original_layer, nn.Linear):
            raise ValueError("Original layer must be nn.Linear")

        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        # Freeze the original weights; only the LoRA matrices are trained
        for param in self.original_layer.parameters():
            param.requires_grad = False

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # LoRA matrices: A is small random, B is zero, so W' = W at initialization
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Frozen path
        original_output = self.original_layer(x)
        # Low-rank path: x @ A^T @ B^T, scaled by alpha / rank
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        return original_output + lora_output * self.scaling
# Example usage
# original = nn.Linear(768, 768)
# lora_layer = LoRALayer(original, rank=8)
# output = lora_layer(input_tensor)
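For deployment, the low-rank update can be folded back into the frozen weight so the adapted layer runs as a plain nn.Linear. A minimal sketch, assuming the LoRALayer class above (the helper name merge_lora is illustrative):
import torch

def merge_lora(lora_layer):
    """Fold the LoRA update into the frozen weight: W' = W0 + (alpha / rank) * B @ A."""
    with torch.no_grad():
        delta = lora_layer.scaling * (lora_layer.lora_B @ lora_layer.lora_A)
        lora_layer.original_layer.weight += delta
    return lora_layer.original_layer  # a plain nn.Linear with merged weights

# merged = merge_lora(lora_layer)
# merged(input_tensor) now matches lora_layer(input_tensor)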
Real-World Applications
LoRA Use Cases
Personalized models:
- Fine-tune large models for individual users
- Store only small LoRA weights per user
- Enable personalization without massive storage
Multi-task deployment:
- One base model, multiple LoRA adapters
- Switch between tasks by loading different LoRA adapter weights (see the sketch after this list)
- Much more efficient than multiple full models
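A sketch of this pattern with the peft library; the adapter directories and adapter names are illustrative, and load_adapter/set_adapter are available in recent peft versions:
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach one adapter per task to the same frozen base model
model = PeftModel.from_pretrained(base, "lora-summarization", adapter_name="summarization")
model.load_adapter("lora-code", adapter_name="code")

# Switching tasks only changes which small adapter is active
model.set_adapter("code")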
Edge deployment:
- Fine-tune on edge devices with limited memory
- LoRA enables fine-tuning on consumer GPUs
- Makes large models accessible to more users
When to Use LoRA
Use LoRA when:
- Memory/compute is limited
- You need to fine-tune many variants
- Storage is a concern
- Performance trade-off is acceptable
Use full fine-tuning when:
- You need maximum performance
- Resources are abundant
- Only fine-tuning one or few models