Chapter 6: LoRA & Parameter-Efficient Fine-tuning

Efficient Adaptation

Learning Objectives

  • Understand LoRA and parameter-efficient fine-tuning fundamentals
  • Master the mathematical foundations
  • Learn practical implementation
  • Apply knowledge through examples
  • Recognize real-world applications

Introduction

This chapter provides comprehensive coverage of LoRA and parameter-efficient fine-tuning, including detailed explanations, mathematical formulations, code implementations, and real-world examples.

📚 Why This Matters

Understanding LoRA and parameter-efficient fine-tuning is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.

Key Concepts

Why Parameter-Efficient Fine-tuning?

Problem with full fine-tuning:

  • Large models (billions of parameters) require massive memory
  • Training all parameters is expensive
  • Storing multiple fine-tuned models requires huge storage
  • Not feasible for many users or edge devices

Solution: Parameter-efficient methods

  • Only train small subset of parameters
  • Dramatically reduce memory and compute
  • Can fine-tune on single GPU
  • Store only small adapter weights

LoRA (Low-Rank Adaptation)

Key insight: the weight updates learned during fine-tuning tend to have low intrinsic rank, so they can be well approximated by the product of two small matrices.

How it works:

  • Original weights W are frozen
  • Add trainable low-rank matrices B and A
  • W' = W + BA (where BA is low-rank)
  • Only train B and A (much fewer parameters)

Benefits: 10-100x reduction in trainable parameters, minimal performance loss.
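
To make the low-rank idea concrete, here is a short PyTorch check (the sizes d = 768 and r = 8 are only illustrative): the update \(BA\) has the full \(d \times d\) shape, but its rank can never exceed \(r\).

import torch

d, r = 768, 8
B = torch.randn(d, r)
A = torch.randn(r, d)
delta_W = B @ A  # shaped like a full weight update...

print(delta_W.shape)                      # torch.Size([768, 768])
print(torch.linalg.matrix_rank(delta_W))  # ...but its rank is at most r = 8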

Other Parameter-Efficient Methods

Adapter layers: Insert small trainable bottleneck layers between transformer sublayers; only the adapters are trained (a minimal sketch follows below).

Prompt tuning: Learn soft prompts (continuous embeddings prepended to the input) instead of updating model weights.

Prefix tuning: Similar to prompt tuning, but trainable prefix vectors are added to the attention keys and values at every layer rather than only at the input.
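
As a sketch of the adapter idea (not any particular library's implementation; the hidden size and bottleneck width below are arbitrary choices):

import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual connection."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        # The residual keeps the frozen transformer's output; only the small adapter is trained
        return x + self.up(self.act(self.down(x)))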

Mathematical Formulations

LoRA Weight Update

\[W' = W_0 + \Delta W = W_0 + BA\]
Where:
  • \(W_0 \in \mathbb{R}^{d \times d}\): Original weight matrix (frozen)
  • \(B \in \mathbb{R}^{d \times r}\): Trainable matrix (rank r)
  • \(A \in \mathbb{R}^{r \times d}\): Trainable matrix
  • \(r \ll d\): Rank is much smaller than dimension
  • Parameters: \(2dr\) instead of \(d^2\)
Example:
  • d = 768, r = 8
  • Full: 768² = 589,824 parameters
  • LoRA: 2×768×8 = 12,288 parameters (2% of original!)
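
A quick sanity check of these counts in plain Python:

d, r = 768, 8
full_params = d * d       # 589,824
lora_params = 2 * d * r   # 12,288
print(f"{lora_params:,} / {full_params:,} = {100 * lora_params / full_params:.2f}%")  # ~2.08%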

Forward Pass with LoRA

\[h = W'x = (W_0 + BA)x = W_0x + B(Ax)\]

During the forward pass, both terms are computed, and the low-rank path \(B(Ax)\) never materializes a full \(d \times d\) update. Because \(W_0\) is frozen, no gradients or optimizer states are stored for it; only \(B\) and \(A\) are updated during training.
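
This identity is easy to verify numerically; the matrices below are random and purely illustrative:

import torch

d, r = 768, 8
W0 = torch.randn(d, d)   # frozen pre-trained weight
B = torch.randn(d, r)    # trainable
A = torch.randn(r, d)    # trainable
x = torch.randn(d)

h_merged = (W0 + B @ A) @ x         # conceptually what W'x means
h_factored = W0 @ x + B @ (A @ x)   # what is actually computed; no d x d update is formed
print(torch.allclose(h_merged, h_factored, rtol=1e-4, atol=1e-4))  # True, up to float32 rounding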

Rank Selection

\[\text{Compression ratio} = \frac{2dr}{d^2} = \frac{2r}{d}\]

For r=8 and d=768, compression ratio is ~2%. Lower rank = more compression but potentially lower performance. Typical ranks: 4-16.
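
The trade-off is easy to tabulate; the snippet below assumes d = 768 as in the running example:

d = 768
for r in (4, 8, 16, 32):
    print(f"r = {r:>2}: trainable fraction = {2 * r / d:.2%}")
# r =  4: 1.04%   r =  8: 2.08%   r = 16: 4.17%   r = 32: 8.33%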

Detailed Examples

Example: LoRA Parameter Reduction

GPT-2 base model: 124M parameters

Full fine-tuning:

  • Trainable: 124M parameters
  • Training memory: ~2GB for weights, gradients, and optimizer states (plus activations)
  • Storage: ~500MB per fine-tuned model (fp32 weights)

LoRA fine-tuning (r=8):

  • Trainable: ~1M parameters (the LoRA matrices)
  • Training memory: ~500MB, since gradients and optimizer states are kept only for the adapters
  • Storage: ~4MB per fine-tuned model (~125x smaller!)

Result: roughly 125 LoRA adapters fit in the storage of one full fine-tuned model, and training fits on a single GPU.

Example: LoRA Application

Scenario: Fine-tune GPT-2 for code generation

Step 1: Identify target layers

  • Apply LoRA to attention layers (c_attn, c_proj)
  • These layers capture most task-specific patterns

Step 2: Initialize LoRA matrices

  • B initialized to zeros (so W' = W initially)
  • A initialized with small random values
  • Ensures training starts from the pre-trained weights (see the check after Step 3)

Step 3: Train only LoRA parameters

  • Freeze all original weights
  • Only update B and A matrices
  • Much faster and memory-efficient
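
The zero-initialization property from Step 2 can be checked directly; the layer sizes below are illustrative:

import torch
import torch.nn as nn

layer = nn.Linear(768, 768)        # stands in for a frozen pre-trained layer
A = torch.randn(8, 768) * 0.02     # small random init
B = torch.zeros(768, 8)            # zero init, so B @ A = 0 at the start of training
x = torch.randn(4, 768)

adapted = layer(x) + x @ A.T @ B.T
print(torch.allclose(adapted, layer(x)))  # True: W' = W before any updates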

Implementation

LoRA with PEFT Library

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank (low-rank dimension)
    lora_alpha=16,  # Scaling factor
    lora_dropout=0.1,  # Dropout for LoRA layers
    target_modules=["c_attn", "c_proj"],  # Which layers to apply LoRA to
    bias="none"  # Don't train bias
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print parameter counts
trainable = model.num_parameters(only_trainable=True)
total = model.num_parameters()
print(f"Trainable: {trainable:,}")
print(f"Total: {total:,}")
print(f"Trainable %: {100 * trainable / total:.2f}%")

# Now fine-tune as usual (but only LoRA params update)
# model.train()
# ... training loop ...
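
A minimal training step might look like the following sketch. It assumes `dataloader` yields batches of tokenized text containing input_ids, attention_mask, and labels; the learning rate and output directory name are only placeholders:

import torch

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the LoRA parameters
    lr=1e-4,
)

model.train()
for batch in dataloader:
    outputs = model(**batch)      # loss is computed from the provided labels
    outputs.loss.backward()       # gradients flow only into the LoRA matrices
    optimizer.step()
    optimizer.zero_grad()

# Saving a PEFT model writes only the small adapter weights, not the full base model
model.save_pretrained("gpt2-lora-adapter")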

Manual LoRA Implementation

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """LoRA layer implementation"""
    
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        
        # Freeze the original layer; only the LoRA matrices are trained
        for param in self.original_layer.parameters():
            param.requires_grad = False
        
        # Get dimensions
        if isinstance(original_layer, nn.Linear):
            in_features = original_layer.in_features
            out_features = original_layer.out_features
        else:
            raise ValueError("Original layer must be nn.Linear")
        
        # LoRA matrices: A gets small random values, B starts at zero so W' = W initially
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
    
    def forward(self, x):
        # Original (frozen) output
        original_output = self.original_layer(x)
        
        # LoRA path: x A^T B^T, scaled by alpha / rank
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        lora_output = lora_output * self.scaling
        
        return original_output + lora_output

# Example usage
# original = nn.Linear(768, 768)
# lora_layer = LoRALayer(original, rank=8)
# output = lora_layer(input_tensor)

Real-World Applications

LoRA Use Cases

Personalized models:

  • Fine-tune large models for individual users
  • Store only small LoRA weights per user
  • Enable personalization without massive storage

Multi-task deployment:

  • One base model, multiple LoRA adapters
  • Switch between tasks by loading different LoRA weights
  • Much more efficient than multiple full models (see the adapter-switching sketch below)

Edge deployment:

  • Fine-tune on edge devices with limited memory
  • LoRA enables fine-tuning on consumer GPUs
  • Makes large models accessible to more users
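
For multi-task deployment, the PEFT library can attach several adapters to one base model and switch between them at runtime. The adapter paths and names below are hypothetical placeholders:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2")

# Attach a first adapter, then add more without reloading the base model
model = PeftModel.from_pretrained(base, "adapters/code-generation", adapter_name="code")
model.load_adapter("adapters/summarization", adapter_name="summarize")

model.set_adapter("code")       # requests now route through the code-generation adapter
model.set_adapter("summarize")  # switch tasks by swapping a few MB of weights

Because each adapter is only a few megabytes, switching tasks is essentially instantaneous compared to reloading a full fine-tuned model.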

When to Use LoRA

Use LoRA when:

  • Memory/compute is limited
  • You need to fine-tune many variants
  • Storage is a concern
  • Performance trade-off is acceptable

Use full fine-tuning when:

  • You need maximum performance
  • Resources are abundant
  • Only fine-tuning one or few models

Test Your Understanding

Question 1: What is the main problem that parameter-efficient fine-tuning methods like LoRA solve?

A) Large models require massive memory and compute for full fine-tuning, making it infeasible for many users. Parameter-efficient methods only train a small subset of parameters, dramatically reducing requirements
B) Models are too small
C) Training is too fast
D) Models don't need fine-tuning

Question 2: What is the key insight behind LoRA (Low-Rank Adaptation)?

A) Weight updates during fine-tuning have low intrinsic rank, so we can approximate updates with low-rank matrices B and A, where W' = W + BA, training only B and A instead of the full weight matrix W
B) All weights must be trained
C) High-rank matrices are needed
D) LoRA increases model size

Question 3: What is the mathematical formulation for LoRA weight update?

A) \(W' = W_0 + \Delta W = W_0 + BA\) where \(W_0\) is frozen, and only low-rank matrices B and A are trainable
B) \(W' = W_0 \times BA\)
C) \(W' = W_0 - BA\)
D) \(W' = BA\) only

Question 4: For a weight matrix of size 768×768, how many parameters does LoRA train if rank r=8?

A) 12,288 parameters (2×768×8), which is only about 2% of the original 589,824 parameters
B) 589,824 parameters (same as full fine-tuning)
C) 768 parameters
D) 8 parameters

Question 5: What are other parameter-efficient fine-tuning methods besides LoRA?

A) Adapter layers (small trainable layers inserted between transformer layers), prompt tuning (learn soft prompts), and prefix tuning (similar to prompt tuning but applied to all layers)
B) Only LoRA exists
C) Full fine-tuning is the only method
D) Only freezing layers

Question 6: What is the forward pass computation with LoRA?

A) \(h = W'x = (W_0 + BA)x = W_0x + B(Ax)\); both terms are computed in the forward pass, but \(W_0\) stays frozen and only the low-rank path \(B(Ax)\) receives gradient updates
B) \(h = W_0x\) only
C) \(h = BAx\) only
D) \(h = W_0 \times BA \times x\)

Question 7: What is the compression ratio formula for LoRA?

A) \(\text{Compression ratio} = \frac{2dr}{d^2} = \frac{2r}{d}\), which for r=8 and d=768 is approximately 2%
B) \(\text{Compression ratio} = \frac{d}{r}\)
C) \(\text{Compression ratio} = d^2\)
D) \(\text{Compression ratio} = 2r\)

Question 8: What are typical rank values (r) used in LoRA?

A) Typically 4-16, where lower rank means more compression but potentially lower performance, and higher rank means better performance but more parameters
B) Always 1
C) Always equal to the dimension d
D) Random values

Question 9: What are the benefits of LoRA fine-tuning compared to full fine-tuning?

A) 10-100x reduction in trainable parameters, much less memory usage, faster training, ability to fine-tune on single GPU, and can store many LoRA adapters in the space of one full model
B) LoRA requires more memory
C) LoRA is slower
D) LoRA requires multiple GPUs

Question 10: How are LoRA matrices typically initialized?

A) B is initialized to zeros (so W' = W initially) and A is initialized with small random values, ensuring training starts from pre-trained weights
B) Both are initialized randomly
C) Both are initialized to zeros
D) Both are initialized to ones

Question 11: What are some use cases for LoRA?

A) Personalized models (one base model with multiple user-specific LoRA adapters), multi-task deployment (switch between tasks by loading different LoRA weights), and edge deployment (fine-tune on devices with limited memory)
B) Only for large-scale training
C) Only for image tasks
D) Only for speech recognition

Question 12: What is the typical performance trade-off when using LoRA compared to full fine-tuning?

A) LoRA typically achieves 90-95% of full fine-tuning performance while using only 1-10% of the trainable parameters, making it an excellent efficiency-performance trade-off
B) LoRA always outperforms full fine-tuning
C) LoRA performs significantly worse (less than 50% of full fine-tuning)
D) LoRA and full fine-tuning have identical performance