Chapter 3: BERT Architecture
Understanding Encoder-Only Models
Learning Objectives
- Understand BERT architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
BERT Architecture
Introduction
This chapter provides comprehensive coverage of the BERT architecture, including detailed explanations, mathematical formulations, code implementations, and real-world examples.
📚 Why This Matters
Understanding the BERT architecture is crucial for mastering modern AI systems. This chapter breaks down complex concepts into digestible explanations with step-by-step examples.
Key Concepts
Fine-tuning Strategies
Full fine-tuning: Update all model parameters. Most powerful, but requires the most memory and compute.
Partial fine-tuning: Freeze the early layers and train only the later layers (see the sketch after this list). Reduces memory and compute.
Parameter-efficient methods: Train only a small subset of parameters (LoRA, adapters). Very efficient, but may trade away a little performance.
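As a concrete illustration of partial fine-tuning, here is a minimal sketch assuming the HuggingFace transformers library and GPT-2's module layout (its transformer blocks live in model.transformer.h):

from transformers import AutoModelForCausalLM

# Partial fine-tuning sketch: freeze everything, then unfreeze
# only the last two transformer blocks and the final layer norm.
model = AutoModelForCausalLM.from_pretrained("gpt2")

for param in model.parameters():
    param.requires_grad = False                   # freeze all parameters

for block in model.transformer.h[-2:]:            # GPT-2 stores its blocks in transformer.h
    for param in block.parameters():
        param.requires_grad = True                # unfreeze the last two blocks

for param in model.transformer.ln_f.parameters():
    param.requires_grad = True                    # unfreeze the final layer norm

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")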
Learning Rate Considerations
Why lower learning rates:
- Pre-trained weights are already good
- High learning rates can destroy pre-trained knowledge
- Typical: 1e-5 to 1e-3 (vs 1e-3 to 1e-2 for training from scratch)
- Often use learning rate schedule with warmup
Catastrophic Forgetting
The problem: Fine-tuning on a new task can cause the model to forget what it learned during pre-training.
Solutions:
- Lower learning rates
- Freeze early layers
- Use regularization that anchors weights to their pre-trained values (see the sketch after this list)
- Continual learning techniques
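One way to make the regularization idea concrete is an L2-SP-style penalty that pulls fine-tuned weights back toward their pre-trained values. A minimal standalone sketch, where the tiny linear layer is a hypothetical stand-in for a real pre-trained network:

import torch
import torch.nn as nn

# Stand-in for a pre-trained model; snapshot its weights as the anchor.
model = nn.Linear(768, 2)
anchor = {name: p.detach().clone() for name, p in model.named_parameters()}

def regularized_loss(task_loss, model, lam=0.01):
    # Penalize squared distance from the pre-trained weights.
    penalty = sum(((p - anchor[name]) ** 2).sum()
                  for name, p in model.named_parameters())
    return task_loss + lam * penalty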
Mathematical Formulations
Fine-tuning Loss
Fine-tuning starts from the pre-trained parameters and learns a small update that minimizes the loss on the task-specific data:
\[\theta = \theta_{\text{pre-trained}} + \Delta\theta, \qquad \Delta\theta = \arg\min_{\Delta\theta} \sum_i \mathcal{L}\big(f(x_i;\, \theta_{\text{pre-trained}} + \Delta\theta),\, y_i\big)\]
Where:
- \(\theta_{\text{pre-trained}}\): Pre-trained model parameters
- \(\Delta\theta\): Parameter updates during fine-tuning
- \(y_i, x_i\): Task-specific labeled examples
- The update \(\Delta\theta\) is typically small in magnitude
LoRA (Low-Rank Adaptation)
LoRA freezes the pre-trained weight matrix and learns a low-rank additive update:
\[W' = W + \Delta W = W + BA\]
Where:
- \(W\): Original weight matrix (frozen)
- \(B\): Low-rank matrix of shape \(d \times r\) (trainable)
- \(A\): Low-rank matrix of shape \(r \times d\) (trainable)
- Only \(B\) and \(A\) are trained, not \(W\)
- Reduces trainable parameters significantly
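A minimal sketch of this decomposition in plain PyTorch (dimensions chosen to match the worked example later in this chapter; LoRA conventionally initializes \(B\) to zero so the update starts at \(\Delta W = 0\)):

import torch

d, r = 768, 8
W = torch.randn(d, d)            # frozen pre-trained weight (no gradient)
A = torch.randn(r, d) * 0.01     # trainable, rank r
B = torch.zeros(d, r)            # trainable, initialized to zero so the update starts at 0
A.requires_grad_(True)
B.requires_grad_(True)

x = torch.randn(d)
h = W @ x + B @ (A @ x)          # same result as (W + B @ A) @ x, but cheaper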
Learning Rate Schedule
The warmup phase gradually increases the learning rate, which then decays linearly. This prevents large, destabilizing gradient updates early in fine-tuning.
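A minimal sketch of linear warmup followed by linear decay (the function and its parameter names are illustrative, not a library API):

def lr_schedule(step, max_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then linear decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

# e.g. lr_schedule(50) == 1e-5 (mid-warmup), lr_schedule(1000) == 0.0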
Detailed Examples
Example: Fine-tuning for Sentiment Analysis
Pre-trained model: GPT-2 (general language understanding)
Task: Classify movie reviews as positive or negative
Step 1: Prepare data
- Training examples: ("This movie was amazing!", "positive")
- Format: Review text → sentiment label
Step 2: Add classification head
- Add linear layer on top of pre-trained model
- Output: 2 classes (positive, negative)
Step 3: Fine-tune
- Learning rate: 2e-5 (much lower than pre-training)
- Freeze early layers, train later layers + classification head
- Train for 3-5 epochs
Result: The model adapts its general language knowledge to the sentiment classification task.
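Step 2, adding a classification head, amounts to a single linear layer on top of the model's final hidden state. A minimal standalone sketch (the hidden size of 768 matches GPT-2 small; the random tensor is a stand-in for the model's real output):

import torch
import torch.nn as nn

hidden_size = 768                          # GPT-2 small's hidden size
head = nn.Linear(hidden_size, 2)           # logits for (negative, positive)

last_hidden = torch.randn(1, hidden_size)  # stand-in for the final hidden state
logits = head(last_hidden)
probs = logits.softmax(dim=-1)             # probabilities over the two classes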
Example: LoRA Fine-tuning
Original weight matrix: W (768×768) = 589,824 parameters
LoRA approach:
- W (frozen): 589,824 parameters
- B (768×8): 6,144 parameters (trainable)
- A (8×768): 6,144 parameters (trainable)
- Total trainable: 12,288 (about 2% of the original!)
Benefits: Much less memory, faster training, can fine-tune on single GPU.
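The arithmetic is easy to verify (a quick check using the dimensions from this example):

d, r = 768, 8
full = d * d                    # 589,824 parameters in W
lora = d * r + r * d            # 12,288 trainable parameters in B and A
print(f"{lora:,} / {full:,} = {lora / full:.1%}")   # ~2.1%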
Implementation
Fine-tuning with HuggingFace
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Load pre-trained model and tokenizer
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
model.config.pad_token_id = tokenizer.pad_token_id  # the model needs it for padded batches

# Prepare data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Example data
data = {
    "text": ["This movie was amazing!", "Terrible acting.", "It was okay."],
    "label": [1, 0, 0],  # 1=positive, 0=negative
}
dataset = Dataset.from_dict(data)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,  # low learning rate for fine-tuning
    warmup_steps=100,
    logging_dir="./logs",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
trainer.train()
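After training, a quick sanity check on an unseen review (illustrative; this reuses the model and tokenizer fine-tuned above):

import torch

inputs = tokenizer("A wonderful, heartfelt film.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 1 = positive, 0 = negative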
LoRA Fine-tuning with PEFT
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load the base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update (the r in W + BA)
    lora_alpha=16,       # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"],  # GPT-2's attention and projection layers
)

# Apply LoRA: freezes W and injects the trainable A and B matrices
model = get_peft_model(model, lora_config)

# Now only the LoRA parameters are trainable
model.print_trainable_parameters()

# Fine-tune as usual (but with far fewer trainable parameters)
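Because only the adapter weights are trainable, saving the fine-tuned model stores just the small LoRA parameters rather than a full copy of the base model:

# Saves only the LoRA adapter weights and config, not the full base model
model.save_pretrained("./lora-adapter")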
Real-World Applications
Fine-tuning Use Cases
Domain-specific applications:
- Medical: Fine-tune on medical literature for clinical applications
- Legal: Fine-tune on legal documents for contract analysis
- Code: Fine-tune on code repositories for programming assistance
- Customer service: Fine-tune on support tickets for automated responses
Task-specific fine-tuning:
- Sentiment analysis for product reviews
- Named entity recognition for information extraction
- Question answering for knowledge bases
- Text classification for content moderation
When to Use Fine-tuning vs Prompting
Use fine-tuning when:
- You have task-specific labeled data
- You need high performance in a specific domain
- Prompting doesn't achieve the desired accuracy
- You can afford the training time and resources
Use prompting when:
- You have limited or no labeled data
- You need quick iteration
- The task is simple enough for few-shot learning
- You want to avoid training overhead