Chapter 3: BERT Architecture

Understanding Encoder-Only Models

Learning Objectives

  • Understand BERT architecture fundamentals
  • Master the mathematical foundations
  • Learn practical implementation
  • Apply knowledge through examples
  • Recognize real-world applications

Introduction

This chapter provides comprehensive coverage of the BERT architecture, including detailed explanations, mathematical formulations, code implementations, and real-world examples.

📚 Why This Matters

Understanding the BERT architecture is crucial for mastering modern AI systems. This chapter breaks complex concepts down into digestible explanations with step-by-step examples.

Key Concepts

Fine-tuning Strategies

Full fine-tuning: update all model parameters. This is the most flexible approach, but it also requires the most memory and compute.

Partial fine-tuning: freeze the early layers and train only the later layers (see the sketch below). This reduces memory and compute.

Parameter-efficient methods: train only a small subset of parameters (e.g., LoRA, adapters). Very efficient, though sometimes with a slight performance trade-off.
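
A minimal sketch of partial fine-tuning, assuming the GPT-2 classifier used later in this chapter (the attribute names model.transformer.h, wte, and wpe are specific to GPT-2 in HuggingFace Transformers and differ for other architectures):

from transformers import GPT2ForSequenceClassification

# Load the pre-trained model with a 2-class classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

# Freeze the embeddings and the first 8 of GPT-2's 12 transformer blocks;
# only the last 4 blocks and the classification head remain trainable.
for param in model.transformer.wte.parameters():
    param.requires_grad = False
for param in model.transformer.wpe.parameters():
    param.requires_grad = False
for block in model.transformer.h[:8]:
    for param in block.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")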

Learning Rate Considerations

Why lower learning rates:

  • Pre-trained weights are already good
  • High learning rates can destroy pre-trained knowledge
  • Typical: 1e-5 to 1e-3 (vs 1e-3 to 1e-2 for training from scratch)
  • Often use learning rate schedule with warmup

Catastrophic Forgetting

The problem: fine-tuning on a new task can cause the model to forget what it learned during pre-training.

Solutions:

  • Lower learning rates
  • Freeze early layers
  • Use regularization (e.g., penalize drift from the pre-trained weights, as sketched below)
  • Continual learning techniques
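
One simple way to regularize against forgetting is to penalize how far the fine-tuned weights drift from their pre-trained values. The sketch below adds an L2 drift penalty to the task loss; it assumes model is an already-loaded pre-trained model, and reg_lambda is a hypothetical hyperparameter, not a value prescribed in this chapter.

# Snapshot the pre-trained weights before fine-tuning begins.
pretrained_params = {name: p.detach().clone() for name, p in model.named_parameters()}
reg_lambda = 0.01  # hypothetical penalty strength

def loss_with_drift_penalty(task_loss, model):
    """Add an L2 penalty on the distance from the pre-trained weights."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + ((param - pretrained_params[name]) ** 2).sum()
    return task_loss + reg_lambda * penalty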

Mathematical Formulations

Fine-tuning Loss

\[L_{\text{fine-tune}} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i, \theta_{\text{pre-trained}} + \Delta\theta)\]
Where:
  • \(\theta_{\text{pre-trained}}\): Pre-trained model parameters
  • \(\Delta\theta\): Parameter updates during fine-tuning
  • \(y_i, x_i\): Task-specific labeled examples
  • Updates during fine-tuning are typically small relative to the pre-trained weights
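
Concretely, this is the average negative log-likelihood over the labeled examples, i.e., the standard cross-entropy loss over the model's output logits. A minimal sketch with placeholder tensors:

import torch
import torch.nn.functional as F

# Placeholder batch: N = 3 examples, 2 classes.
logits = torch.tensor([[2.0, -1.0], [0.5, 0.3], [-1.2, 1.8]])  # model outputs
labels = torch.tensor([0, 1, 1])                               # y_i

# cross_entropy computes -(1/N) * sum_i log P(y_i | x_i), matching the formula above.
loss = F.cross_entropy(logits, labels)
print(loss.item())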

LoRA (Low-Rank Adaptation)

\[W' = W + \Delta W = W + BA\]
Where:
  • \(W\): Original weight matrix (frozen)
  • \(B\): Trainable matrix of shape \(d \times r\)
  • \(A\): Trainable matrix of shape \(r \times d\), so \(\Delta W = BA\) has rank at most \(r\)
  • Only \(B\) and \(A\) are trained, not \(W\)
  • Reduces trainable parameters significantly
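
A small numeric sketch of the decomposition, using the 768-dimensional, rank-8 shapes from the worked example later in this chapter (initialization follows the common convention of starting \(B\) at zero so that \(\Delta W = 0\) at the start of training):

import torch

d, r = 768, 8                      # hidden size and LoRA rank

W = torch.randn(d, d)              # pre-trained weight, frozen
B = torch.zeros(d, r)              # trainable, initialized to zero
A = torch.randn(r, d) * 0.01       # trainable, small random init

W_prime = W + B @ A                # effective weight: W' = W + BA

print(W.numel())                   # 589,824 frozen parameters
print(B.numel() + A.numel())       # 12,288 trainable parameters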

Learning Rate Schedule

\[\text{lr}(t) = \begin{cases} \text{lr}_{\text{max}} \times \frac{t}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \text{lr}_{\text{max}} \times \left(1 - \frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}}\right) & \text{if } t \geq T_{\text{warmup}} \end{cases}\]

The warmup phase gradually increases the learning rate, after which it decays linearly. This prevents large, destabilizing gradient updates early in fine-tuning.
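
A direct translation of this schedule into a helper function (lr_max, T_warmup, and T_total correspond to the symbols in the formula):

def lr_at_step(t, lr_max, T_warmup, T_total):
    """Linear warmup followed by linear decay, as in the formula above."""
    if t < T_warmup:
        return lr_max * t / T_warmup
    return lr_max * (1 - (t - T_warmup) / (T_total - T_warmup))

# Example: peak learning rate 2e-5, 100 warmup steps, 1000 total steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, 2e-5, 100, 1000))

In HuggingFace Transformers, this corresponds to the built-in linear schedule with warmup (get_linear_schedule_with_warmup), which the Trainer uses by default when warmup_steps is set.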

Detailed Examples

Example: Fine-tuning for Sentiment Analysis

Pre-trained model: GPT-2 (a general-purpose pre-trained language model)

Task: Classify movie reviews as positive or negative

Step 1: Prepare data

  • Training examples: ("This movie was amazing!", "positive")
  • Format: Review text → sentiment label

Step 2: Add classification head

  • Add linear layer on top of pre-trained model
  • Output: 2 classes (positive, negative)

Step 3: Fine-tune

  • Learning rate: 2e-5 (much lower than pre-training)
  • Freeze early layers, train later layers + classification head
  • Train for 3-5 epochs

Result: the model adapts its general language knowledge to the sentiment classification task.

Example: LoRA Fine-tuning

Original weight matrix: W (768×768) = 589,824 parameters

LoRA approach:

  • W (frozen): 589,824 parameters
  • B (768×8): 6,144 parameters (trainable)
  • A (8×768): 6,144 parameters (trainable)
  • Total trainable: 12,288 (about 2% of the original)

Benefits: much less memory, faster training, and the ability to fine-tune on a single GPU.
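
A quick arithmetic check of these counts:

d, r = 768, 8

full = d * d             # 589,824 parameters in W
lora = d * r + r * d     # 6,144 + 6,144 = 12,288 trainable parameters

print(full, lora, f"{100 * lora / full:.1f}%")   # 589824 12288 2.1%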

Implementation

Fine-tuning with HuggingFace

from transformers import GPT2ForSequenceClassification, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Load pre-trained model and tokenizer
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no padding token by default
model.config.pad_token_id = tokenizer.pad_token_id   # required so the model can handle padded batches

# Prepare data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Example data
data = {
    "text": ["This movie was amazing!", "Terrible acting.", "It was okay."],
    "label": [1, 0, 0]  # 1=positive, 0=negative
}
dataset = Dataset.from_dict(data)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,  # Low learning rate for fine-tuning
    warmup_steps=100,
    logging_dir="./logs",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
trainer.train()
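
After training, the fine-tuned classifier can be used for prediction. A short usage sketch (the review text is arbitrary):

import torch

model.eval()
inputs = tokenizer("A wonderful, heartfelt film.", return_tensors="pt",
                   truncation=True, max_length=128)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()
print("positive" if prediction == 1 else "negative")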

LoRA Fine-tuning with PEFT

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# Load model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank (low-rank dimension)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"]  # Which layers to apply LoRA to
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Now only LoRA parameters are trainable
print(f"Trainable parameters: {model.num_parameters(only_trainable=True)}")
print(f"Total parameters: {model.num_parameters()}")

# Fine-tune as usual, but with far fewer trainable parameters
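
Once trained, only the small adapter weights need to be stored; PEFT saves them separately from the frozen base model. A sketch (the directory name is arbitrary):

# Save only the LoRA adapter weights (a few megabytes), not the full model
model.save_pretrained("./lora-adapter")

# Later: reload the base model and attach the trained adapter
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")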

Real-World Applications

Fine-tuning Use Cases

Domain-specific applications:

  • Medical: Fine-tune on medical literature for clinical applications
  • Legal: Fine-tune on legal documents for contract analysis
  • Code: Fine-tune on code repositories for programming assistance
  • Customer service: Fine-tune on support tickets for automated responses

Task-specific fine-tuning:

  • Sentiment analysis for product reviews
  • Named entity recognition for information extraction
  • Question answering for knowledge bases
  • Text classification for content moderation

When to Use Fine-tuning vs Prompting

Use fine-tuning when:

  • You have task-specific labeled data
  • You need high performance in a specific domain
  • Prompting doesn't achieve desired accuracy
  • You can afford training time and resources

Use prompting when:

  • You have limited or no labeled data
  • You need quick iteration
  • Task is simple enough for few-shot learning
  • You want to avoid training overhead

Test Your Understanding

Question 1: What is the key architectural difference between BERT and GPT?

A) BERT is encoder-only with bidirectional attention, while GPT is decoder-only with unidirectional (causal) attention
B) BERT is decoder-only while GPT is encoder-only
C) They have identical architectures
D) BERT uses only feedforward layers while GPT uses only attention

Question 2: What is Masked Language Modeling (MLM) in BERT?

A) A pre-training objective where random tokens are masked and the model predicts them using bidirectional context
B) A method to hide model parameters
C) A technique for generating text
D) A way to reduce model size

Question 3: What is Next Sentence Prediction (NSP) used for in BERT?

A) To help the model understand relationships between sentences, improving performance on tasks like question answering and natural language inference
B) To generate the next sentence in a sequence
C) To translate sentences
D) To classify individual sentences

Question 4: What are some popular BERT variants?

A) RoBERTa, ALBERT, and DistilBERT
B) GPT-2, GPT-3, and GPT-4
C) LSTM and GRU
D) ResNet and VGG

Question 5: Why is BERT particularly well-suited for understanding tasks rather than generation tasks?

A) Because its bidirectional attention allows it to see context from both directions, making it better at understanding relationships and meaning
B) Because it has fewer parameters than GPT
C) Because it uses a different activation function
D) Because it trains faster

Question 6: What is the main component of BERT's architecture?

A) Transformer encoder layers with self-attention and feedforward networks
B) Convolutional layers
C) Recurrent layers (LSTM/GRU)
D) Only feedforward layers

Question 7: How does BERT handle input sequences?

A) It processes the entire sequence simultaneously using bidirectional attention, allowing each token to attend to all other tokens
B) It processes tokens sequentially from left to right only
C) It processes tokens sequentially from right to left only
D) It randomly processes tokens

Question 8: What special tokens does BERT use?

A) [CLS] for classification, [SEP] for separation, and [MASK] for masked tokens
B) Only [START] and [END] tokens
C) No special tokens
D) Only punctuation marks

Question 9: What is the purpose of the [CLS] token in BERT?

A) It aggregates sequence-level information and is often used as the representation for classification tasks
B) It marks the end of a sentence
C) It indicates the start of generation
D) It is used for masking tokens

Question 10: How is BERT typically fine-tuned for downstream tasks?

A) A task-specific head (like a classification layer) is added on top of the pre-trained BERT model, and the entire model is fine-tuned on task-specific data
B) Only the new head is trained while BERT is frozen
C) BERT is retrained from scratch for each task
D) BERT cannot be fine-tuned

Question 11: What makes DistilBERT different from BERT?

A) DistilBERT is a smaller, faster, and lighter version of BERT that uses knowledge distillation to achieve similar performance with fewer parameters
B) DistilBERT is larger than BERT
C) DistilBERT uses a different architecture (CNN instead of Transformer)
D) DistilBERT is only for generation tasks

Question 12: What types of tasks is BERT particularly good at?

A) Text classification, named entity recognition, question answering, sentiment analysis, and natural language inference
B) Only text generation
C) Only image classification
D) Only speech recognition