Chapter 7: Encoder Architecture
Understanding the Encoder Stack
Learning Objectives
- Understand encoder architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Encoder Architecture
What is the Encoder?
The encoder is a stack of identical layers that process input sequences to create rich, contextualized representations. Each encoder layer refines the representations, building increasingly abstract and task-relevant features.
Think of the encoder like a team of editors:
- Layer 1: Like a copy editor - fixes basic grammar and spelling
- Layer 6: Like a content editor - understands meaning and context
- Layer 12: Like a senior editor - understands deep relationships and nuances
- Result: Each layer adds more understanding, building up to a complete representation
Encoder Layer Components
Each encoder layer consists of two main sublayers:
1. Multi-Head Self-Attention
- Allows each position to attend to all positions
- Learns relationships between words
- Creates contextualized representations
2. Feed-Forward Network
- Processes information at each position
- Adds non-linearity and capacity
- Transforms the attention output
Both sublayers use: Residual connections + Layer normalization
Information Flow Through Layers
How representations evolve:
- Layer 1: Detects local patterns (adjacent words, simple syntax)
- Layer 3: Understands phrases and basic semantics
- Layer 6: Captures sentence-level meaning and relationships
- Layer 12: Understands complex semantics, long-range dependencies, task-specific features
Key Concepts
Complete Encoder Layer Structure
Each encoder layer follows this structure (pre-norm):
Step-by-Step Flow
- Input: x (from previous layer or embeddings)
- Layer Norm: Normalize x
- Multi-Head Attention: Apply self-attention
- Residual: x + attention_output
- Layer Norm: Normalize again
- FFN: Apply feed-forward network
- Residual: previous + ffn_output
- Output: Refined representation (a code sketch of this flow follows below)
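A minimal sketch of this flow, assuming the attention and FFN sublayers are supplied as plain functions (a fuller class-based version appears in the Implementation section):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector over the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attention_fn, ffn_fn):
    x = x + attention_fn(layer_norm(x))  # sublayer 1: self-attention + residual
    x = x + ffn_fn(layer_norm(x))        # sublayer 2: FFN + residual
    return x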
Stacking Layers
Multiple encoder layers are stacked:
- Output of layer N becomes input to layer N+1
- Each layer refines and builds upon previous representations
- Like a pipeline: raw input → refined → more refined → final representation
Common Configurations
- BERT-base: 12 encoder layers
- BERT-large: 24 encoder layers
- GPT-2: 12-48 layers (decoder-only, but each layer has a similar structure)
- Original Transformer: 6 encoder layers
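For reference, the published hyperparameters of the encoder-based configurations above can be captured as a small lookup table (values as reported in the original Transformer and BERT papers; GPT-2 is omitted since it is decoder-only):

# Encoder hyperparameters as published (layers, model dim, heads, FFN dim)
configs = {
    "transformer-base": {"layers": 6,  "d_model": 512,  "heads": 8,  "d_ff": 2048},
    "bert-base":        {"layers": 12, "d_model": 768,  "heads": 12, "d_ff": 3072},
    "bert-large":       {"layers": 24, "d_model": 1024, "heads": 16, "d_ff": 4096},
}
for name, cfg in configs.items():
    print(f"{name}: {cfg}")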
Bidirectional Processing
Encoder processes sequences bidirectionally:
- Each position can attend to ALL positions (left and right)
- Unlike decoders, which are causal (left-to-right only; see the mask sketch below)
- Enables understanding full context
- Perfect for tasks like classification, NER, Q&A
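The difference is easiest to see in the attention mask. A minimal sketch: the encoder effectively uses a full mask (every position may attend to every other position), while a decoder uses a causal mask that blocks attention to future positions.

import numpy as np

seq_len = 5

# Encoder: every position can attend to every position (all ones)
encoder_mask = np.ones((seq_len, seq_len), dtype=int)

# Decoder: causal mask, position i can only attend to positions <= i
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print("Encoder (bidirectional) mask:\n", encoder_mask)
print("Decoder (causal) mask:\n", decoder_mask)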
Mathematical Formulations
Single Encoder Layer
\[h = x + \text{MultiHeadAttention}(\text{LayerNorm}(x))\]
\[\text{EncoderLayer}(x) = h + \text{FFN}(\text{LayerNorm}(h))\]
Breaking Down:
- LayerNorm(x): Normalize the input
- MultiHeadAttention(...): Apply self-attention
- h = x + ...: First residual connection
- LayerNorm(h): Normalize again
- FFN(...): Apply the feed-forward network
- h + ...: Final residual connection
Complete Encoder Stack
\[\text{Encoder}(x) = \text{EncoderLayer}_N(\ldots\text{EncoderLayer}_2(\text{EncoderLayer}_1(x)))\]
How Layers Stack:
- Input embeddings → Layer 1 → Layer 2 → ... → Layer N
- Each layer's output becomes next layer's input
- Final output: Rich, contextualized representations
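The stack formula maps directly onto function composition. A minimal sketch, assuming each layer is a function that takes and returns an array of the same shape:

from functools import reduce

def encoder_stack(x, layers):
    # Apply Layer 1 first, then Layer 2, ..., up to Layer N
    return reduce(lambda h, layer: layer(h), layers, x)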
Detailed Examples
Example: Processing "The cat sat on the mat"
Let's trace the sentence through a 6-layer encoder:
Input (After Embeddings + Positional Encoding)
- Each word: [embedding + position] (512 dimensions)
- "The": [0.1, 0.2, ..., 0.5]
- "cat": [0.3, 0.4, ..., 0.7]
- ... (all 6 words)
After Layer 1 (Local Patterns)
- Attention learns: "sat" attends to "cat" (subject-verb)
- Attention learns: "on" attends to "sat" (preposition-verb)
- FFN processes these local relationships
- Output: Basic syntactic patterns detected
After Layer 3 (Phrase Understanding)
- Attention learns: "The cat" as a noun phrase
- Attention learns: "sat on the mat" as a verb phrase
- FFN processes phrase-level semantics
- Output: Phrase structures understood
After Layer 6 (Sentence Meaning)
- Attention learns: Complete sentence structure
- Attention learns: Semantic relationships (cat → animal, mat → object)
- FFN processes sentence-level meaning
- Output: Rich semantic representation ready for tasks
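In practice, the per-layer representations of a real model can be inspected directly. A sketch using the Hugging Face transformers library with a pretrained BERT model (assuming transformers and a PyTorch backend are installed); exactly what each layer "learns" varies by model and is an area of ongoing research:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

# hidden_states is a tuple: embeddings output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_size)
hidden_states = outputs.hidden_states
print(len(hidden_states))       # 13 for bert-base: embeddings + 12 layers
print(hidden_states[1].shape)   # representation after layer 1
print(hidden_states[12].shape)  # representation after layer 12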
Implementation
Complete Encoder Implementation
import numpy as np

class LayerNormalization:
    """Minimal layer normalization over the feature dimension"""
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # learnable scale
        self.beta = np.zeros(d_model)   # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class EncoderLayer:
    """Single (pre-norm) encoder layer"""
    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Multi-head attention and the feed-forward network are supplied
        # as functions in forward() to keep the focus on layer structure.
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through one encoder layer
        Parameters:
            x: Input (batch, seq_len, d_model)
            attention_fn: Function for multi-head self-attention
            ffn_fn: Function for the feed-forward network
        """
        # Sublayer 1: multi-head self-attention with residual
        x_norm1 = self.layer_norm1.forward(x)
        attention_output = attention_fn(x_norm1)
        x = x + attention_output  # residual connection

        # Sublayer 2: FFN with residual
        x_norm2 = self.layer_norm2.forward(x)
        ffn_output = ffn_fn(x_norm2)
        x = x + ffn_output  # residual connection
        return x

class Encoder:
    """Complete encoder stack"""
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        """
        Initialize the encoder
        Parameters:
            num_layers: Number of encoder layers
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Feed-forward dimension
        """
        self.num_layers = num_layers
        self.layers = [
            EncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through the encoder stack
        Parameters:
            x: Input embeddings (batch, seq_len, d_model)
            attention_fn: Function for attention
            ffn_fn: Function for the FFN
        """
        # Process through each layer sequentially
        for layer in self.layers:
            x = layer.forward(x, attention_fn, ffn_fn)
        return x

# Example usage
num_layers, d_model, num_heads, d_ff = 12, 512, 8, 2048
encoder = Encoder(num_layers, d_model, num_heads, d_ff)

# Input: (batch=2, seq_len=10, d_model=512)
x = np.random.randn(2, 10, 512)

# Forward pass (concrete attention_fn and ffn_fn are sketched below)
# output = encoder.forward(x, attention_fn, ffn_fn)
print(f"Encoder with {num_layers} layers created")
print(f"Input shape: {x.shape}")
Real-World Applications
Encoder-Only Models
Encoder architecture is used in many important models:
1. BERT (Bidirectional Encoder Representations from Transformers)
- 12-24 encoder layers
- Bidirectional processing (sees full context)
- Used for: Classification, NER, Q&A, sentence similarity
- Revolutionary for understanding tasks
2. RoBERTa (Robustly Optimized BERT Pretraining Approach)
- Improved training of BERT (more data, longer training, dynamic masking)
- Same encoder architecture
- Better performance through better training alone
3. ALBERT (A Lite BERT)
- Parameter sharing across layers
- More parameter-efficient than BERT
- Same encoder structure, with weights shared across layers (sketched below)
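Cross-layer parameter sharing can be illustrated with the EncoderLayer class from the Implementation section above: instead of building N independent layers, one layer object is reused N times, so the weights are shared while the computation is still applied N times. This is a sketch of the idea only, not ALBERT's actual implementation.

# Illustrative ALBERT-style sharing: one set of weights applied num_layers
# times (contrast with the list of distinct layers in the Encoder class).
shared_layer = EncoderLayer(d_model=768, num_heads=12, d_ff=3072)

def shared_encoder_forward(x, num_layers, attention_fn, ffn_fn):
    for _ in range(num_layers):
        x = shared_layer.forward(x, attention_fn, ffn_fn)
    return x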
What Each Layer Learns
Research shows layers specialize:
- Early layers (1-3): Syntax, POS tags, local patterns
- Middle layers (4-8): Semantics, phrase structure, relationships
- Deep layers (9-12): Task-specific features, complex reasoning
Test Your Understanding
Question 1: How many sublayers does each encoder layer have?
A) 1
B) 2 (Multi-head attention and FFN)
C) 3
D) 4
Question 2: What is the key difference between encoder and decoder?
A) Encoder is bidirectional, decoder is causal (left-to-right only)
B) Encoder has more layers
C) Encoder doesn't use attention
D) They are the same
Question 3: What happens to representations as they go through more encoder layers?
A) They become more abstract and task-specific
B) They become simpler
C) They stay the same
D) They become random
Question 4: How does the encoder process input bidirectionally?
A) Encoder uses self-attention that allows each token to attend to all other tokens in both directions simultaneously, creating representations that incorporate context from the entire sequence
B) Only left to right
C) Only right to left
D) Randomly
Question 5: What are the main components of an encoder layer?
A) Multi-head self-attention, feed-forward network, residual connections, and layer normalization, typically in pre-norm or post-norm configuration
B) Only attention
C) Only FFN
D) Random components
Question 6: How do multiple encoder layers create deeper understanding?
A) Each layer refines representations from the previous layer, with early layers capturing local patterns and syntax, while deeper layers capture semantic relationships and high-level abstractions
B) They're all the same
C) Only first layer matters
D) Only last layer matters
Question 7: What is the difference between encoder-only and encoder-decoder architectures?
A) Encoder-only (like BERT) processes input for understanding tasks, while encoder-decoder (like T5) uses encoder for input and decoder for generation, suitable for seq2seq tasks
B) They're the same
C) Encoder-only generates
D) Encoder-decoder only encodes
Question 8: How would you implement an encoder stack from scratch?
A) Stack N encoder layers, each with self-attention and FFN. Add positional encoding to input, pass through layers with residual connections and layer norm. Output contextualized representations
B) Just one layer
C) No attention needed
D) Random layers
Question 9: What happens to information as it flows through encoder layers?
A) Information gets progressively refined and abstracted. Early layers focus on word-level and local patterns, middle layers capture phrase-level relationships, deeper layers capture document-level semantics and complex relationships
B) It stays the same
C) It gets simpler
D) It's lost
Question 10: Why do encoder models like BERT use [CLS] token?
A) The [CLS] token's final representation aggregates information from the entire sequence, making it useful as a sentence-level representation for classification tasks
B) It's not used
C) It's just padding
D) No reason
Question 11: How does encoder architecture scale with sequence length?
A) Self-attention has O(n²) complexity, so computation grows quadratically with sequence length. This limits practical sequence lengths, leading to techniques like sparse attention or chunking for longer sequences
B) Linear scaling
C) Constant time
D) No scaling issues
Question 12: What tasks are encoder-only models best suited for?
A) Understanding tasks like classification, named entity recognition, question answering, sentiment analysis, where you need to process and understand input rather than generate new text
B) Only generation
C) Only translation
D) All tasks equally