Chapter 7: Encoder Architecture
Encoder Architecture, part of the Transformer Architecture Deep Dive.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the components that make up a transformer encoder layer.
- Trace how a stack of encoder layers contributes to sequence modeling.
- Recognize the implementation trade-offs behind transformer architectures.
Understanding the Encoder Stack
Encoder Architecture
What is the Encoder?
The encoder is a stack of identical layers that process input sequences to create rich, contextualized representations. Each encoder layer refines the representations, building increasingly abstract and task-relevant features.
Think of the encoder like a team of editors:
- Layer 1: Like a copy editor - fixes basic grammar and spelling
- Layer 6: Like a content editor - understands meaning and context
- Layer 12: Like a senior editor - understands deep relationships and nuances
- Result: Each layer adds more understanding, building up to a complete representation
Encoder Layer Components
Each encoder layer consists of two main sublayers:
1. Multi-Head Self-Attention
- Allows each position to attend to all positions
- Learns relationships between words
- Creates contextualized representations
2. Feed-Forward Network
- Processes information at each position
- Adds non-linearity and capacity
- Transforms the attention output
Both sublayers use: Residual connections + Layer normalization
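The division of labor between the two sublayers can be checked numerically. The sketch below is a minimal single-head illustration with random weights; the names `self_attention` and `feed_forward` and the toy dimensions are assumptions made for this example, not part of the chapter's implementation. Perturbing one input position changes the attention output at every position, while the feed-forward output changes only at the perturbed position.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32  # toy sizes chosen for illustration

# Random matrices stand in for learned parameters
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                 for _ in range(3))
W_1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
W_2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over (seq_len, d_model)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(q @ k.T / np.sqrt(d_model))  # (seq_len, seq_len)
    return weights @ v

def feed_forward(x):
    """Position-wise FFN: the same two-layer MLP applied to every row."""
    return np.maximum(0.0, x @ W_1) @ W_2  # ReLU non-linearity

x = rng.normal(size=(seq_len, d_model))
x_changed = x.copy()
x_changed[0] += 1.0  # perturb only position 0

# Total absolute change in the output at each position
attn_diff = np.abs(self_attention(x_changed) - self_attention(x)).sum(axis=-1)
ffn_diff = np.abs(feed_forward(x_changed) - feed_forward(x)).sum(axis=-1)

print("attention change per position:", attn_diff)
print("FFN change per position:      ", ffn_diff)
```

This is why the two sublayers are paired: attention moves information between positions, and the feed-forward network then transforms each position's vector independently.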
Information Flow Through Layers
How representations tend to evolve (a rough picture; the exact role of each layer varies by model and task):
- Layer 1: Detects local patterns (adjacent words, simple syntax)
- Layer 3: Understands phrases and basic semantics
- Layer 6: Captures sentence-level meaning and relationships
- Layer 12: Understands complex semantics, long-range dependencies, task-specific features
Key Concepts
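Complete Encoder Layer Structure
Each encoder layer follows this structure (the pre-norm variant; the original Transformer and BERT instead apply layer normalization after each residual addition, known as post-norm):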
Step-by-Step Flow
- Input: x (from previous layer or embeddings)
- Layer Norm: Normalize x
- Multi-Head Attention: Apply self-attention
- Residual: x + attention_output
- Layer Norm: Normalize again
- FFN: Apply feed-forward network
- Residual: result of the first residual step + ffn_output
- Output: Refined representation
Stacking Layers
Multiple encoder layers are stacked:
- Output of layer N becomes input to layer N+1
- Each layer refines and builds upon previous representations
- Like a pipeline: raw input → refined → more refined → final representation
Common Configurations
- BERT-base: 12 encoder layers
- BERT-large: 24 encoder layers
- GPT-2: 12-48 layers (a decoder-only model, listed for comparison: its layers have a similar structure but use causal rather than bidirectional attention)
- Original Transformer: 6 encoder layers
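Layer count is only one part of a configuration; the hidden size and feed-forward size determine how large each layer is. The sketch below estimates the parameters in the encoder stack alone. The hidden sizes (768 and 3072 for BERT-base, 1024 and 4096 for BERT-large, 512 and 2048 for the original Transformer base model) come from the published configurations rather than from this chapter, the function name is hypothetical, and the count assumes a bias on every linear projection. Embeddings and task heads are excluded.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one encoder layer, assuming biases on every projection."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # scale and shift for two norms
    return attention + ffn + layer_norms

# Published configurations (assumed here, not stated in the text above)
configs = {
    "BERT-base": {"num_layers": 12, "d_model": 768, "d_ff": 3072},
    "BERT-large": {"num_layers": 24, "d_model": 1024, "d_ff": 4096},
    "Original Transformer (base)": {"num_layers": 6, "d_model": 512, "d_ff": 2048},
}

totals = {}
for name, cfg in configs.items():
    per_layer = encoder_layer_params(cfg["d_model"], cfg["d_ff"])
    totals[name] = cfg["num_layers"] * per_layer
    print(f"{name}: {per_layer:,} per layer, {totals[name]:,} in the stack")
```

Doubling the layer count from BERT-base to BERT-large accounts for only part of the growth; the wider hidden size increases the per-layer cost roughly quadratically.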
Bidirectional Processing
The encoder processes sequences bidirectionally:
- Each position can attend to ALL positions (left and right)
- Unlike decoders, which are causal (each position attends only to itself and earlier positions)
- Enables each token's representation to reflect the full context
- Well suited to tasks such as classification, NER, and Q&A
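The difference between bidirectional and causal attention comes down to a mask applied before the softmax. The following sketch uses uniform scores so the pattern is easy to read; the helper name `attention_weights` is an assumption for illustration.

```python
import numpy as np

def attention_weights(scores, mask=None):
    """Softmax over the last axis; positions where mask is False get zero weight."""
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform scores, purely for readability

# Encoder: no mask, so every position attends to every position
bidirectional = attention_weights(scores)

# Decoder: lower-triangular mask, so position i attends only to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
causal = attention_weights(scores, causal_mask)

print("Encoder (bidirectional):\n", bidirectional)
print("Decoder (causal):\n", causal)
```

In the encoder matrix every row spreads weight over the whole sequence, whereas in the causal matrix everything above the diagonal is zero.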
Mathematical Formulations
Single Encoder Layer
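Written out in the pre-norm form described in the step-by-step flow, a single layer can be expressed as follows. The intermediate symbol x' is introduced here for readability; in the breakdown, the x in the final residual refers to this intermediate result.

```latex
\begin{aligned}
x' &= x + \mathrm{MultiHeadAttention}\big(\mathrm{LayerNorm}(x)\big) \\
\mathrm{EncoderLayer}(x) &= x' + \mathrm{FFN}\big(\mathrm{LayerNorm}(x')\big)
\end{aligned}
```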
Breaking Down:
- LayerNorm(x): Normalize input
- MultiHeadAttention(...): Self-attention
- x + ...: Residual connection
- LayerNorm(...): Normalize again
- FFN(...): Feed-forward network
- x + ...: Final residual connection
Complete Encoder Stack
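One way to write the full stack, assuming N layers and using h_l (a symbol introduced here) for the output of layer l:

```latex
\begin{aligned}
h_0 &= \text{input embeddings (with positional encoding)} \\
h_l &= \mathrm{EncoderLayer}_l(h_{l-1}), \quad l = 1, \dots, N \\
\mathrm{Encoder}(x) &= h_N
\end{aligned}
```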
How Layers Stack:
- Input embeddings → Layer 1 → Layer 2 → ... → Layer N
- Each layer's output becomes next layer's input
- Final output: Rich, contextualized representations
Detailed Examples
Example: Processing "The cat sat on the mat"
Let's trace through a 6-layer encoder, looking at layers 1, 3, and 6 (the behavior described for each layer is illustrative):
Input (After Embeddings + Positional Encoding)
- Each word: [embedding + position] (512 dimensions)
- "The": [0.1, 0.2, ..., 0.5]
- "cat": [0.3, 0.4, ..., 0.7]
- ... (all 6 words)
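To make the input step concrete, here is a minimal sketch that builds the (6, 512) input matrix for this sentence. It assumes sinusoidal positional encodings as in the original Transformer (the chapter does not say which scheme is used), a randomly initialized embedding table, and a toy vocabulary built from the sentence itself; all names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 512

# Toy vocabulary: "The" and "the" are kept as distinct entries
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(len(vocab), d_model))

token_ids = [vocab[tok] for tok in tokens]
pe = sinusoidal_positions(len(tokens), d_model)
x = embedding_table[token_ids] + pe  # one 512-dimensional row per word

print(x.shape)
```

The resulting matrix is what the first encoder layer receives; a batch dimension would be added in front for batched processing.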
After Layer 1 (Local Patterns)
- Attention learns: "sat" attends to "cat" (subject-verb)
- Attention learns: "on" attends to "sat" (preposition-verb)
- FFN processes these local relationships
- Output: Basic syntactic patterns detected
After Layer 3 (Phrase Understanding)
- Attention learns: "The cat" as a noun phrase
- Attention learns: "sat on the mat" as a verb phrase
- FFN processes phrase-level semantics
- Output: Phrase structures understood
After Layer 6 (Sentence Meaning)
- Attention learns: Complete sentence structure
- Attention learns: Semantic relationships (cat → animal, mat → object)
- FFN processes sentence-level meaning
- Output: Rich semantic representation ready for tasks
Implementation
Complete Encoder Implementation
import numpy as np


class LayerNormalization:
    """Minimal layer normalization over the last (feature) dimension"""
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # Scale (learned during training)
        self.beta = np.zeros(d_model)   # Shift (learned during training)
        self.eps = eps                  # Avoids division by zero

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta


class EncoderLayer:
    """Single pre-norm encoder layer"""
    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Multi-head attention (simplified - would need full implementation)
        # self.attention = MultiHeadAttention(d_model, num_heads)
        # Feed-forward network
        # self.ffn = FeedForwardNetwork(d_model, d_ff)
        # Layer normalizations, one per sublayer
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through encoder layer

        Parameters:
            x: Input (batch, seq_len, d_model)
            attention_fn: Function for multi-head attention
            ffn_fn: Function for feed-forward network
        """
        # Sublayer 1: Multi-head self-attention with residual
        x_norm1 = self.layer_norm1.forward(x)
        attention_output = attention_fn(x_norm1)
        x = x + attention_output  # Residual connection

        # Sublayer 2: FFN with residual
        x_norm2 = self.layer_norm2.forward(x)
        ffn_output = ffn_fn(x_norm2)
        x = x + ffn_output  # Residual connection
        return x


class Encoder:
    """Complete encoder stack"""
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        """
        Initialize encoder

        Parameters:
            num_layers: Number of encoder layers
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Feed-forward dimension
        """
        self.num_layers = num_layers
        self.layers = [
            EncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through encoder stack

        Parameters:
            x: Input embeddings (batch, seq_len, d_model)
            attention_fn: Function for attention
            ffn_fn: Function for FFN
        """
        # Process through each layer sequentially.
        # Note: the same attention_fn and ffn_fn are reused by every layer here;
        # a full implementation gives each layer its own parameters.
        for layer in self.layers:
            x = layer.forward(x, attention_fn, ffn_fn)
        return x


# Example usage
num_layers, d_model, num_heads, d_ff = 12, 512, 8, 2048
encoder = Encoder(num_layers, d_model, num_heads, d_ff)

# Input: (batch=2, seq_len=10, d_model=512)
x = np.random.randn(2, 10, d_model)

# Forward pass (needs callables that map (batch, seq_len, d_model)
# to an array of the same shape)
# output = encoder.forward(x, attention_fn, ffn_fn)

print(f"Encoder with {num_layers} layers created")
print(f"Input shape: {x.shape}")
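The listing above leaves `attention_fn` and `ffn_fn` to the caller. As a hedged sketch of what such callables might look like, the block below builds a multi-head self-attention function and a position-wise feed-forward function with fixed random weights. The factory names `make_attention_fn` and `make_ffn_fn` are hypothetical, the weights are untrained, and there is no padding mask or dropout.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_attention_fn(d_model, num_heads, rng):
    """Return a multi-head self-attention callable with fixed random weights."""
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    W_q, W_k, W_v, W_o = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
    )

    def split_heads(t):
        b, s, _ = t.shape
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        return t.reshape(b, s, num_heads, d_head).transpose(0, 2, 1, 3)

    def attention_fn(x):
        q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
        scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (b, heads, seq, seq)
        context = softmax(scores) @ v                            # (b, heads, seq, d_head)
        b, _, s, _ = context.shape
        merged = context.transpose(0, 2, 1, 3).reshape(b, s, d_model)
        return merged @ W_o                                      # output projection

    return attention_fn

def make_ffn_fn(d_model, d_ff, rng):
    """Return a position-wise feed-forward callable with fixed random weights."""
    W_1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff))
    b_1 = np.zeros(d_ff)
    W_2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model))
    b_2 = np.zeros(d_model)

    def ffn_fn(x):
        return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2  # ReLU between projections

    return ffn_fn

rng = np.random.default_rng(0)
attention_fn = make_attention_fn(d_model=512, num_heads=8, rng=rng)
ffn_fn = make_ffn_fn(d_model=512, d_ff=2048, rng=rng)

x = rng.normal(size=(2, 10, 512))
attn_out, ffn_out = attention_fn(x), ffn_fn(x)
print(attn_out.shape, ffn_out.shape)
```

Both callables take and return arrays of shape (batch, seq_len, d_model), so they match the signature that `Encoder.forward(x, attention_fn, ffn_fn)` expects. Because that method passes the same two callables to every layer, all layers would share attention and FFN weights; a standard encoder gives each layer its own.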
Real-World Applications
Encoder-Only Models
Encoder architecture is used in many important models:
1. BERT (Bidirectional Encoder Representations from Transformers)
- 12 (base) or 24 (large) encoder layers
- Bidirectional processing (sees full context)
- Used for: Classification, NER, Q&A, sentence similarity
- Set new state-of-the-art results on many language-understanding benchmarks when it was released
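2. RoBERTa (Robustly Optimized BERT Pretraining Approach)
- Improved training recipe for BERT
- Same encoder architecture
- Better performance from longer training on more data, larger batches, dynamic masking, and dropping the next-sentence-prediction objective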
3. ALBERT (A Lite BERT)
- Parameter sharing across layers
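- More parameter-efficient than BERT (far fewer weights to store; sharing does not by itself reduce the compute of a forward pass)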
- Same encoder structure, shared weights
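Cross-layer parameter sharing can be illustrated with a toy stack. In the sketch below, `TinyLayer` is a stand-in with a single weight matrix, not a full encoder layer, and all names are hypothetical. The independent stack creates one layer object per depth (BERT-style), while the shared stack reuses one object at every depth (ALBERT-style).

```python
import numpy as np

class TinyLayer:
    """Stand-in layer with one weight matrix (not a full encoder layer)."""
    def __init__(self, d_model, rng):
        self.W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

    def forward(self, x):
        return x + np.tanh(x @ self.W)  # residual around a toy transformation

def count_unique_params(layers):
    """Count parameters once per distinct layer object."""
    unique = {id(layer): layer for layer in layers}
    return sum(layer.W.size for layer in unique.values())

rng = np.random.default_rng(0)
num_layers, d_model = 12, 64

independent = [TinyLayer(d_model, rng) for _ in range(num_layers)]  # one set of weights per layer
shared_layer = TinyLayer(d_model, rng)
shared = [shared_layer] * num_layers                                 # same weights at every depth

x = rng.normal(size=(2, 5, d_model))
outputs = {}
for name, stack in (("independent", independent), ("shared", shared)):
    h = x
    for layer in stack:  # both stacks still run 12 forward steps
        h = layer.forward(h)
    outputs[name] = h

print("independent params:", count_unique_params(independent))
print("shared params:     ", count_unique_params(shared))
```

Both stacks apply 12 layer computations per input, so sharing reduces the number of stored parameters rather than the work done in a forward pass.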
What Each Layer Learns
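Probing studies of BERT-style models suggest that layers tend to specialize (the ranges below assume a 12-layer model and are approximate):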
- Early layers (1-3): Syntax, POS tags, local patterns
- Middle layers (4-8): Semantics, phrase structure, relationships
- Deep layers (9-12): Task-specific features, complex reasoning