Chapter 7: Encoder Architecture
Understanding the Encoder Stack
Learning Objectives
- Understand encoder architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Encoder Architecture
What is the Encoder?
The encoder is a stack of identical layers that process input sequences to create rich, contextualized representations. Each encoder layer refines the representations, building increasingly abstract and task-relevant features.
Think of the encoder like a team of editors:
- Layer 1: Like a copy editor - fixes basic grammar and spelling
- Layer 6: Like a content editor - understands meaning and context
- Layer 12: Like a senior editor - understands deep relationships and nuances
- Result: Each layer adds more understanding, building up to a complete representation
Encoder Layer Components
Each encoder layer consists of two main sublayers:
1. Multi-Head Self-Attention
- Allows each position to attend to all positions
- Learns relationships between words
- Creates contextualized representations
2. Feed-Forward Network
- Processes information at each position
- Adds non-linearity and capacity
- Transforms the attention output
Both sublayers use: Residual connections + Layer normalization
Information Flow Through Layers
How representations evolve:
- Layer 1: Detects local patterns (adjacent words, simple syntax)
- Layer 3: Understands phrases and basic semantics
- Layer 6: Captures sentence-level meaning and relationships
- Layer 12: Understands complex semantics, long-range dependencies, task-specific features
Key Concepts
Complete Encoder Layer Structure
Each encoder layer follows this structure (pre-norm):
Step-by-Step Flow
- Input: x (from previous layer or embeddings)
- Layer Norm: Normalize x
- Multi-Head Attention: Apply self-attention
- Residual: x + attention_output
- Layer Norm: Normalize again
- FFN: Apply feed-forward network
- Residual: previous + ffn_output
- Output: Refined representation (a code sketch of this flow follows below)
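A minimal sketch of this flow, assuming the attention and FFN sublayers are supplied as plain functions (a fuller class-based version appears in the Implementation section):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector over the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attention_fn, ffn_fn):
    x = x + attention_fn(layer_norm(x))  # sublayer 1: self-attention + residual
    x = x + ffn_fn(layer_norm(x))        # sublayer 2: FFN + residual
    return x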
Stacking Layers
Multiple encoder layers are stacked:
- Output of layer N becomes input to layer N+1
- Each layer refines and builds upon previous representations
- Like a pipeline: raw input → refined → more refined → final representation
Common Configurations
- BERT-base: 12 encoder layers
- BERT-large: 24 encoder layers
- GPT-2: 12-48 layers (decoder-only, but each layer has a similar structure)
- Original Transformer: 6 encoder layers
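For reference, the published hyperparameters of the encoder-based configurations above can be captured as a small lookup table (values as reported in the original Transformer and BERT papers; GPT-2 is omitted since it is decoder-only):

# Encoder hyperparameters as published (layers, model dim, heads, FFN dim)
configs = {
    "transformer-base": {"layers": 6,  "d_model": 512,  "heads": 8,  "d_ff": 2048},
    "bert-base":        {"layers": 12, "d_model": 768,  "heads": 12, "d_ff": 3072},
    "bert-large":       {"layers": 24, "d_model": 1024, "heads": 16, "d_ff": 4096},
}
for name, cfg in configs.items():
    print(f"{name}: {cfg}")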
Bidirectional Processing
Encoder processes sequences bidirectionally:
- Each position can attend to ALL positions (left and right)
- Unlike decoders, which are causal (left-to-right only; see the mask sketch below)
- Enables understanding full context
- Perfect for tasks like classification, NER, Q&A
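The difference is easiest to see in the attention mask. A minimal sketch: the encoder effectively uses a full mask (every position may attend to every other position), while a decoder uses a causal mask that blocks attention to future positions.

import numpy as np

seq_len = 5

# Encoder: every position can attend to every position (all ones)
encoder_mask = np.ones((seq_len, seq_len), dtype=int)

# Decoder: causal mask, position i can only attend to positions <= i
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print("Encoder (bidirectional) mask:\n", encoder_mask)
print("Decoder (causal) mask:\n", decoder_mask)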
Mathematical Formulations
Single Encoder Layer
\[h = x + \text{MultiHeadAttention}(\text{LayerNorm}(x))\]
\[\text{EncoderLayer}(x) = h + \text{FFN}(\text{LayerNorm}(h))\]
Breaking Down:
- LayerNorm(x): Normalize the input
- MultiHeadAttention(...): Apply self-attention
- h = x + ...: First residual connection
- LayerNorm(h): Normalize again
- FFN(...): Apply the feed-forward network
- h + ...: Final residual connection
Complete Encoder Stack
\[\text{Encoder}(x) = \text{EncoderLayer}_N(\ldots\text{EncoderLayer}_2(\text{EncoderLayer}_1(x)))\]
How Layers Stack:
- Input embeddings → Layer 1 → Layer 2 → ... → Layer N
- Each layer's output becomes next layer's input
- Final output: Rich, contextualized representations
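The stack formula maps directly onto function composition. A minimal sketch, assuming each layer is a function that takes and returns an array of the same shape:

from functools import reduce

def encoder_stack(x, layers):
    # Apply Layer 1 first, then Layer 2, ..., up to Layer N
    return reduce(lambda h, layer: layer(h), layers, x)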
Detailed Examples
Example: Processing "The cat sat on the mat"
Let's trace the sentence through a 6-layer encoder:
Input (After Embeddings + Positional Encoding)
- Each word: [embedding + position] (512 dimensions)
- "The": [0.1, 0.2, ..., 0.5]
- "cat": [0.3, 0.4, ..., 0.7]
- ... (all 6 words)
After Layer 1 (Local Patterns)
- Attention learns: "sat" attends to "cat" (subject-verb)
- Attention learns: "on" attends to "sat" (preposition-verb)
- FFN processes these local relationships
- Output: Basic syntactic patterns detected
After Layer 3 (Phrase Understanding)
- Attention learns: "The cat" as a noun phrase
- Attention learns: "sat on the mat" as a verb phrase
- FFN processes phrase-level semantics
- Output: Phrase structures understood
After Layer 6 (Sentence Meaning)
- Attention learns: Complete sentence structure
- Attention learns: Semantic relationships (cat → animal, mat → object)
- FFN processes sentence-level meaning
- Output: Rich semantic representation ready for tasks
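In practice, the per-layer representations of a real model can be inspected directly. A sketch using the Hugging Face transformers library with a pretrained BERT model (assuming transformers and a PyTorch backend are installed); exactly what each layer "learns" varies by model and is an area of ongoing research:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

# hidden_states is a tuple: embeddings output plus one tensor per layer,
# each of shape (batch, seq_len, hidden_size)
hidden_states = outputs.hidden_states
print(len(hidden_states))       # 13 for bert-base: embeddings + 12 layers
print(hidden_states[1].shape)   # representation after layer 1
print(hidden_states[12].shape)  # representation after layer 12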
Implementation
Complete Encoder Implementation
import numpy as np

class LayerNormalization:
    """Minimal layer normalization over the feature dimension"""
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # learnable scale
        self.beta = np.zeros(d_model)   # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class EncoderLayer:
    """Single (pre-norm) encoder layer"""
    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Multi-head attention and the feed-forward network are supplied
        # as functions in forward() to keep the focus on layer structure.
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through one encoder layer
        Parameters:
            x: Input (batch, seq_len, d_model)
            attention_fn: Function for multi-head self-attention
            ffn_fn: Function for the feed-forward network
        """
        # Sublayer 1: multi-head self-attention with residual
        x_norm1 = self.layer_norm1.forward(x)
        attention_output = attention_fn(x_norm1)
        x = x + attention_output  # residual connection

        # Sublayer 2: FFN with residual
        x_norm2 = self.layer_norm2.forward(x)
        ffn_output = ffn_fn(x_norm2)
        x = x + ffn_output  # residual connection
        return x

class Encoder:
    """Complete encoder stack"""
    def __init__(self, num_layers, d_model, num_heads, d_ff):
        """
        Initialize the encoder
        Parameters:
            num_layers: Number of encoder layers
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Feed-forward dimension
        """
        self.num_layers = num_layers
        self.layers = [
            EncoderLayer(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]

    def forward(self, x, attention_fn, ffn_fn):
        """
        Forward pass through the encoder stack
        Parameters:
            x: Input embeddings (batch, seq_len, d_model)
            attention_fn: Function for attention
            ffn_fn: Function for the FFN
        """
        # Process through each layer sequentially
        for layer in self.layers:
            x = layer.forward(x, attention_fn, ffn_fn)
        return x

# Example usage
num_layers, d_model, num_heads, d_ff = 12, 512, 8, 2048
encoder = Encoder(num_layers, d_model, num_heads, d_ff)

# Input: (batch=2, seq_len=10, d_model=512)
x = np.random.randn(2, 10, 512)

# Forward pass (concrete attention_fn and ffn_fn are sketched below)
# output = encoder.forward(x, attention_fn, ffn_fn)
print(f"Encoder with {num_layers} layers created")
print(f"Input shape: {x.shape}")
Real-World Applications
Encoder-Only Models
Encoder architecture is used in many important models:
1. BERT (Bidirectional Encoder Representations from Transformers)
- 12-24 encoder layers
- Bidirectional processing (sees full context)
- Used for: Classification, NER, Q&A, sentence similarity
- Revolutionary for understanding tasks
2. RoBERTa (Robustly Optimized BERT Pretraining Approach)
- Improved training of BERT (more data, longer training, dynamic masking)
- Same encoder architecture
- Better performance through better training alone
3. ALBERT (A Lite BERT)
- Parameter sharing across layers
- More parameter-efficient than BERT
- Same encoder structure, with weights shared across layers (sketched below)
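Cross-layer parameter sharing can be illustrated with the EncoderLayer class from the Implementation section above: instead of building N independent layers, one layer object is reused N times, so the weights are shared while the computation is still applied N times. This is a sketch of the idea only, not ALBERT's actual implementation.

# Illustrative ALBERT-style sharing: one set of weights applied num_layers
# times (contrast with the list of distinct layers in the Encoder class).
shared_layer = EncoderLayer(d_model=768, num_heads=12, d_ff=3072)

def shared_encoder_forward(x, num_layers, attention_fn, ffn_fn):
    for _ in range(num_layers):
        x = shared_layer.forward(x, attention_fn, ffn_fn)
    return x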
What Each Layer Learns
Research shows layers specialize:
- Early layers (1-3): Syntax, POS tags, local patterns
- Middle layers (4-8): Semantics, phrase structure, relationships
- Deep layers (9-12): Task-specific features, complex reasoning
Test Your Understanding
Question 1: How many sublayers does each encoder layer have?
A) 1
B) 2 (Multi-head attention and FFN)
C) 3
D) 4
Question 2: What is the key difference between encoder and decoder?
A) Encoder is bidirectional, decoder is causal (left-to-right only)
B) Encoder has more layers
C) Encoder doesn't use attention
D) They are the same
Question 3: What happens to representations as they go through more encoder layers?
A) They become more abstract and task-specific
B) They become simpler
C) They stay the same
D) They become random
Question 4: How does the encoder process input bidirectionally?
A) Encoder uses self-attention that allows each token to attend to all other tokens in both directions simultaneously, creating representations that incorporate context from the entire sequence
B) Only left to right
C) Only right to left
D) Randomly
Question 5: What are the main components of an encoder layer?
A) Multi-head self-attention, feed-forward network, residual connections, and layer normalization, typically in pre-norm or post-norm configuration
B) Only attention
C) Only FFN
D) Random components
Question 6: How do multiple encoder layers create deeper understanding?
A) Each layer refines representations from the previous layer, with early layers capturing local patterns and syntax, while deeper layers capture semantic relationships and high-level abstractions
B) They're all the same
C) Only first layer matters
D) Only last layer matters
Question 7: What is the difference between encoder-only and encoder-decoder architectures?
A) Encoder-only (like BERT) processes input for understanding tasks, while encoder-decoder (like T5) uses encoder for input and decoder for generation, suitable for seq2seq tasks
B) They're the same
C) Encoder-only generates
D) Encoder-decoder only encodes
Question 8: How would you implement an encoder stack from scratch?
A) Stack N encoder layers, each with self-attention and FFN. Add positional encoding to input, pass through layers with residual connections and layer norm. Output contextualized representations
B) Just one layer
C) No attention needed
D) Random layers
Question 9: What happens to information as it flows through encoder layers?
A) Information gets progressively refined and abstracted. Early layers focus on word-level and local patterns, middle layers capture phrase-level relationships, deeper layers capture document-level semantics and complex relationships
B) It stays the same
C) It gets simpler
D) It's lost
Question 10: Why do encoder models like BERT use [CLS] token?
A) The [CLS] token's final representation aggregates information from the entire sequence, making it useful as a sentence-level representation for classification tasks
B) It's not used
C) It's just padding
D) No reason
Question 11: How does encoder architecture scale with sequence length?
A) Self-attention has O(n²) complexity, so computation grows quadratically with sequence length. This limits practical sequence lengths, leading to techniques like sparse attention or chunking for longer sequences
B) Linear scaling
C) Constant time
D) No scaling issues
Question 12: What tasks are encoder-only models best suited for?
A) Understanding tasks like classification, named entity recognition, question answering, sentiment analysis, where you need to process and understand input rather than generate new text
B) Only generation
C) Only translation
D) All tasks equally