Chapter 8: Decoder Architecture
Decoder Architecture in Transformer Architecture Deep Dive.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the three sublayers that make up a transformer decoder layer.
- Trace how the decoder generates an output sequence one token at a time.
- Recognize the implementation trade-offs behind transformer architectures.
Generating Sequences
What is the Decoder?
The decoder generates output sequences one token at a time, using both the encoder's output and the tokens it has already generated. Unlike the encoder, which processes the entire input at once, the decoder generates autoregressively (left to right): each new token is conditioned on all the tokens before it.
Think of the decoder like writing a translation:
- Encoder: Reads and understands the entire source sentence
- Decoder: Writes the translation word by word, using both the source understanding and what it has written so far
- Key constraint: Can only see previous words when generating (causal masking)
Decoder vs Encoder
Encoder (Bidirectional)
- Processes entire input at once
- Can attend to all positions (left and right)
- Used for understanding tasks
- Example: BERT, classification models
Decoder (Causal/Autoregressive)
- Generates one token at a time
- Can only attend to previous positions (causal)
- Used for generation tasks
- Example: GPT, translation, summarization
📚 Real-World Analogy: Writing a Story
Imagine writing a story based on a prompt:
- Encoder: Reads and understands the prompt completely
- Decoder Step 1: Generates first word using prompt understanding
- Decoder Step 2: Generates second word using prompt + first word
- Decoder Step 3: Generates third word using prompt + first two words
- Result: Complete story generated word by word
Key Concepts
🔑 Decoder Layer Components
Each decoder layer has THREE sublayers (vs encoder's two):
1. Masked Multi-Head Self-Attention
- Self-attention over decoder's own output
- Causal masking: Can only attend to previous positions
- Prevents "cheating" by looking at future tokens
- Like reading a book - you can only see pages you've already read
2. Encoder-Decoder Attention
- Attention from decoder to encoder output
- Query from decoder, Key/Value from encoder
- Allows decoder to "look back" at source
- Like a translator looking at the source text while writing
3. Feed-Forward Network
- Same as encoder FFN
- Processes information at each position
- Adds non-linearity and capacity
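The second sublayer is the one that differs most from the encoder, so a small sketch may help. The following is a minimal single-head version, assuming NumPy; the function name cross_attention and the weight matrices W_q, W_k, W_v are illustrative names, not part of this chapter's code. It only shows where the queries, keys, and values come from and what shapes result.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Single-head encoder-decoder attention (illustrative sketch only)."""
    Q = decoder_states @ W_q   # (batch, dec_len, d_k): queries come from the decoder
    K = encoder_output @ W_k   # (batch, enc_len, d_k): keys come from the encoder
    V = encoder_output @ W_v   # (batch, enc_len, d_k): values come from the encoder
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, dec_len, enc_len)
    weights = softmax(scores, axis=-1)                  # no causal mask: the whole source is visible
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 16, 8   # small sizes chosen only for the demo
decoder_states = rng.standard_normal((2, 5, d_model))    # batch=2, dec_len=5
encoder_output = rng.standard_normal((2, 10, d_model))   # batch=2, enc_len=10
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(context.shape)   # (2, 5, 8)
print(weights.shape)   # (2, 5, 10)
```

Each decoder position gets its own distribution over all encoder positions, which is why the attention weights have shape (batch, dec_len, enc_len).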
🚫 Causal Masking Explained
Causal masking ensures the decoder can't see future tokens:
Example: Generating "Hello world"
When generating position 1 ("Hello"), only the start token exists so far:
- Can attend to: position 0 (start token)
- Cannot attend to: position 2 ("world"), which hasn't been generated yet
When generating position 2 ("world"):
- Can attend to: position 0 (start), position 1 ("Hello")
- Cannot attend to: any later positions
Mask Matrix
For a sequence of length 3 (each row is the attending position, each column the position attended to):
Attention Mask:
[1 0 0] ← Position 0 can only see itself
[1 1 0] ← Position 1 can see 0 and 1
[1 1 1] ← Position 2 can see 0, 1, and 2
0 = masked (cannot attend), 1 = allowed
Mathematical Formulations
Causal Masking
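The rule can be written as a binary mask. The symbol M is notation assumed here, chosen to match the "1 = allowed, 0 = masked" convention used in this chapter:

```latex
M_{ij} =
\begin{cases}
1 & \text{if } j \le i \quad \text{(allowed)} \\
0 & \text{if } j > i \quad \text{(masked)}
\end{cases}
```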
Meaning:
- Position i can attend to position j only if j ≤ i
- Prevents attending to future positions
- Applied before softmax in attention
Masked Attention Scores
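One way to write the masked scores, assuming the usual scaled dot-product attention with queries Q, keys K, values V, and key dimension d_k (S and its masked version are notation introduced here for illustration):

```latex
S = \frac{Q K^{\top}}{\sqrt{d_k}}, \qquad
\tilde{S}_{ij} =
\begin{cases}
S_{ij} & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}, \qquad
\text{Attention}(Q, K, V) = \text{softmax}(\tilde{S})\, V
```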
How It Works:
- Add -∞ to masked positions
- After softmax: masked positions → 0 (exp(-∞) = 0)
- Result: Cannot attend to future positions
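A small numeric check can make this concrete. The sketch below assumes NumPy; masked_softmax is a helper name made up for this illustration. It uses equal raw scores so the effect of the mask is easy to read off.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, with mask == 0 positions set to -inf first."""
    masked = np.where(mask == 1, scores, -np.inf)
    # Stable softmax: every row has at least one allowed entry, so the max is finite
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)   # exp(-inf) evaluates to 0.0
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 3
mask = np.tril(np.ones((seq_len, seq_len)))   # 1 = allowed, 0 = masked
scores = np.zeros((seq_len, seq_len))          # equal raw scores for every pair
weights = masked_softmax(scores, mask)
print(weights)
# Row 0 -> [1, 0, 0], row 1 -> [0.5, 0.5, 0], row 2 -> [1/3, 1/3, 1/3]
```

Every masked entry ends up with weight exactly 0, and the remaining weight in each row is shared only among the current and earlier positions.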
Decoder Layer Formula
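A sketch of the layer as three residual steps, written to match the arrangement in this chapter's DecoderLayer code, where layer normalization is applied before each sublayer. The names h_1, h_2, and enc (the encoder output) are assumed for illustration. Note that the original Transformer paper instead applies layer normalization after each residual addition.

```latex
\begin{aligned}
h_1 &= x + \text{MaskedSelfAttention}(\text{LayerNorm}(x)) \\
h_2 &= h_1 + \text{EncDecAttention}(\text{LayerNorm}(h_1),\ \text{enc}) \\
\text{out} &= h_2 + \text{FFN}(\text{LayerNorm}(h_2))
\end{aligned}
```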
Three Sublayers:
- MaskedSelfAttention: Causal self-attention
- EncDecAttention: Attention to encoder output
- FFN: Feed-forward network
Detailed Examples
Example: Generating Translation
Translating "The cat" → "Le chat" (French):
Step 1: Encoder Processes Source
- Encoder reads: "The cat"
- Creates rich representations for both words
- Output: Encoder representations ready for decoder
Step 2: Decoder Generates First Word
- Input: [START] token
- Masked self-attention: Only sees [START] (causal)
- Encoder-decoder attention: Looks at "The cat" from encoder
- FFN: Processes combined information
- Output: "Le" (first word generated)
Step 3: Decoder Generates Second Word
- Input: [START] "Le"
- Masked self-attention: Can see [START] and "Le" (but not future)
- Encoder-decoder attention: Still looks at "The cat"
- FFN: Processes information
- Output: "chat" (second word generated)
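The steps above follow a simple loop, sketched below. This is a toy illustration, not a real translator: toy_decoder_step is a hypothetical stand-in for a trained decoder and simply looks up each source word, which only works because this example happens to translate word for word.

```python
START, END = "[START]", "[END]"

def toy_decoder_step(source_tokens, generated):
    """Hypothetical stand-in for a trained decoder: returns the next target token.

    A real decoder would run masked self-attention over `generated`,
    encoder-decoder attention over the encoded source, and an FFN,
    then choose a token from the resulting output distribution.
    """
    lookup = {"The": "Le", "cat": "chat"}
    position = len(generated) - 1   # number of target words produced so far
    if position < len(source_tokens):
        return lookup[source_tokens[position]]
    return END

def greedy_decode(source_tokens, max_len=10):
    generated = [START]   # the first step sees only the start token
    for _ in range(max_len):
        next_token = toy_decoder_step(source_tokens, generated)
        if next_token == END:
            break
        generated.append(next_token)   # the new token becomes input for the next step
    return generated[1:]   # drop [START]

print(greedy_decode(["The", "cat"]))   # ['Le', 'chat']
```

The important part is the loop structure: each generated token is appended to the decoder input before the next step, while the source stays fixed throughout.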
Implementation
Causal Masking Implementation
import numpy as np
def create_causal_mask(seq_len):
    """
    Create causal mask for decoder

    Parameters:
        seq_len: Sequence length

    Returns:
        Mask matrix (seq_len, seq_len) where 1 = allowed, 0 = masked
    """
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    # Invert: 1 = allowed, 0 = masked
    mask = 1 - mask
    return mask
def apply_causal_mask(scores, mask):
    """
    Apply causal mask to attention scores

    Parameters:
        scores: Attention scores (batch, seq_len, seq_len)
        mask: Causal mask (seq_len, seq_len), broadcast over the batch dimension

    Returns:
        Masked scores
    """
    # Add a large negative value (a finite stand-in for -infinity) at masked
    # positions so that softmax assigns them a weight of effectively zero
    masked_scores = scores + (1 - mask) * (-1e9)
    return masked_scores
# Example: Causal mask for sequence of length 4
seq_len = 4
causal_mask = create_causal_mask(seq_len)
print("Causal Mask:")
print(causal_mask)
# Output:
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
Decoder Layer Implementation
import numpy as np


class LayerNormalization:
    """Minimal stand-in (assumption): this chapter uses the class without defining it."""

    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # learnable scale
        self.beta = np.zeros(d_model)   # learnable shift
        self.eps = eps

    def forward(self, x):
        # Normalize each position over the feature dimension
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta


class DecoderLayer:
    """Single decoder layer"""

    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Layer normalizations (3 sublayers = 3 layer norms)
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)
        self.layer_norm3 = LayerNormalization(d_model)

    def forward(self, x, encoder_output, masked_attention_fn, encdec_attention_fn, ffn_fn):
        """
        Forward pass through decoder layer

        Parameters:
            x: Decoder input (batch, seq_len, d_model)
            encoder_output: Encoder output (batch, enc_seq_len, d_model)
            masked_attention_fn: Function for masked self-attention
            encdec_attention_fn: Function for encoder-decoder attention
            ffn_fn: Function for FFN
        """
        # Each sublayer: normalize, apply the sublayer, add the result back (residual)
        # Sublayer 1: Masked self-attention
        x_norm1 = self.layer_norm1.forward(x)
        masked_attn_output = masked_attention_fn(x_norm1)
        x = x + masked_attn_output
        # Sublayer 2: Encoder-decoder attention
        x_norm2 = self.layer_norm2.forward(x)
        encdec_attn_output = encdec_attention_fn(x_norm2, encoder_output)
        x = x + encdec_attn_output
        # Sublayer 3: FFN
        x_norm3 = self.layer_norm3.forward(x)
        ffn_output = ffn_fn(x_norm3)
        x = x + ffn_output
        return x
# Example usage
d_model, num_heads, d_ff = 512, 8, 2048
decoder_layer = DecoderLayer(d_model, num_heads, d_ff)
# Decoder input: (batch=2, dec_seq_len=5, d_model=512)
decoder_input = np.random.randn(2, 5, 512)
# Encoder output: (batch=2, enc_seq_len=10, d_model=512)
encoder_output = np.random.randn(2, 10, 512)
print("Decoder layer created")
print(f"Decoder input shape: {decoder_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
Real-World Applications
Decoder-Only Models
Decoder layers appear in both decoder-only and encoder-decoder generation models:
1. GPT Models (Decoder-Only)
- GPT-1, GPT-2, GPT-3, GPT-4
- Use decoder layers without encoder-decoder attention, so each layer has two sublayers: masked self-attention and FFN
- Autoregressive text generation
- Revolutionary for language modeling
2. Machine Translation
- Encoder-decoder architecture
- Encoder processes source language
- Decoder generates target language
- State-of-the-art translation systems
3. Text Summarization
- Encoder reads long document
- Decoder generates summary
- Autoregressive generation
Causal Masking Importance
Why causal masking is critical:
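- Prevents cheating: During training the whole target sequence is fed in at once, so without the mask each position could read the very token it is supposed to predict
- Realistic generation: At inference time future tokens do not exist yet, and the mask reproduces that condition
- Training consistency: Each position sees the same context in training as it does at inference
- Without masking: Model would learn to "cheat" by copying future tokens and then fail at inference, where those tokens are unavailable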
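The training-consistency point can be checked directly. The sketch below, assuming NumPy, uses a deliberately simplified single-head self-attention with no learned projections (causal_self_attention is a name made up for this illustration). It compares the outputs for a full sequence with the outputs for just its first three tokens.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention without learned projections (illustrative)."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((seq_len, seq_len)))        # 1 = allowed, 0 = masked
    scores = np.where(mask == 1, scores, -np.inf)      # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
full = rng.standard_normal((5, 8))   # 5 tokens, d_model = 8
prefix = full[:3]                     # only the first 3 tokens, as at inference time

out_full = causal_self_attention(full)
out_prefix = causal_self_attention(prefix)

# With the causal mask, outputs at positions 0..2 do not depend on tokens 3..4
print(np.allclose(out_full[:3], out_prefix))   # True
```

Because the first three outputs are the same whether or not the later tokens are present, processing the whole target sequence in parallel during training gives each position the same result it would get during step-by-step generation.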