Chapter 8: Decoder Architecture
Generating Sequences
Learning Objectives
- Understand decoder architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Decoder Architecture
What is the Decoder?
The decoder generates output sequences one token at a time, using both the encoder's output and the tokens it has already generated. Unlike the encoder, which processes the entire input at once, the decoder generates autoregressively (left to right).
Think of the decoder like writing a translation:
- Encoder: Reads and understands the entire source sentence
- Decoder: Writes the translation word by word, using both the source understanding and what it has written so far
- Key constraint: Can only see previous words when generating (causal masking)
Decoder vs Encoder
Encoder (Bidirectional)
- Processes entire input at once
- Can attend to all positions (left and right)
- Used for understanding tasks
- Example: BERT, classification models
Decoder (Causal/Autoregressive)
- Generates one token at a time
- Can only attend to previous positions (causal)
- Used for generation tasks
- Example: GPT, translation, summarization
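To make the contrast concrete, here is a minimal NumPy sketch (illustrative only) comparing the attention pattern an encoder allows with the causal pattern a decoder enforces for a length-3 sequence:
import numpy as np

seq_len = 3

# Encoder: bidirectional attention - every position may attend to every position
encoder_mask = np.ones((seq_len, seq_len))

# Decoder: causal attention - position i may attend only to positions j <= i
decoder_mask = np.tril(np.ones((seq_len, seq_len)))

print("Encoder mask:\n", encoder_mask)
print("Decoder mask:\n", decoder_mask)
The decoder mask is lower triangular: the zeros above the diagonal are exactly the future positions that must stay hidden during generation.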
📚 Real-World Analogy: Writing a Story
Imagine writing a story based on a prompt:
- Encoder: Reads and understands the prompt completely
- Decoder Step 1: Generates first word using prompt understanding
- Decoder Step 2: Generates second word using prompt + first word
- Decoder Step 3: Generates third word using prompt + first two words
- Result: Complete story generated word by word
Key Concepts
🔑 Decoder Layer Components
Each decoder layer has THREE sublayers (vs encoder's two):
1. Masked Multi-Head Self-Attention
- Self-attention over decoder's own output
- Causal masking: Can only attend to previous positions
- Prevents "cheating" by looking at future tokens
- Like reading a book - you can only see pages you've already read
2. Encoder-Decoder Attention
- Attention from decoder to encoder output
- Query from decoder, Key/Value from encoder
- Allows decoder to "look back" at source
- Like a translator looking at the source text while writing
3. Feed-Forward Network
- Same as encoder FFN
- Processes information at each position
- Adds non-linearity and capacity
🚫 Causal Masking Explained
Causal masking ensures the decoder can't see future tokens:
Example: Generating "Hello world"
When generating position 1 ("Hello"):
- Can attend to: position 0 (start token)
- Cannot attend to: position 2 ("world") - hasn't been generated yet!
When generating position 2 ("world"):
- Can attend to: position 0 (start), position 1 ("Hello")
- Cannot attend to: future positions
Mask Matrix
For sequence of length 3:
Attention Mask:
[1 0 0]  ← Position 0 can only see itself
[1 1 0]  ← Position 1 can see 0 and 1
[1 1 1]  ← Position 2 can see 0, 1, and 2
0 = masked (cannot attend), 1 = allowed
Mathematical Formulations
Causal Masking
\[\text{Mask}[i, j] = \begin{cases} 1 & \text{if } j \leq i \\ 0 & \text{if } j > i \end{cases}\]
Meaning:
- Position i can attend to position j only if j ≤ i
- Prevents attending to future positions
- Applied before softmax in attention
Masked Attention Scores
\[\text{scores}_{\text{masked}} = \text{scores} + (1 - \text{mask}) \times (-\infty)\]
\[\text{attention} = \text{softmax}(\text{scores}_{\text{masked}})\]
How It Works:
- Add -∞ to masked positions
- After softmax: masked positions → 0 (exp(-∞) = 0)
- Result: Cannot attend to future positions
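As a tiny worked example with made-up score values: suppose a position has raw scores of 2 and 1 over the two positions it is allowed to see, and a raw score of 3 over a future position that must be masked:
\[\text{scores} = [2,\ 1,\ 3] \quad\rightarrow\quad \text{scores}_{\text{masked}} = [2,\ 1,\ -\infty]\]
\[\text{softmax}([2,\ 1,\ -\infty]) \approx [0.73,\ 0.27,\ 0]\]
The future position receives zero attention weight no matter how large its raw score was.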
Decoder Layer Formula
\[h_1 = x + \text{MaskedSelfAttention}(\text{LayerNorm}(x))\]
\[h_2 = h_1 + \text{EncDecAttention}(\text{LayerNorm}(h_1), \text{encoder\_output})\]
\[\text{DecoderLayer}(x, \text{encoder\_output}) = h_2 + \text{FFN}(\text{LayerNorm}(h_2))\]
Three Sublayers:
- MaskedSelfAttention: Causal self-attention
- EncDecAttention: Attention to encoder output
- FFN: Feed-forward network
Detailed Examples
Example: Generating Translation
Translating "The cat" → "Le chat" (French):
Step 1: Encoder Processes Source
- Encoder reads: "The cat"
- Creates rich representations for both words
- Output: Encoder representations ready for decoder
Step 2: Decoder Generates First Word
- Input: [START] token
- Masked self-attention: Only sees [START] (causal)
- Encoder-decoder attention: Looks at "The cat" from encoder
- FFN: Processes combined information
- Output: "Le" (first word generated)
Step 3: Decoder Generates Second Word
- Input: [START] "Le"
- Masked self-attention: Can see [START] and "Le" (but not future)
- Encoder-decoder attention: Still looks at "The cat"
- FFN: Processes information
- Output: "chat" (second word generated)
Implementation
Causal Masking Implementation
import numpy as np

def create_causal_mask(seq_len):
    """
    Create causal mask for decoder

    Parameters:
        seq_len: Sequence length

    Returns:
        Mask matrix (seq_len, seq_len) where 1 = allowed, 0 = masked
    """
    # Upper triangle (above the diagonal) marks the future positions
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    # Invert: 1 = allowed, 0 = masked
    mask = 1 - mask
    return mask

def apply_causal_mask(scores, mask):
    """
    Apply causal mask to attention scores

    Parameters:
        scores: Attention scores (batch, seq_len, seq_len)
        mask: Causal mask (seq_len, seq_len)

    Returns:
        Masked scores
    """
    # Add a large negative number (≈ -infinity) to masked positions
    # so they become 0 after softmax
    masked_scores = scores + (1 - mask) * (-1e9)
    return masked_scores

# Example: Causal mask for sequence of length 4
seq_len = 4
causal_mask = create_causal_mask(seq_len)
print("Causal Mask:")
print(causal_mask)
# Output:
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
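Continuing the example, here is a short illustrative check that apply_causal_mask really zeroes out attention to future positions once softmax is applied; the random scores and the softmax helper below are only for demonstration.
def softmax(x, axis=-1):
    """Numerically stable softmax for the demonstration."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random attention scores for a batch of 1, sequence length 4
scores = np.random.randn(1, seq_len, seq_len)
masked_scores = apply_causal_mask(scores, causal_mask)
attention_weights = softmax(masked_scores)

print("Attention weights (each row sums to 1, zeros above the diagonal):")
print(np.round(attention_weights[0], 2))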
Decoder Layer Implementation
import numpy as np

class LayerNormalization:
    """Minimal layer normalization.

    Assumed to be available from the encoder chapter; a minimal version is
    included here so the example runs stand-alone.
    """
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)    # learnable scale
        self.beta = np.zeros(d_model)    # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class DecoderLayer:
    """Single decoder layer (pre-LayerNorm formulation)"""
    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Layer normalizations (3 sublayers = 3 layer norms)
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)
        self.layer_norm3 = LayerNormalization(d_model)

    def forward(self, x, encoder_output, masked_attention_fn, encdec_attention_fn, ffn_fn):
        """
        Forward pass through decoder layer

        Parameters:
            x: Decoder input (batch, seq_len, d_model)
            encoder_output: Encoder output (batch, enc_seq_len, d_model)
            masked_attention_fn: Function for masked self-attention
            encdec_attention_fn: Function for encoder-decoder attention
            ffn_fn: Function for FFN

        Returns:
            Decoder layer output (batch, seq_len, d_model)
        """
        # Sublayer 1: Masked self-attention (with residual connection)
        x_norm1 = self.layer_norm1.forward(x)
        masked_attn_output = masked_attention_fn(x_norm1)
        x = x + masked_attn_output

        # Sublayer 2: Encoder-decoder attention (with residual connection)
        x_norm2 = self.layer_norm2.forward(x)
        encdec_attn_output = encdec_attention_fn(x_norm2, encoder_output)
        x = x + encdec_attn_output

        # Sublayer 3: FFN (with residual connection)
        x_norm3 = self.layer_norm3.forward(x)
        ffn_output = ffn_fn(x_norm3)
        x = x + ffn_output
        return x

# Example usage
d_model, num_heads, d_ff = 512, 8, 2048
decoder_layer = DecoderLayer(d_model, num_heads, d_ff)

# Decoder input: (batch=2, dec_seq_len=5, d_model=512)
decoder_input = np.random.randn(2, 5, 512)
# Encoder output: (batch=2, enc_seq_len=10, d_model=512)
encoder_output = np.random.randn(2, 10, 512)

print("Decoder layer created")
print(f"Decoder input shape: {decoder_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
Real-World Applications
Decoder-Only Models
Decoder architecture is used in generation models:
1. GPT Models (Decoder-Only)
- GPT-1, GPT-2, GPT-3, GPT-4
- Use decoder layers (without encoder-decoder attention)
- Autoregressive text generation
- Revolutionary for language modeling
2. Machine Translation
- Encoder-decoder architecture
- Encoder processes source language
- Decoder generates target language
- State-of-the-art translation systems
3. Text Summarization
- Encoder reads long document
- Decoder generates summary
- Autoregressive generation
Causal Masking Importance
Why causal masking is critical:
- Prevents cheating: Model can't use future information
- Realistic generation: Mimics how humans generate text
- Training consistency: Training and inference match
- Without masking: Model would learn to "cheat" and fail at inference
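The "training consistency" point is worth making concrete. During training the decoder is typically given the whole target sequence shifted right by one position (teacher forcing, covered in Question 10 below), and the causal mask lets every position be trained in parallel while still seeing only its past. A minimal sketch, using a hypothetical token sequence:
# Hypothetical target sequence: [START] Le chat [END]
target = ["<start>", "Le", "chat", "<end>"]

# Teacher forcing: decoder input is the target shifted right by one position
decoder_inputs = target[:-1]     # ["<start>", "Le", "chat"]
decoder_targets = target[1:]     # ["Le", "chat", "<end>"]

# With the causal mask, position i sees decoder_inputs[0..i] and must predict
# decoder_targets[i], so all positions train in parallel without seeing the future.
for i, predict in enumerate(decoder_targets):
    print(f"position {i}: sees {decoder_inputs[:i+1]} -> predicts {predict!r}")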
Test Your Understanding
Question 1: How many sublayers does each decoder layer have?
A) 1
B) 2
C) 3 (Masked self-attention, encoder-decoder attention, FFN)
D) 4
Question 2: What is the purpose of causal masking in the decoder?
A) To prevent the decoder from seeing future tokens during generation
B) To reduce computation
C) To normalize activations
D) To add non-linearity
Question 3: What is encoder-decoder attention used for?
A) To allow the decoder to attend to encoder output (source information)
B) To normalize the encoder output
C) To generate tokens
D) To mask future positions
Question 4: How does the decoder generate output autoregressively?
A) Decoder generates tokens one at a time, using previously generated tokens as input, with causal masking ensuring it only sees past tokens, continuing until end token or max length
B) All tokens at once
C) Random tokens
D) Only first token
Question 5: What is the difference between decoder self-attention and cross-attention?
A) Self-attention attends to decoder's own previous tokens (with causal mask), while cross-attention lets decoder queries attend to encoder outputs, allowing decoder to focus on relevant input parts
B) They're the same
C) Self-attention uses encoder
D) Cross-attention is self-attention
Question 6: Why is causal masking necessary in decoder?
A) Causal masking prevents the model from cheating during training by seeing future tokens, ensuring it learns to generate based only on past context, which matches inference behavior
B) It's not necessary
C) To make it faster
D) To reduce memory
Question 7: What are the main components of a decoder layer?
A) Masked multi-head self-attention, cross-attention (encoder-decoder attention), feed-forward network, residual connections, and layer normalization
B) Only self-attention
C) Only cross-attention
D) Random components
Question 8: How does decoder-only architecture differ from encoder-decoder?
A) Decoder-only (like GPT) uses only decoder layers with causal masking, suitable for generation tasks. Encoder-decoder uses both encoder and decoder, better for tasks requiring understanding input and generating output
B) They're the same
C) Decoder-only has encoder
D) No difference
Question 9: How would you implement causal masking?
A) Create a lower triangular mask matrix (1s for allowed positions, -inf for masked), add to attention scores before softmax. This ensures positions can only attend to themselves and previous positions
B) Just mask all
C) No masking needed
D) Random masking
Question 10: What is teacher forcing in decoder training?
A) During training, decoder receives ground truth previous tokens as input instead of its own predictions, making training faster and more stable, though this creates train-test mismatch
B) It uses predictions
C) It's not used
D) Only in inference
Question 11: How does the decoder use encoder information via cross-attention?
A) Decoder queries attend to encoder keys and values, computing attention weights that determine which encoder positions are most relevant for generating each decoder token, creating a weighted combination of encoder outputs
B) It copies encoder directly
C) It ignores encoder
D) It averages encoder
Question 12: What tasks are decoder-only models best suited for?
A) Generation tasks like text completion, story generation, code generation, where you generate text from scratch or continue given context, without needing to understand structured input
B) Only classification
C) Only translation
D) All tasks equally