Chapter 8: Decoder Architecture
Generating Sequences
Learning Objectives
- Understand decoder architecture fundamentals
- Master the mathematical foundations
- Learn practical implementation
- Apply knowledge through examples
- Recognize real-world applications
Decoder Architecture
What is the Decoder?
The decoder generates output sequences one token at a time, using both the encoder's output and the tokens it has already generated. Unlike the encoder, which processes the entire input at once, the decoder generates autoregressively (left to right).
Think of the decoder like writing a translation:
- Encoder: Reads and understands the entire source sentence
- Decoder: Writes the translation word by word, using both the source understanding and what it has written so far
- Key constraint: Can only see previous words when generating (causal masking)
Decoder vs Encoder
Encoder (Bidirectional)
- Processes entire input at once
- Can attend to all positions (left and right)
- Used for understanding tasks
- Example: BERT, classification models
Decoder (Causal/Autoregressive)
- Generates one token at a time
- Can only attend to previous positions (causal)
- Used for generation tasks
- Example: GPT, translation, summarization
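To make the contrast concrete, here is a minimal NumPy sketch (illustrative only) comparing the attention pattern an encoder allows with the causal pattern a decoder enforces for a length-3 sequence:
import numpy as np

seq_len = 3

# Encoder: bidirectional attention - every position may attend to every position
encoder_mask = np.ones((seq_len, seq_len))

# Decoder: causal attention - position i may attend only to positions j <= i
decoder_mask = np.tril(np.ones((seq_len, seq_len)))

print("Encoder mask:\n", encoder_mask)
print("Decoder mask:\n", decoder_mask)
The decoder mask is lower triangular: the zeros above the diagonal are exactly the future positions that must stay hidden during generation.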
📚 Real-World Analogy: Writing a Story
Imagine writing a story based on a prompt:
- Encoder: Reads and understands the prompt completely
- Decoder Step 1: Generates first word using prompt understanding
- Decoder Step 2: Generates second word using prompt + first word
- Decoder Step 3: Generates third word using prompt + first two words
- Result: Complete story generated word by word
Key Concepts
🔑 Decoder Layer Components
Each decoder layer has THREE sublayers (vs encoder's two):
1. Masked Multi-Head Self-Attention
- Self-attention over decoder's own output
- Causal masking: Can only attend to previous positions
- Prevents "cheating" by looking at future tokens
- Like reading a book - you can only see pages you've already read
2. Encoder-Decoder Attention
- Attention from decoder to encoder output
- Query from decoder, Key/Value from encoder
- Allows decoder to "look back" at source
- Like a translator looking at the source text while writing
3. Feed-Forward Network
- Same as encoder FFN
- Processes information at each position
- Adds non-linearity and capacity
🚫 Causal Masking Explained
Causal masking ensures the decoder can't see future tokens:
Example: Generating "Hello world"
When generating position 1 ("Hello"):
- Can attend to: position 0 (start token)
- Cannot attend to: position 2 ("world") - hasn't been generated yet!
When generating position 2 ("world"):
- Can attend to: position 0 (start), position 1 ("Hello")
- Cannot attend to: future positions
Mask Matrix
For sequence of length 3:
Attention Mask:
[1 0 0]  ← Position 0 can only see itself
[1 1 0]  ← Position 1 can see 0 and 1
[1 1 1]  ← Position 2 can see 0, 1, and 2
0 = masked (cannot attend), 1 = allowed
Mathematical Formulations
Causal Masking
\[\text{Mask}[i, j] = \begin{cases} 1 & \text{if } j \leq i \\ 0 & \text{if } j > i \end{cases}\]
Meaning:
- Position i can attend to position j only if j ≤ i
- Prevents attending to future positions
- Applied before softmax in attention
Masked Attention Scores
\[\text{scores}_{\text{masked}} = \text{scores} + (1 - \text{mask}) \times (-\infty)\]
\[\text{attention} = \text{softmax}(\text{scores}_{\text{masked}})\]
How It Works:
- Add -∞ to masked positions
- After softmax: masked positions → 0 (exp(-∞) = 0)
- Result: Cannot attend to future positions
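As a tiny worked example with made-up score values: suppose a position has raw scores of 2 and 1 over the two positions it is allowed to see, and a raw score of 3 over a future position that must be masked:
\[\text{scores} = [2,\ 1,\ 3] \quad\rightarrow\quad \text{scores}_{\text{masked}} = [2,\ 1,\ -\infty]\]
\[\text{softmax}([2,\ 1,\ -\infty]) \approx [0.73,\ 0.27,\ 0]\]
The future position receives zero attention weight no matter how large its raw score was.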
Decoder Layer Formula
\[h_1 = x + \text{MaskedSelfAttention}(\text{LayerNorm}(x))\]
\[h_2 = h_1 + \text{EncDecAttention}(\text{LayerNorm}(h_1), \text{encoder\_output})\]
\[\text{DecoderLayer}(x, \text{encoder\_output}) = h_2 + \text{FFN}(\text{LayerNorm}(h_2))\]
Three Sublayers:
- MaskedSelfAttention: Causal self-attention
- EncDecAttention: Attention to encoder output
- FFN: Feed-forward network
Detailed Examples
Example: Generating Translation
Translating "The cat" → "Le chat" (French):
Step 1: Encoder Processes Source
- Encoder reads: "The cat"
- Creates rich representations for both words
- Output: Encoder representations ready for decoder
Step 2: Decoder Generates First Word
- Input: [START] token
- Masked self-attention: Only sees [START] (causal)
- Encoder-decoder attention: Looks at "The cat" from encoder
- FFN: Processes combined information
- Output: "Le" (first word generated)
Step 3: Decoder Generates Second Word
- Input: [START] "Le"
- Masked self-attention: Can see [START] and "Le" (but not future)
- Encoder-decoder attention: Still looks at "The cat"
- FFN: Processes information
- Output: "chat" (second word generated)
Implementation
Causal Masking Implementation
import numpy as np

def create_causal_mask(seq_len):
    """
    Create causal mask for decoder

    Parameters:
        seq_len: Sequence length

    Returns:
        Mask matrix (seq_len, seq_len) where 1 = allowed, 0 = masked
    """
    # Upper triangle (above the diagonal) marks the future positions
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    # Invert: 1 = allowed, 0 = masked
    mask = 1 - mask
    return mask

def apply_causal_mask(scores, mask):
    """
    Apply causal mask to attention scores

    Parameters:
        scores: Attention scores (batch, seq_len, seq_len)
        mask: Causal mask (seq_len, seq_len)

    Returns:
        Masked scores
    """
    # Add a large negative number (≈ -infinity) to masked positions
    # so they become 0 after softmax
    masked_scores = scores + (1 - mask) * (-1e9)
    return masked_scores

# Example: Causal mask for sequence of length 4
seq_len = 4
causal_mask = create_causal_mask(seq_len)
print("Causal Mask:")
print(causal_mask)
# Output:
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
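Continuing the example, here is a short illustrative check that apply_causal_mask really zeroes out attention to future positions once softmax is applied; the random scores and the softmax helper below are only for demonstration.
def softmax(x, axis=-1):
    """Numerically stable softmax for the demonstration."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random attention scores for a batch of 1, sequence length 4
scores = np.random.randn(1, seq_len, seq_len)
masked_scores = apply_causal_mask(scores, causal_mask)
attention_weights = softmax(masked_scores)

print("Attention weights (each row sums to 1, zeros above the diagonal):")
print(np.round(attention_weights[0], 2))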
Decoder Layer Implementation
import numpy as np

class LayerNormalization:
    """Minimal layer normalization.

    Assumed to be available from the encoder chapter; a minimal version is
    included here so the example runs stand-alone.
    """
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)    # learnable scale
        self.beta = np.zeros(d_model)    # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class DecoderLayer:
    """Single decoder layer (pre-LayerNorm formulation)"""
    def __init__(self, d_model, num_heads, d_ff):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_ff = d_ff
        # Layer normalizations (3 sublayers = 3 layer norms)
        self.layer_norm1 = LayerNormalization(d_model)
        self.layer_norm2 = LayerNormalization(d_model)
        self.layer_norm3 = LayerNormalization(d_model)

    def forward(self, x, encoder_output, masked_attention_fn, encdec_attention_fn, ffn_fn):
        """
        Forward pass through decoder layer

        Parameters:
            x: Decoder input (batch, seq_len, d_model)
            encoder_output: Encoder output (batch, enc_seq_len, d_model)
            masked_attention_fn: Function for masked self-attention
            encdec_attention_fn: Function for encoder-decoder attention
            ffn_fn: Function for FFN

        Returns:
            Decoder layer output (batch, seq_len, d_model)
        """
        # Sublayer 1: Masked self-attention (with residual connection)
        x_norm1 = self.layer_norm1.forward(x)
        masked_attn_output = masked_attention_fn(x_norm1)
        x = x + masked_attn_output

        # Sublayer 2: Encoder-decoder attention (with residual connection)
        x_norm2 = self.layer_norm2.forward(x)
        encdec_attn_output = encdec_attention_fn(x_norm2, encoder_output)
        x = x + encdec_attn_output

        # Sublayer 3: FFN (with residual connection)
        x_norm3 = self.layer_norm3.forward(x)
        ffn_output = ffn_fn(x_norm3)
        x = x + ffn_output
        return x

# Example usage
d_model, num_heads, d_ff = 512, 8, 2048
decoder_layer = DecoderLayer(d_model, num_heads, d_ff)

# Decoder input: (batch=2, dec_seq_len=5, d_model=512)
decoder_input = np.random.randn(2, 5, 512)
# Encoder output: (batch=2, enc_seq_len=10, d_model=512)
encoder_output = np.random.randn(2, 10, 512)

print("Decoder layer created")
print(f"Decoder input shape: {decoder_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
Real-World Applications
Decoder-Only Models
Decoder architecture is used in generation models:
1. GPT Models (Decoder-Only)
- GPT-1, GPT-2, GPT-3, GPT-4
- Use decoder layers (without encoder-decoder attention)
- Autoregressive text generation
- Revolutionary for language modeling
2. Machine Translation
- Encoder-decoder architecture
- Encoder processes source language
- Decoder generates target language
- State-of-the-art translation systems
3. Text Summarization
- Encoder reads long document
- Decoder generates summary
- Autoregressive generation
Causal Masking Importance
Why causal masking is critical:
- Prevents cheating: Model can't use future information
- Realistic generation: Mimics how humans generate text
- Training consistency: Training and inference match
- Without masking: Model would learn to "cheat" and fail at inference
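The "training consistency" point is worth making concrete. During training the decoder is typically given the whole target sequence shifted right by one position (teacher forcing, covered in Question 10 below), and the causal mask lets every position be trained in parallel while still seeing only its past. A minimal sketch, using a hypothetical token sequence:
# Hypothetical target sequence: [START] Le chat [END]
target = ["<start>", "Le", "chat", "<end>"]

# Teacher forcing: decoder input is the target shifted right by one position
decoder_inputs = target[:-1]     # ["<start>", "Le", "chat"]
decoder_targets = target[1:]     # ["Le", "chat", "<end>"]

# With the causal mask, position i sees decoder_inputs[0..i] and must predict
# decoder_targets[i], so all positions train in parallel without seeing the future.
for i, predict in enumerate(decoder_targets):
    print(f"position {i}: sees {decoder_inputs[:i+1]} -> predicts {predict!r}")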
Test Your Understanding
Question 1: How many sublayers does each decoder layer have?
A) 1
B) 2
C) 3 (Masked self-attention, encoder-decoder attention, FFN)
D) 4
Question 2: What is the purpose of causal masking in the decoder?
A) To prevent the decoder from seeing future tokens during generation
B) To reduce computation
C) To normalize activations
D) To add non-linearity
Question 3: What is encoder-decoder attention used for?
A) To allow the decoder to attend to encoder output (source information)
B) To normalize the encoder output
C) To generate tokens
D) To mask future positions
Question 4: How does the decoder generate output autoregressively?
A) Decoder generates tokens one at a time, using previously generated tokens as input, with causal masking ensuring it only sees past tokens, continuing until end token or max length
B) All tokens at once
C) Random tokens
D) Only first token
Question 5: What is the difference between decoder self-attention and cross-attention?
A) Self-attention attends to decoder's own previous tokens (with causal mask), while cross-attention lets decoder queries attend to encoder outputs, allowing decoder to focus on relevant input parts
B) They're the same
C) Self-attention uses encoder
D) Cross-attention is self-attention
Question 6: Why is causal masking necessary in decoder?
A) Causal masking prevents the model from cheating during training by seeing future tokens, ensuring it learns to generate based only on past context, which matches inference behavior
B) It's not necessary
C) To make it faster
D) To reduce memory
Question 7: What are the main components of a decoder layer?
A) Masked multi-head self-attention, cross-attention (encoder-decoder attention), feed-forward network, residual connections, and layer normalization
B) Only self-attention
C) Only cross-attention
D) Random components
Question 8: How does decoder-only architecture differ from encoder-decoder?
A) Decoder-only (like GPT) uses only decoder layers with causal masking, suitable for generation tasks. Encoder-decoder uses both encoder and decoder, better for tasks requiring understanding input and generating output
B) They're the same
C) Decoder-only has encoder
D) No difference
Question 9: How would you implement causal masking?
A) Create a lower triangular mask matrix (1s for allowed positions, -inf for masked), add to attention scores before softmax. This ensures positions can only attend to themselves and previous positions
B) Just mask all
C) No masking needed
D) Random masking
Question 10: What is teacher forcing in decoder training?
A) During training, decoder receives ground truth previous tokens as input instead of its own predictions, making training faster and more stable, though this creates train-test mismatch
B) It uses predictions
C) It's not used
D) Only in inference
Question 11: How does the decoder use encoder information via cross-attention?
A) Decoder queries attend to encoder keys and values, computing attention weights that determine which encoder positions are most relevant for generating each decoder token, creating a weighted combination of encoder outputs
B) It copies encoder directly
C) It ignores encoder
D) It averages encoder
Question 12: What tasks are decoder-only models best suited for?
A) Generation tasks like text completion, story generation, code generation, where you generate text from scratch or continue given context, without needing to understand structured input
B) Only classification
C) Only translation
D) All tasks equally