Chapter 3: Activation Functions

Non-linearity and Network Capacity - Understanding how activation functions enable neural networks to learn complex patterns

Learning Objectives

  • Understand why activation functions are essential
  • Master sigmoid, tanh, and ReLU functions
  • Learn ReLU variants (Leaky ReLU, ELU, Swish)
  • Understand the vanishing gradient problem
  • Know when to use each activation function
  • Implement activation functions from scratch

Why Do We Need Activation Functions?

🔑 The Key to Non-Linearity

Without activation functions, neural networks would just be linear transformations! No matter how many layers you stack, a network without activations can only learn linear relationships. Activation functions introduce non-linearity, enabling networks to learn complex, non-linear patterns.

Mathematical Proof:

Consider a network without activations:

  • Layer 1: z₁ = W₁x + b₁
  • Layer 2: z₂ = W₂z₁ + b₂ = W₂(W₁x + b₁) + b₂ = W₂W₁x + W₂b₁ + b₂
  • This is just: z₂ = W'x + b' (still linear!)

Result: Multiple layers collapse into a single linear transformation!

Without Activation Functions

For L layers without activation:

\[y = W_L W_{L-1} \cdots W_1 x + \text{(bias terms)} = W'x + b' \quad \text{(a single linear transformation)}\]
What This Means:
  • No matter how deep the network, it's equivalent to one layer
  • Cannot learn non-linear patterns (curves, circles, XOR, etc.)
  • Limited to linear regression capabilities
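
A quick numerical check of this collapse, using arbitrary random weights (a minimal sketch; the shapes and values are illustrative only):

import numpy as np

# Two "layers" without activation: z2 = W2(W1 x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layers, one_layer))  # True: two layers collapse into one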

With Activation Functions

For L layers with activation f(·):

a₁ = f(W₁x + b₁)
a₂ = f(W₂a₁ + b₂)
...
y = f(W_L a_{L-1} + b_L)
What This Enables:
  • Each layer applies a non-linear transformation
  • Composition of non-linear functions = complex patterns
  • Can approximate any continuous function (Universal Approximation Theorem)
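
As a concrete illustration, a two-layer network with ReLU activations can compute XOR, which no single linear layer can. The weights below are hand-picked for this construction rather than learned (a minimal sketch):

import numpy as np

def relu(x):
    return np.maximum(0, x)

# XOR(x1, x2) = ReLU(x1 + x2) - 2 * ReLU(x1 + x2 - 1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W1 = np.array([[1.0, 1.0],    # hidden unit 1: x1 + x2
               [1.0, 1.0]])   # hidden unit 2: x1 + x2 - 1 (via its bias)
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])    # linear readout: h1 - 2*h2

h = relu(X @ W1.T + b1)       # non-linear hidden layer
y = h @ W2
print(y)                      # [0. 1. 1. 0.] = XOR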

📚 Real-World Analogy: Building Blocks

Think of activation functions like different types of building blocks:

  • Without activations: Only straight blocks → can only build straight lines
  • With activations: Curved blocks, angled blocks → can build complex structures

Example: To draw a circle, you need curves. Linear transformations can only create straight lines. Activation functions provide the "curves" needed for complex shapes!

Properties of Good Activation Functions

An ideal activation function should have:

  • Non-linearity: Enables learning complex patterns
  • Differentiability: Required for backpropagation (gradient computation)
  • Bounded output (helpful but not required): Keeps activations from growing without limit, though unbounded functions like ReLU work well in practice
  • Computational efficiency: Fast to compute (used millions of times)
  • Non-zero gradients: Avoids vanishing gradients

Sigmoid Activation Function

📈 The Classic Choice

The sigmoid function was one of the first activation functions used in neural networks. It squashes any input into a range between 0 and 1, making it perfect for binary classification and probability outputs.

Sigmoid Function

σ(x) = 1 / (1 + e^(-x))
Properties:
  • Range: (0, 1) - outputs between 0 and 1
  • Monotonic: Always increasing
  • Smooth: Infinitely differentiable
  • S-shaped: Sigmoid curve
  • Centered at 0.5: σ(0) = 0.5

Sigmoid Derivative

Critical for backpropagation:

σ'(x) = σ(x)(1 - σ(x))
Key Insight:
  • Derivative is maximum at x = 0 (σ(0) = 0.5, derivative = 0.25)
  • Derivative approaches 0 as |x| → ∞
  • Problem: Vanishing gradients for large inputs!

Sigmoid Examples

Input x | σ(x)  | σ'(x) | Interpretation
-5      | 0.007 | 0.007 | Very negative → almost 0
-2      | 0.119 | 0.105 | Negative
0       | 0.500 | 0.250 | Neutral (maximum gradient)
2       | 0.881 | 0.105 | Positive
5       | 0.993 | 0.007 | Very positive → almost 1

Sigmoid Implementation

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """
    Sigmoid activation function
    
    Parameters:
    x: Input (can be scalar, vector, or matrix)
    
    Returns:
    Sigmoid of x, clipped to prevent overflow
    """
    # Clip to prevent overflow
    x_clipped = np.clip(x, -250, 250)
    return 1 / (1 + np.exp(-x_clipped))

def sigmoid_derivative(x):
    """
    Derivative of sigmoid function
    
    Uses the identity: σ'(x) = σ(x)(1 - σ(x))
    """
    s = sigmoid(x)
    return s * (1 - s)

# Example usage
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
dy = sigmoid_derivative(x)

print(f"Sigmoid(0) = {sigmoid(0):.4f}")
print(f"Sigmoid(5) = {sigmoid(5):.4f}")
print(f"Max derivative = {sigmoid_derivative(0):.4f}")

⚠️ Problems with Sigmoid

  • Vanishing Gradients: For |x| > 5, gradient ≈ 0 → learning stops
  • Not Zero-Centered: Output always positive → gradients always same sign
  • Slow Convergence: Saturated neurons learn slowly
  • Computational Cost: Expensive exponential operation
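
The first problem is easy to see numerically. During backpropagation through a stack of sigmoid layers, the chain rule multiplies one σ'(·) factor per layer; since σ'(x) ≤ 0.25, the product shrinks toward zero. A minimal sketch (ignoring the weight-matrix factors, with an arbitrary pre-activation of 2 in every layer):

import numpy as np

def sigmoid_derivative(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# One derivative factor per layer; each factor is at most 0.25
pre_activations = np.full(20, 2.0)                    # 20 layers, pre-activation = 2
grad = np.prod(sigmoid_derivative(pre_activations))
print(f"Gradient factor after 20 sigmoid layers: {grad:.2e}")   # on the order of 1e-20

# ReLU keeps the factor at exactly 1 for positive pre-activations
print(f"Gradient factor after 20 ReLU layers:    {np.prod(np.ones(20)):.2e}")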

Hyperbolic Tangent (Tanh)

Zero-Centered Alternative

Tanh is similar to sigmoid but outputs values between -1 and 1. This zero-centered property makes it often perform better than sigmoid in practice, especially in hidden layers.

Tanh Function

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
= 2σ(2x) - 1 (related to sigmoid)
Properties:
  • Range: (-1, 1) - zero-centered!
  • Shape: Similar S-curve to sigmoid, but symmetric
  • tanh(0) = 0: Centered at origin
  • Steeper: Gradient is steeper than sigmoid

Tanh Derivative

tanh'(x) = 1 - tanh²(x)
Comparison with Sigmoid:
  • Maximum gradient = 1 (at x = 0) vs sigmoid's 0.25
  • Still suffers from vanishing gradients for large |x|
  • But better than sigmoid due to zero-centered output
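
A quick numerical check of these claims (tanh's maximum gradient is 4x sigmoid's, but both still saturate):

import numpy as np

def sigmoid_derivative(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

for x in [0.0, 2.0, 5.0]:
    print(f"x = {x}: sigmoid' = {sigmoid_derivative(x):.4f}, tanh' = {tanh_derivative(x):.4f}")
# x = 0.0: sigmoid' = 0.2500, tanh' = 1.0000
# x = 5.0: both derivatives are essentially 0 (both functions saturate)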

Tanh Implementation

import numpy as np

def tanh(x):
    """Hyperbolic tangent activation"""
    return np.tanh(x)

def tanh_derivative(x):
    """Derivative of tanh"""
    return 1 - np.tanh(x)**2

# Comparison: Sigmoid vs Tanh
x = np.array([-2, -1, 0, 1, 2])
sigmoid_vals = 1 / (1 + np.exp(-x))
tanh_vals = np.tanh(x)

print("Input:", x)
print("Sigmoid:", sigmoid_vals)
print("Tanh:   ", tanh_vals)
print("\nNote: Tanh is zero-centered, sigmoid is not!")

When to Use Tanh vs Sigmoid

Aspect        | Sigmoid                      | Tanh
Output Range  | (0, 1)                       | (-1, 1)
Zero-Centered | No                           | Yes ✓
Max Gradient  | 0.25                         | 1.0
Best For      | Output layer (probabilities) | Hidden layers

ReLU (Rectified Linear Unit)

The Modern Standard

ReLU is the most popular activation function for deep neural networks today. It's simple, fast, and solves the vanishing gradient problem for positive inputs. Almost all modern deep learning architectures use ReLU or its variants.

ReLU Function

ReLU(x) = max(0, x) = { x   if x > 0
                        0   if x ≤ 0 }
Properties:
  • Range: [0, ∞) - unbounded above
  • Simple: Just returns max(0, x)
  • Fast: No expensive exponentials
  • Sparsity: Sets negative inputs to 0 (sparse activations)
  • No Saturation: For positive x, gradient = 1 (constant!)

ReLU Derivative

ReLU'(x) = { 1   if x > 0
             0   if x ≤ 0 }
Key Advantages:
  • Constant gradient: For positive inputs, gradient = 1 (no vanishing!)
  • Computational efficiency: Just a simple comparison
  • Problem: Dead ReLU problem (gradient = 0 for negative inputs)

ReLU Examples

Input x | ReLU(x) | ReLU'(x) | Interpretation
-5      | 0       | 0        | Dead neuron (no gradient)
-1      | 0       | 0        | Dead neuron
0       | 0       | 0        | Threshold
1       | 1       | 1        | Active (full gradient)
10      | 10      | 1        | Active (no saturation!)

ReLU Implementation

import numpy as np

def relu(x):
    """Rectified Linear Unit"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU"""
    return (x > 0).astype(float)

# Alternative implementation using np.where (np.maximum above is already vectorized)
def relu_vectorized(x):
    """ReLU via np.where; equivalent to np.maximum(0, x)"""
    return np.where(x > 0, x, 0)

# Example
x = np.array([-2, -1, 0, 1, 2, 5])
print("Input:    ", x)
print("ReLU(x):  ", relu(x))
print("ReLU'(x): ", relu_derivative(x))

# Performance comparison
import time
large_x = np.random.randn(1000000)

start = time.time()
result1 = np.maximum(0, large_x)
time1 = time.time() - start

start = time.time()
result2 = np.where(large_x > 0, large_x, 0)
time2 = time.time() - start

print(f"\nmax(0, x) time: {time1:.6f}s")
print(f"where() time:   {time2:.6f}s")

✅ Advantages of ReLU

  • No Vanishing Gradient (for positive inputs): Gradient = 1, constant!
  • Computational Efficiency: Just max(0, x) - very fast
  • Sparsity: Creates sparse representations (many zeros)
  • Biological Plausibility: Mimics neuron firing (threshold behavior)

⚠️ Disadvantages of ReLU

  • Dead ReLU Problem: Neurons with negative inputs never activate
  • Not Zero-Centered: Output always ≥ 0
  • Unbounded: Can output very large values
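
The dead-ReLU problem can be demonstrated directly: if a neuron's pre-activation is negative for every input it sees, its output and its gradient are both zero, so gradient descent never updates its weights. A minimal sketch with a deliberately bad (large negative) bias:

import numpy as np

def relu_derivative(x):
    return (x > 0).astype(float)

# A neuron whose large negative bias keeps every pre-activation below zero
w, b = np.array([0.5, -0.3]), -10.0
X = np.random.randn(1000, 2)              # typical standardized inputs
z = X @ w + b                             # pre-activations: all around -10

print("Fraction of active inputs:", np.mean(z > 0))               # 0.0 -> "dead" neuron
print("Mean gradient factor:     ", np.mean(relu_derivative(z)))  # 0.0 -> weights never update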

ReLU Variants

Solving ReLU's Problems

Several variants of ReLU have been developed to address its limitations, particularly the "dead ReLU" problem where neurons with negative inputs never activate.

1. Leaky ReLU

Leaky ReLU Formula

LeakyReLU(x) = { x    if x > 0
                 αx   if x ≤ 0 }

Where α is a small positive constant (typically 0.01)

Key Improvement:
  • Small gradient (α) for negative inputs
  • Prevents "dead" neurons
  • Allows some information flow even for negative values

Leaky ReLU Implementation

import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU activation"""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """Derivative of Leaky ReLU"""
    return np.where(x > 0, 1, alpha)

# Comparison
x = np.array([-2, -1, 0, 1, 2])
print("Input:        ", x)
print("ReLU:         ", np.maximum(0, x))
print("Leaky ReLU:   ", leaky_relu(x))
print("Gradient ReLU:", (x > 0).astype(float))
print("Gradient LReLU:", leaky_relu_derivative(x))

2. ELU (Exponential Linear Unit)

ELU Formula

ELU(x) = { x            if x > 0
           α(e^x - 1)   if x ≤ 0 }

Where α is typically 1.0

Advantages:
  • Smooth curve (differentiable everywhere)
  • Negative outputs (zero-centered-like behavior)
  • No dead neurons
  • Better performance than ReLU in some cases

3. Swish (Self-Gated Activation)

Swish Function

Swish(x) = x · σ(x) = x / (1 + e^(-x))
Properties:
  • Non-monotonic: Can decrease for negative x
  • Smooth: Differentiable everywhere
  • Bounded below: Dips to a minimum of about -0.28, then approaches 0 as x → -∞
  • Unbounded above: Grows like x as x → ∞
  • Performance: Often outperforms ReLU

All ReLU Variants

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def swish(x):
    """Swish: x * sigmoid(x)"""
    sigmoid_x = 1 / (1 + np.exp(-np.clip(x, -250, 250)))
    return x * sigmoid_x

# Comparison
x = np.linspace(-5, 5, 100)
relu_vals = relu(x)
leaky_vals = leaky_relu(x)
elu_vals = elu(x)
swish_vals = swish(x)

print("Comparison at x = -2:")
print(f"ReLU:       {relu(-2):.4f}")
print(f"Leaky ReLU: {leaky_relu(-2):.4f}")
print(f"ELU:        {elu(-2):.4f}")
print(f"Swish:      {swish(-2):.4f}")

Activation Function Comparison

Function   | Range       | Gradient at x=0 | Dead Neurons?      | Best For
Sigmoid    | (0, 1)      | 0.25            | No (but saturates) | Output layer
Tanh       | (-1, 1)     | 1.0             | No (but saturates) | RNNs, hidden layers
ReLU       | [0, ∞)      | 1.0             | Yes (for x ≤ 0)    | Most deep networks
Leaky ReLU | (-∞, ∞)     | 1.0             | No                 | When ReLU fails
ELU        | (-α, ∞)     | 1.0             | No                 | When smoothness needed
Swish      | [≈-0.28, ∞) | 0.5             | No                 | Modern architectures

Choosing the Right Activation Function

Decision Guide

There's no one-size-fits-all activation function. The choice depends on your network architecture, task, and layer position.

By Layer Type

Layer-Specific Recommendations

Input Layer:

  • Usually no activation (just passes data through)
  • Sometimes normalization instead

Hidden Layers:

  • ReLU: Default choice for most deep networks
  • Leaky ReLU: If you see many dead neurons
  • ELU: When you need smooth gradients
  • Swish: For modern architectures (often better than ReLU)
  • Tanh: For RNNs and LSTMs

Output Layer:

  • Binary Classification: Sigmoid (outputs probability)
  • Multi-class Classification: Softmax (outputs probability distribution)
  • Regression: Linear (no activation) or ReLU (if output ≥ 0)
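
Softmax is recommended above for multi-class outputs but is not defined elsewhere in this chapter; a minimal, numerically stable sketch:

import numpy as np

def softmax(z):
    """Convert a vector of raw scores into a probability distribution."""
    exp_z = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)           # approximately [0.659 0.242 0.099]
print(probs.sum())     # sums to 1: a valid probability distribution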

By Task Type

Task-Specific Guidelines

Task                  | Hidden Layers     | Output Layer
Image Classification  | ReLU / Swish      | Softmax
Binary Classification | ReLU / Leaky ReLU | Sigmoid
Regression            | ReLU / ELU        | Linear / ReLU
RNN / LSTM            | Tanh / Sigmoid    | Softmax / Linear

Activation Function Factory

import numpy as np

class ActivationFunction:
    """Factory for activation functions"""
    
    @staticmethod
    def get(name):
        """Get activation function by name"""
        activations = {
            'sigmoid': ActivationFunction.sigmoid,
            'tanh': ActivationFunction.tanh,
            'relu': ActivationFunction.relu,
            'leaky_relu': ActivationFunction.leaky_relu,
            'elu': ActivationFunction.elu,
            'swish': ActivationFunction.swish,
            'linear': ActivationFunction.linear
        }
        # Default to ReLU if the requested name is not recognized
        return activations.get(name, ActivationFunction.relu)
    
    @staticmethod
    def sigmoid(x):
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))
    
    @staticmethod
    def tanh(x):
        return np.tanh(x)
    
    @staticmethod
    def relu(x):
        return np.maximum(0, x)
    
    @staticmethod
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)
    
    @staticmethod
    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))
    
    @staticmethod
    def swish(x):
        sigmoid_x = 1 / (1 + np.exp(-np.clip(x, -250, 250)))
        return x * sigmoid_x
    
    @staticmethod
    def linear(x):
        return x

# Usage
activation = ActivationFunction.get('relu')
x = np.array([-2, -1, 0, 1, 2])
print(activation(x))

Test Your Understanding

Question 1: Why do we need activation functions in neural networks?

A) To make computation faster
B) To introduce non-linearity and enable learning complex patterns
C) To reduce memory usage
D) To prevent overfitting

Question 2: What is the main problem with sigmoid activation?

A) It's too slow
B) Vanishing gradients for large inputs
C) It outputs negative values
D) It's not differentiable

Question 3: What is the "dead ReLU" problem?

A) Neurons with negative inputs never activate and stop learning
B) ReLU is too slow
C) ReLU outputs are always zero
D) ReLU causes overfitting

Question 4: Interview question: "Compare ReLU, Leaky ReLU, and ELU activation functions."

A) ReLU: simple, fast, but has dead neuron problem. Leaky ReLU: fixes dead neurons with small negative slope. ELU: smooth, negative values, better gradient flow but computationally more expensive
B) They are all the same
C) ReLU is always better
D) ELU is always faster

Question 5: What is the mathematical formula for the sigmoid activation function?

A) \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
B) \(\sigma(x) = x\)
C) \(\sigma(x) = e^x\)
D) \(\sigma(x) = \max(0, x)\)

Question 6: Interview question: "When would you use sigmoid vs softmax in the output layer?"

A) Sigmoid for binary classification (single output probability), softmax for multi-class classification (probability distribution over multiple classes that sums to 1)
B) They are interchangeable
C) Sigmoid for multi-class, softmax for binary
D) Use sigmoid always

Question 7: What is the gradient of ReLU at x = 0?

A) Technically undefined, but typically set to 0 or 1 in practice
B) Always 1
C) Always 0
D) Always 0.5

Question 8: Interview question: "How do you choose an activation function for hidden layers?"

A) ReLU is default for most cases (fast, simple). Use Leaky ReLU/ELU if dead neurons are a problem. Use tanh for RNNs. Consider Swish for modern architectures. Test multiple and choose based on validation performance
B) Always use sigmoid
C) Random selection
D) Use the same as output layer

Question 9: What is the main advantage of Swish over ReLU?

A) Swish is smooth and non-monotonic, often providing better performance, especially in deeper networks, without the dead neuron problem
B) Swish is faster
C) Swish uses less memory
D) Swish prevents overfitting

Question 10: Interview question: "Explain the vanishing gradient problem and how activation functions relate to it."

A) Vanishing gradients occur when gradients become extremely small during backpropagation. Sigmoid/tanh saturate (derivative → 0) for large inputs, causing gradients to vanish. ReLU and variants help by having constant gradient (1) for positive inputs, allowing gradients to flow better
B) Only affects ReLU
C) Not related to activation functions
D) Only happens with linear activation

Question 11: What is the range of the tanh activation function?

A) (-1, 1)
B) (0, 1)
C) [0, ∞)
D) (-∞, ∞)

Question 12: Interview question: "How would you implement a custom activation function and integrate it into a neural network?"

A) Define forward function f(x), backward function f'(x) for gradient computation, ensure differentiability, test on small network, integrate into framework (PyTorch/TensorFlow) by subclassing activation module or using lambda functions
B) Just use any function
C) Only forward function needed
D) Cannot be done
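
As a concrete version of answer A, a custom activation in a framework can be as small as a module that defines the forward pass; the sketch below assumes PyTorch is available and lets autograd derive the backward pass automatically:

import torch
import torch.nn as nn

class Swish(nn.Module):
    """Custom activation: f(x) = x * sigmoid(x); autograd computes the gradient."""
    def forward(self, x):
        return x * torch.sigmoid(x)

# Drop it into a network like any built-in activation
model = nn.Sequential(nn.Linear(4, 8), Swish(), nn.Linear(8, 1))
x = torch.randn(2, 4)
print(model(x).shape)   # torch.Size([2, 1])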