Chapter 3: Activation Functions
Non-linearity and Network Capacity - Understanding how activation functions enable neural networks to learn complex patterns
Learning Objectives
- Understand why activation functions are essential
- Master sigmoid, tanh, and ReLU functions
- Learn ReLU variants (Leaky ReLU, ELU, Swish)
- Understand the vanishing gradient problem
- Know when to use each activation function
- Implement activation functions from scratch
Why Do We Need Activation Functions?
The Key to Non-Linearity
Without activation functions, neural networks would just be linear transformations! No matter how many layers you stack, a network without activations can only learn linear relationships. Activation functions introduce non-linearity, enabling networks to learn complex, non-linear patterns.
Mathematical Proof:
Consider a network without activations:
- Layer 1: z_1 = W_1 x + b_1
- Layer 2: z_2 = W_2 z_1 + b_2 = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2
- This is just: z_2 = W'x + b' (still linear!)
Result: Multiple layers collapse into a single linear transformation!
Without Activation Functions
For L layers without activation:
y = W_L(...(W_2(W_1 x + b_1) + b_2)...) + b_L = W'x + b'
What This Means:
- No matter how deep the network, it's equivalent to one layer
- Cannot learn non-linear patterns (curves, circles, XOR, etc.)
- Limited to linear regression capabilities
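To see this collapse numerically, here is a minimal sketch with two randomly initialized linear layers (the names W1, W2, b1, b2 are just illustrative):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two linear layers applied in sequence
z2 = W2 @ (W1 @ x + b1) + b2

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

print(np.allclose(z2, W_prime @ x + b_prime))  # True: both layers collapse into one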
With Activation Functions
For L layers with activation f(·):
a_1 = f(W_1 x + b_1)
...
y = f(W_L a_{L-1} + b_L)
What This Enables:
- Each layer applies a non-linear transformation
- Composition of non-linear functions = complex patterns
- Can approximate any continuous function (Universal Approximation Theorem)
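One quick way to see the difference is to test affinity: any purely linear (affine) network g satisfies g(a) + g(b) - g(0) = g(a + b), while the same network with an activation between layers does not. A small sketch (weights and names chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def linear_net(x):
    return W2 @ (W1 @ x + b1) + b2          # no activation

def tanh_net(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2   # tanh between layers

a, b, zero = rng.standard_normal(3), rng.standard_normal(3), np.zeros(3)

print(np.allclose(linear_net(a) + linear_net(b) - linear_net(zero), linear_net(a + b)))  # True (affine)
print(np.allclose(tanh_net(a) + tanh_net(b) - tanh_net(zero), tanh_net(a + b)))          # False (non-linear)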
Real-World Analogy: Building Blocks
Think of activation functions like different types of building blocks:
- Without activations: Only straight blocks → can only build straight lines
- With activations: Curved blocks, angled blocks → can build complex structures
Example: To draw a circle, you need curves. Linear transformations can only create straight lines. Activation functions provide the "curves" needed for complex shapes!
Properties of Good Activation Functions
An ideal activation function should have:
- Non-linearity: Enables learning complex patterns
- Differentiability: Required for backpropagation (gradient computation)
- Bounded output (often desirable): Keeps activations from exploding, though ReLU shows it is not strictly required
- Computational efficiency: Fast to compute (used millions of times)
- Non-zero gradients: Avoids vanishing gradients
Sigmoid Activation Function
The Classic Choice
The sigmoid function was one of the first activation functions used in neural networks. It squashes any input into a range between 0 and 1, making it perfect for binary classification and probability outputs.
Sigmoid Function
σ(x) = 1 / (1 + e^(-x))
Properties:
- Range: (0, 1) - outputs between 0 and 1
- Monotonic: Always increasing
- Smooth: Infinitely differentiable
- S-shaped: Sigmoid curve
- Centered at 0.5: σ(0) = 0.5
Sigmoid Derivative
Critical for backpropagation:
σ'(x) = σ(x)(1 - σ(x))
Key Insight:
- Derivative is maximum at x = 0 (σ(0) = 0.5, derivative = 0.25)
- Derivative approaches 0 as |x| → ∞
- Problem: Vanishing gradients for large inputs!
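To make the saturation concrete, here is a small sketch that compares the analytic derivative σ(x)(1 - σ(x)) with a finite-difference estimate at a few inputs (the helper names are just for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

eps = 1e-5
for x in [0.0, 2.0, 5.0, 10.0]:
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
    print(f"x = {x:5.1f}  analytic = {sigmoid_grad(x):.6f}  numeric = {numeric:.6f}")
# The gradient falls from 0.25 at x = 0 to roughly 0.000045 at x = 10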
Sigmoid Examples
| Input x | σ(x) | σ'(x) | Interpretation |
|---|---|---|---|
| -5 | 0.007 | 0.007 | Very negative → almost 0 |
| -2 | 0.119 | 0.105 | Negative |
| 0 | 0.500 | 0.250 | Neutral (maximum gradient) |
| 2 | 0.881 | 0.105 | Positive |
| 5 | 0.993 | 0.007 | Very positive → almost 1 |
Sigmoid Implementation
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """
    Sigmoid activation function

    Parameters:
        x: Input (can be scalar, vector, or matrix)

    Returns:
        Sigmoid of x, clipped to prevent overflow
    """
    # Clip to prevent overflow in np.exp
    x_clipped = np.clip(x, -250, 250)
    return 1 / (1 + np.exp(-x_clipped))

def sigmoid_derivative(x):
    """
    Derivative of sigmoid function

    Uses the identity: σ'(x) = σ(x)(1 - σ(x))
    """
    s = sigmoid(x)
    return s * (1 - s)

# Example usage
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
dy = sigmoid_derivative(x)

print(f"Sigmoid(0) = {sigmoid(0):.4f}")
print(f"Sigmoid(5) = {sigmoid(5):.4f}")
print(f"Max derivative = {sigmoid_derivative(0):.4f}")
Problems with Sigmoid
- Vanishing Gradients: For |x| > 5, gradient ≈ 0 → learning stops
- Not Zero-Centered: Output always positive → gradients always have the same sign
- Slow Convergence: Saturated neurons learn slowly
- Computational Cost: Expensive exponential operation
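The vanishing-gradient problem compounds with depth: backpropagation multiplies the local derivatives layer by layer, and each sigmoid contributes at most 0.25. A rough illustration (ignoring the weight terms that also enter the product):

# Upper bound on the gradient signal surviving L sigmoid layers,
# since each layer contributes at most sigma'(0) = 0.25
for depth in [2, 5, 10, 20]:
    print(f"{depth:2d} layers: gradient scaled by at most {0.25 ** depth:.2e}")
# 10 layers already shrink the signal by about a factor of 1e-6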
Hyperbolic Tangent (Tanh)
Zero-Centered Alternative
Tanh is similar to sigmoid but outputs values between -1 and 1. This zero-centered property often makes it perform better than sigmoid in practice, especially in hidden layers.
Tanh Function
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) = 2σ(2x) - 1 (related to sigmoid)
Properties:
- Range: (-1, 1) - zero-centered!
- Shape: Similar S-curve to sigmoid, but symmetric
- tanh(0) = 0: Centered at origin
- Steeper: Steeper than sigmoid around zero (larger gradient)
Tanh Derivative
tanh'(x) = 1 - tanh^2(x)
Comparison with Sigmoid:
- Maximum gradient = 1 (at x = 0) vs sigmoid's 0.25
- Still suffers from vanishing gradients for large |x|
- But better than sigmoid due to zero-centered output
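The code below (a small sketch; function names are illustrative) puts the two derivatives side by side, showing tanh's larger gradient near zero and the saturation both share for large |x|:

import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x)**2

x = np.array([0.0, 1.0, 2.0, 5.0])
print("x:          ", x)
print("sigmoid'(x):", np.round(sigmoid_grad(x), 4))  # peaks at 0.25
print("tanh'(x):   ", np.round(tanh_grad(x), 4))     # peaks at 1.0, but also saturates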
Tanh Implementation
import numpy as np

def tanh(x):
    """Hyperbolic tangent activation"""
    return np.tanh(x)

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2"""
    return 1 - np.tanh(x)**2

# Comparison: Sigmoid vs Tanh
x = np.array([-2, -1, 0, 1, 2])
sigmoid_vals = 1 / (1 + np.exp(-x))
tanh_vals = np.tanh(x)

print("Input:  ", x)
print("Sigmoid:", sigmoid_vals)
print("Tanh:   ", tanh_vals)
print("\nNote: Tanh is zero-centered, sigmoid is not!")
When to Use Tanh vs Sigmoid
| Aspect | Sigmoid | Tanh |
|---|---|---|
| Output Range | (0, 1) | (-1, 1) |
| Zero-Centered | No | Yes |
| Max Gradient | 0.25 | 1.0 |
| Best For | Output layer (probabilities) | Hidden layers |
ReLU (Rectified Linear Unit)
The Modern Standard
ReLU is the most popular activation function for deep neural networks today. It's simple, fast, and solves the vanishing gradient problem for positive inputs. Almost all modern deep learning architectures use ReLU or its variants.
ReLU Function
ReLU(x) = max(0, x) = { x if x > 0; 0 if x ≤ 0 }
Properties:
- Range: [0, ∞) - unbounded above
- Simple: Just returns max(0, x)
- Fast: No expensive exponentials
- Sparsity: Sets negative inputs to 0 (sparse activations)
- No Saturation: For positive x, gradient = 1 (constant!)
ReLU Derivative
ReLU'(x) = { 1 if x > 0; 0 if x ≤ 0 }
Key Advantages:
- Constant gradient: For positive inputs, gradient = 1 (no vanishing!)
- Computational efficiency: Just a simple comparison
- Drawback: The "dead ReLU" problem (gradient = 0 for negative inputs)
ReLU Examples
| Input x | ReLU(x) | ReLU'(x) | Interpretation |
|---|---|---|---|
| -5 | 0 | 0 | Dead neuron (no gradient) |
| -1 | 0 | 0 | Dead neuron |
| 0 | 0 | 0 | Threshold |
| 1 | 1 | 1 | Active (full gradient) |
| 10 | 10 | 1 | Active (no saturation!) |
ReLU Implementation
import numpy as np

def relu(x):
    """Rectified Linear Unit"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU (taken as 0 at x = 0 by convention)"""
    return (x > 0).astype(float)

# Alternative implementation using np.where (np.maximum is already vectorized)
def relu_vectorized(x):
    """ReLU that works with arrays"""
    return np.where(x > 0, x, 0)

# Example
x = np.array([-2, -1, 0, 1, 2, 5])
print("Input:    ", x)
print("ReLU(x):  ", relu(x))
print("ReLU'(x): ", relu_derivative(x))

# Performance comparison
import time

large_x = np.random.randn(1000000)

start = time.time()
result1 = np.maximum(0, large_x)
time1 = time.time() - start

start = time.time()
result2 = np.where(large_x > 0, large_x, 0)
time2 = time.time() - start

print(f"\nmax(0, x) time: {time1:.6f}s")
print(f"where() time:   {time2:.6f}s")
Advantages of ReLU
- No Vanishing Gradient (for positive inputs): Gradient = 1, constant!
- Computational Efficiency: Just max(0, x) - very fast
- Sparsity: Creates sparse representations (many zeros)
- Biological Plausibility: Mimics neuron firing (threshold behavior)
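The sparsity point is easy to quantify: for roughly zero-mean pre-activations, about half of them are negative, so ReLU zeros them out. A minimal sketch with simulated values (not real network data):

import numpy as np

rng = np.random.default_rng(42)
pre_activations = rng.standard_normal(10_000)   # stand-in for a layer's pre-activations
activations = np.maximum(0, pre_activations)

sparsity = np.mean(activations == 0)
print(f"Fraction of exactly-zero activations: {sparsity:.2%}")  # roughly 50%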
Disadvantages of ReLU
- Dead ReLU Problem: Neurons whose pre-activations stay negative output 0 and receive no gradient, so they never recover
- Not Zero-Centered: Output always ≥ 0
- Unbounded: Can output very large values
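Here is a minimal sketch of the dead ReLU problem: a single unit whose bias has been pushed far negative (for example by one bad gradient step) produces a negative pre-activation for every input, so its gradient mask is all zeros and it can never recover. The numbers are illustrative, not from a real training run.

import numpy as np

rng = np.random.default_rng(0)
w, b = rng.standard_normal(3), -20.0    # bias knocked far negative

X = rng.standard_normal((1000, 3))      # typical inputs
z = X @ w + b                           # pre-activations: all well below 0
a = np.maximum(0, z)                    # ReLU output: all zeros

grad_mask = (z > 0).astype(float)       # ReLU'(z), the gradient flowing through the unit
print(f"Active on {int(grad_mask.sum())} of {len(X)} inputs")  # 0 -> no gradient, no updates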
ReLU Variants
Solving ReLU's Problems
Several variants of ReLU have been developed to address its limitations, particularly the "dead ReLU" problem where neurons with negative inputs never activate.
1. Leaky ReLU
Leaky ReLU Formula
LeakyReLU(x) = { x if x > 0; αx if x ≤ 0 }
Where α is a small positive constant (typically 0.01)
Key Improvement:
- Small gradient (α) for negative inputs
- Prevents "dead" neurons
- Allows some information flow even for negative values
Leaky ReLU Implementation
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU activation"""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """Derivative of Leaky ReLU"""
    return np.where(x > 0, 1, alpha)

# Comparison
x = np.array([-2, -1, 0, 1, 2])
print("Input:         ", x)
print("ReLU:          ", np.maximum(0, x))
print("Leaky ReLU:    ", leaky_relu(x))
print("Gradient ReLU: ", (x > 0).astype(float))
print("Gradient LReLU:", leaky_relu_derivative(x))
2. ELU (Exponential Linear Unit)
ELU Formula
ELU(x) = { x if x > 0; α(e^x - 1) if x ≤ 0 }
Where α is typically 1.0
Advantages:
- Smooth curve (differentiable everywhere)
- Negative outputs (zero-centered-like behavior)
- No dead neurons
- Better performance than ReLU in some cases
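ELU's derivative also has a convenient form: for x > 0 it is 1, and for x ≤ 0 it is α·e^x, which equals ELU(x) + α. A small sketch (function names are illustrative):

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(x, alpha=1.0):
    # For x <= 0 the derivative is alpha * exp(x), i.e. elu(x) + alpha
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("ELU(x): ", np.round(elu(x), 4))
print("ELU'(x):", np.round(elu_derivative(x), 4))  # both sides give 1.0 at x = 0 when alpha = 1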
3. Swish (Self-Gated Activation)
Swish Function
Swish(x) = x · σ(x) = x / (1 + e^(-x))
Properties:
- Non-monotonic: Can decrease for negative x
- Smooth: Differentiable everywhere
- Bounded below: Approaches 0 as x → -∞
- Unbounded above: Grows linearly as x → ∞
- Performance: Often outperforms ReLU
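Swish's derivative can be written in terms of the function itself: Swish'(x) = Swish(x) + σ(x)(1 - Swish(x)). A brief sketch (helper names are illustrative):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    s = sigmoid(x)
    return swish(x) + s * (1 - swish(x))

x = np.array([-2.0, -1.28, 0.0, 2.0])
print("Swish(x): ", np.round(swish(x), 4))             # minimum of about -0.28 near x = -1.28
print("Swish'(x):", np.round(swish_derivative(x), 4))  # 0.5 at x = 0; negative for x < -1.28 (non-monotonic)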
All ReLU Variants
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def swish(x):
    """Swish: x * sigmoid(x)"""
    sigmoid_x = 1 / (1 + np.exp(-np.clip(x, -250, 250)))
    return x * sigmoid_x

# Comparison
x = np.linspace(-5, 5, 100)
relu_vals = relu(x)
leaky_vals = leaky_relu(x)
elu_vals = elu(x)
swish_vals = swish(x)

print("Comparison at x = -2:")
# float() unwraps the 0-d arrays that np.where returns for scalar inputs
print(f"ReLU:       {float(relu(-2)):.4f}")
print(f"Leaky ReLU: {float(leaky_relu(-2)):.4f}")
print(f"ELU:        {float(elu(-2)):.4f}")
print(f"Swish:      {float(swish(-2)):.4f}")
Activation Function Comparison
| Function | Range | Gradient at x = 0⁺ | Dead Neurons? | Best For |
|---|---|---|---|---|
| Sigmoid | (0, 1) | 0.25 | No (but saturates) | Output layer |
| Tanh | (-1, 1) | 1.0 | No (but saturates) | RNNs, hidden layers |
| ReLU | [0, ∞) | 1.0 | Yes (for x ≤ 0) | Most deep networks |
| Leaky ReLU | (-∞, ∞) | 1.0 | No | When ReLU fails |
| ELU | (-α, ∞) | 1.0 | No | When smoothness needed |
| Swish | ≈ [-0.28, ∞) | 0.5 | No | Modern architectures |
Choosing the Right Activation Function
Decision Guide
There's no one-size-fits-all activation function. The choice depends on your network architecture, task, and layer position.
By Layer Type
Layer-Specific Recommendations
Input Layer:
- Usually no activation (just passes data through)
- Sometimes normalization instead
Hidden Layers:
- ReLU: Default choice for most deep networks
- Leaky ReLU: If you see many dead neurons
- ELU: When you need smooth gradients
- Swish: For modern architectures (often better than ReLU)
- Tanh: For RNNs and LSTMs
Output Layer:
- Binary Classification: Sigmoid (outputs probability)
- Multi-class Classification: Softmax (outputs a probability distribution; see the sketch below)
- Regression: Linear (no activation) or ReLU (if output ≥ 0)
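Softmax is recommended above but not implemented elsewhere in this chapter, so here is a minimal, numerically stable sketch of how it converts raw scores (logits) into a probability distribution:

import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in np.exp"""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # raw scores for 3 classes
probs = softmax(logits)
print(np.round(probs, 3))            # approximately [0.659 0.242 0.099]
print(probs.sum())                   # 1.0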
By Task Type
Task-Specific Guidelines
| Task | Hidden Layers | Output Layer |
|---|---|---|
| Image Classification | ReLU / Swish | Softmax |
| Binary Classification | ReLU / Leaky ReLU | Sigmoid |
| Regression | ReLU / ELU | Linear / ReLU |
| RNN / LSTM | Tanh / Sigmoid | Softmax / Linear |
Activation Function Factory
import numpy as np

class ActivationFunction:
    """Factory for activation functions"""

    @staticmethod
    def get(name):
        """Get activation function by name"""
        activations = {
            'sigmoid': ActivationFunction.sigmoid,
            'tanh': ActivationFunction.tanh,
            'relu': ActivationFunction.relu,
            'leaky_relu': ActivationFunction.leaky_relu,
            'elu': ActivationFunction.elu,
            'swish': ActivationFunction.swish,
            'linear': ActivationFunction.linear
        }
        return activations.get(name, ActivationFunction.relu)

    @staticmethod
    def sigmoid(x):
        x = np.clip(x, -250, 250)
        return 1 / (1 + np.exp(-x))

    @staticmethod
    def tanh(x):
        return np.tanh(x)

    @staticmethod
    def relu(x):
        return np.maximum(0, x)

    @staticmethod
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)

    @staticmethod
    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))

    @staticmethod
    def swish(x):
        sigmoid_x = 1 / (1 + np.exp(-np.clip(x, -250, 250)))
        return x * sigmoid_x

    @staticmethod
    def linear(x):
        return x

# Usage
activation = ActivationFunction.get('relu')
x = np.array([-2, -1, 0, 1, 2])
print(activation(x))
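For backpropagation you also need each activation's derivative. One way to extend the factory idea, shown here as a sketch rather than part of the class above, is a parallel lookup of derivative functions:

import numpy as np

def _sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -250, 250)))

# Hypothetical companion table, mirroring ActivationFunction.get above
ACTIVATION_DERIVATIVES = {
    'sigmoid':    lambda x: _sigmoid(x) * (1 - _sigmoid(x)),
    'tanh':       lambda x: 1 - np.tanh(x)**2,
    'relu':       lambda x: (x > 0).astype(float),
    'leaky_relu': lambda x, alpha=0.01: np.where(x > 0, 1.0, alpha),
    'linear':     lambda x: np.ones_like(x, dtype=float),
}

x = np.array([-2.0, 0.0, 2.0])
print(ACTIVATION_DERIVATIVES['relu'](x))                  # [0. 0. 1.]
print(np.round(ACTIVATION_DERIVATIVES['sigmoid'](x), 3))  # [0.105 0.25  0.105]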