Chapter 8: Training Tips & Best Practices
Practical strategies for successful neural network training
Learning Objectives
- Understand learning rate selection and scheduling
- Master regularization techniques (dropout, L2, early stopping)
- Learn different optimization algorithms
- Understand how to monitor training effectively
- Develop a systematic approach to training neural networks
- Recognize and fix common training problems
Training Neural Networks Successfully
The Art and Science of Training
Training neural networks requires balancing many hyperparameters and techniques. This chapter covers practical strategies that make the difference between a network that learns and one that doesn't.
Key Areas:
- Learning Rate: Most critical hyperparameter
- Regularization: Prevent overfitting
- Optimization: Better algorithms than basic gradient descent
- Monitoring: Know when to stop, what to adjust
Learning Rate Selection
The Most Important Hyperparameter
The learning rate controls how large a step we take during optimization. If it is too large, updates overshoot the minimum; if it is too small, training takes a very long time or stalls.
Learning Rate in Gradient Descent
W ← W - η × ∇L(W)
Where η (eta) is the learning rate and ∇L(W) is the gradient of the loss with respect to the weights
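Typical Values (rough guides; the right range depends on the model, optimizer, and data):
- Too Large (often > 0.1): Loss explodes, training unstable
- Good Range: 0.001 to 0.01 (common starting point)
- Too Small (often < 0.0001): Training very slow, may not converge within the training budget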
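The three regimes can be seen on a toy problem. The sketch below minimizes f(w) = w² with plain gradient descent; the function name `gradient_descent` and the chosen rates are illustrative assumptions, and the threshold at which training becomes unstable is specific to this toy function (here any rate above 1.0 diverges), not a general rule.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # df/dw for f(w) = w**2
        w = w - lr * grad     # W <- W - eta * gradient
    return w


# Too large, reasonable, and too small for this particular toy problem
for lr in (1.1, 0.1, 0.0001):
    print(f"lr={lr}: final w = {gradient_descent(lr):.4g}")
```

With the large rate the iterate grows in magnitude at every step, with the moderate rate it shrinks quickly toward 0, and with the tiny rate it has barely moved after 50 steps.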
Learning Rate Schedules
Common strategies:
- Fixed: Same rate throughout (simple but often suboptimal)
- Step Decay: Reduce by a constant factor every N epochs
- Exponential Decay: η_t = η₀ × decay^t, with 0 < decay < 1
- Cosine Annealing: Smooth decrease from η₀ toward a minimum rate following a cosine curve
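The schedules listed above can be sketched as small functions of the epoch number. The function names, the default factors (`drop=0.5`, `decay=0.95`), and the `lr_min` parameter are assumptions chosen for illustration, not values prescribed by this chapter.

```python
import math


def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)


def exponential_decay(lr0, epoch, decay=0.95):
    """eta_t = eta_0 * decay**t."""
    return lr0 * decay ** epoch


def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Decrease smoothly from lr0 (epoch 0) to lr_min (epoch == total_epochs)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + cosine)


for epoch in (0, 10, 50, 100):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.1, epoch, total_epochs=100))
```

Each function returns the rate to use for a given epoch, so a training loop would call one of them at the start of every epoch and pass the result to the optimizer.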
Learning Rate Finder
import numpy as np
import matplotlib.pyplot as plt

def find_learning_rate(model, train_data, start_lr=1e-8, end_lr=1.0, num_iterations=100):
    """
    Learning rate range test.

    Strategy: train with exponentially increasing learning rates, then
    plot loss vs. learning rate to find a good range.

    Note: train_with_lr is not defined in this chapter. It is assumed to be a
    helper you supply that trains for a few iterations and returns the loss.
    """
    learning_rates = np.logspace(np.log10(start_lr), np.log10(end_lr), num_iterations)
    losses = []
    for lr in learning_rates:
        # Train for a few iterations with this learning rate
        loss = train_with_lr(model, train_data, lr, iterations=10)
        losses.append(loss)
    # Plot on a log x-axis; the useful range is where the loss falls most steeply
    plt.semilogx(learning_rates, losses)
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
    plt.show()
    return learning_rates, losses

# Best practice: Start with learning rate finder, then use schedule
# Typical: Start at 10x lower than where loss starts increasing
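Reading the steepest downward slope off the plot can also be done numerically. The helper below is a hypothetical sketch (the name `suggest_lr` is not part of the chapter's code), and the loss curve it is run on is synthetic, made up only so the example is self-contained. On real range-test output the curve is noisy, so smoothing the losses first is usually advisable.

```python
import numpy as np


def suggest_lr(learning_rates, losses):
    """Return the learning rate at which the loss curve falls most steeply."""
    log_lrs = np.log10(np.asarray(learning_rates, dtype=float))
    losses = np.asarray(losses, dtype=float)
    slopes = np.gradient(losses, log_lrs)   # d(loss) / d(log10 lr)
    return float(learning_rates[int(np.argmin(slopes))])


# Synthetic range-test curve: flat, then falling around 1e-3, then exploding
lrs = np.logspace(-6, 0, 61)
log_lrs = np.log10(lrs)
losses = 2.0 - 1.5 / (1 + np.exp(-4 * (log_lrs + 3))) + np.exp(6 * (log_lrs + 0.5))

print(f"suggested learning rate: {suggest_lr(lrs, losses):.1e}")
```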
Regularization Techniques
🛡️ Preventing Overfitting
Regularization techniques prevent neural networks from memorizing training data and help them generalize to new data.
1. Dropout
Dropout Formula
During Training:
h_drop = h ⊙ mask / (1 - p)
During Inference:
h_drop = h
Where p is the dropout probability and mask is a random binary vector whose entries are 1 with probability 1 - p
How It Works:
- Randomly set some neurons to zero during training
- Forces the network not to rely on specific neurons
- Scale the surviving activations by 1 / (1 - p) during training (inverted dropout), so no rescaling is needed at test time
- Common values: p = 0.5 for hidden layers, p = 0.2 for input layer
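A minimal sketch of the inverted-dropout variant, which matches the training formula h ⊙ mask / (1 - p): because the surviving activations are already rescaled during training, the inference pass returns h unchanged. The function name `dropout_forward` and its `training` flag are assumptions made for this example.

```python
import numpy as np


def dropout_forward(h, p, training, rng=None):
    """Inverted dropout: drop units with probability p and rescale the rest."""
    if not training or p == 0.0:
        return h                        # inference: activations pass through unchanged
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p     # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)         # rescale so the expected value stays the same


rng = np.random.default_rng(0)
h = np.ones((1000, 100))
h_train = dropout_forward(h, p=0.5, training=True, rng=rng)
h_test = dropout_forward(h, p=0.5, training=False)
print("mean activation (train):", h_train.mean())
print("mean activation (test): ", h_test.mean())
```

The two means come out close to each other, which is the point of the 1 / (1 - p) scaling: the next layer sees activations of the same expected size in both modes.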
2. L2 Regularization (Weight Decay)
L2 Regularization
L_total = L_data + λ × Σ W²
Where λ (lambda) is the regularization strength, L_data is the unregularized loss, and the sum runs over all weights
Effect:
- Penalizes large weights
- Encourages simpler models
- Prevents overfitting
- Typical λ: 0.0001 to 0.01
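As a rough sketch, the penalty and its gradient can be computed as follows, assuming the penalty has the form λ × Σ W² (some texts use λ/2 instead, which removes the factor of 2 from the gradient). The function names and the `data_loss` value are placeholders for illustration.

```python
import numpy as np


def l2_penalty(weights, lam):
    """lam * sum of squared weights over a list of weight arrays."""
    return lam * sum(np.sum(W ** 2) for W in weights)


def l2_gradient(W, lam):
    """Gradient of lam * sum(W**2) with respect to W."""
    return 2.0 * lam * W


W = np.array([[1.0, -2.0], [0.5, 0.0]])
lam = 0.01
data_loss = 0.30                              # placeholder for the unregularized loss
total_loss = data_loss + l2_penalty([W], lam)
print("total loss:", total_loss)
print("extra gradient on W:\n", l2_gradient(W, lam))
```

The extra gradient term is proportional to the weight itself, so every update pulls each weight slightly toward zero; this is why the technique is also called weight decay.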
3. Early Stopping
Stop training when validation loss stops improving.
- Monitor validation loss during training
- If no improvement for N epochs (patience), stop
- Use best model (lowest validation loss)
- Prevents overfitting by stopping before memorization
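The patience logic above can be sketched as a small helper. The class name `EarlyStopping`, the `min_delta` parameter, and the validation losses in the demo are assumptions for illustration; a real training loop would also save a model checkpoint whenever a new best loss is recorded.

```python
class EarlyStopping:
    """Track validation loss and signal when to stop training."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0   # a real loop would checkpoint here
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses: improvement stalls after epoch 2
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.75, 0.74]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "best epoch", stopper.best_epoch)
```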
Optimization Algorithms
🚀 Beyond Basic Gradient Descent
Modern optimizers use adaptive learning rates and momentum to train faster and more reliably.
1. Momentum
Momentum Update
v_t = β × v_{t-1} + ∇L
W ← W - η × v_t
Where β is the momentum coefficient (typically 0.9) and v_t is the velocity, a running accumulation of past gradients
Benefits:
- Accumulates gradient over time (like momentum in physics)
- Smooths out noisy gradients
- Faster convergence, especially in narrow valleys
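The narrow-valley benefit can be checked on a toy problem. This sketch assumes the velocity update v ← β × v + ∇L followed by W ← W - η × v (other texts scale the gradient by 1 - β), and compares it with plain gradient descent, which is the same update with β = 0. The function names and the quadratic test function are assumptions chosen for illustration.

```python
def sgd_momentum_step(w, v, grad, lr, beta):
    """One update: v <- beta * v + grad, then w <- w - lr * v."""
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v


def distance_after_training(beta, lr=0.02, steps=100):
    """Minimize f(x, y) = 0.5 * (x**2 + 25 * y**2) starting from (1, 1).

    Returns the distance to the minimum at (0, 0) after `steps` updates.
    """
    w, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        grad = [w[0], 25.0 * w[1]]          # gradient of the quadratic above
        w, v = sgd_momentum_step(w, v, grad, lr, beta)
    return (w[0] ** 2 + w[1] ** 2) ** 0.5


plain = distance_after_training(beta=0.0)          # beta = 0 is plain gradient descent
with_momentum = distance_after_training(beta=0.9)
print(f"plain GD: {plain:.4f}   momentum: {with_momentum:.4f}")
```

The learning rate is limited by the steep y direction, so plain gradient descent crawls along the shallow x direction; momentum builds up speed there and ends much closer to the minimum in the same number of steps.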
2. Adam (Adaptive Moment Estimation)
Adam Algorithm
m_t = β₁ × m_{t-1} + (1 - β₁) × ∇L (first moment)
v_t = β₂ × v_{t-1} + (1 - β₂) × (∇L)² (second moment)
m̂_t = m_t / (1 - β₁^t) (bias correction)
v̂_t = v_t / (1 - β₂^t) (bias correction)
W ← W - η × m̂_t / (√v̂_t + ε)
Default Parameters:
- β₁ = 0.9: Decay rate of the first-moment (momentum) average
- β₂ = 0.999: Decay rate of the second-moment (squared-gradient) average
- ε = 1e-8: Small constant (prevents division by zero)
- η = 0.001: Learning rate (often works well as-is)
Adam Optimizer Implementation
import numpy as np

class AdamOptimizer:
    """Adam (Adaptive Moment Estimation) Optimizer"""

    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.t = 0  # Time step
        # Per-parameter moments
        self.m = {}  # First moment
        self.v = {}  # Second moment

    def update(self, params, grads):
        """Update parameters using Adam"""
        self.t += 1
        for key in params.keys():
            # Initialize moments if needed
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            # Update biased first moment
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            # Update biased second moment
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

# Usage
optimizer = AdamOptimizer(learning_rate=0.001)
# Use in training loop: params = optimizer.update(params, grads)
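One consequence of the bias correction is worth seeing in numbers: on the very first step, m̂ equals the gradient and v̂ equals its square, so the update has magnitude close to η regardless of how large or small the gradient is. The scalar function `adam_step` below is a self-contained sketch written for this check, separate from the class above.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at time step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad         # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2    # biased second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Gradients spanning six orders of magnitude give nearly the same first step
for grad in (1e-3, 1.0, 1e3):
    w, _, _ = adam_step(w=0.0, m=0.0, v=0.0, grad=grad, t=1)
    print(f"grad={grad:g}: first update = {w:.6f}")
```

This scale invariance is a large part of why the default η = 0.001 transfers across problems better than a fixed rate does with plain gradient descent.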
Monitoring Training
What to Watch
Effective monitoring helps you understand what's happening during training and when to make adjustments.
📈 Key Metrics to Monitor
| Metric | What It Tells You | Good Sign | Bad Sign |
|---|---|---|---|
| Training Loss | How well model fits training data | Decreasing smoothly | Not decreasing, or NaN |
| Validation Loss | Generalization ability | Decreasing, close to training loss | Increasing while training decreases (overfitting) |
| Accuracy | Classification performance | Increasing | Stuck or decreasing |
| Gradient Norm | Training health | Stable, moderate values (roughly 0.1-10, model dependent) | Very small (vanishing) or very large (exploding) |
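The loss and gradient-norm checks in the table can be automated with a few lines. This is a sketch under assumptions: gradients are stored in a `{name: array}` dict as in the Adam example, and the `low` and `high` thresholds are hypothetical defaults that would need tuning per model.

```python
import numpy as np


def global_grad_norm(grads):
    """L2 norm over all gradient arrays in a {name: array} dict."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads.values())))


def check_training_health(loss, grads, low=1e-6, high=1e3):
    """Return a list of warning strings; an empty list means nothing was flagged."""
    warnings = []
    if not np.isfinite(loss):
        warnings.append("loss is NaN or infinite")
    norm = global_grad_norm(grads)
    if not np.isfinite(norm) or norm > high:
        warnings.append(f"gradient norm {norm:.3g} looks like exploding gradients")
    elif norm < low:
        warnings.append(f"gradient norm {norm:.3g} looks like vanishing gradients")
    return warnings


grads = {"W1": np.array([3.0, 4.0])}
print(global_grad_norm(grads))
print(check_training_health(loss=0.42, grads=grads))
```

Calling `check_training_health` once per logging interval is usually enough; checking every step adds overhead without much extra information.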
Training Checklist
✅ Systematic Approach
Follow this checklist for successful training:
Pre-Training Checklist
- [ ] Data is properly preprocessed and normalized
- [ ] Train/validation/test splits are appropriate
- [ ] Network architecture is suitable for the task
- [ ] Weights are properly initialized (He/Xavier)
- [ ] Learning rate is reasonable (start with 0.001)
- [ ] Loss function is appropriate for the task
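For the initialization item in the checklist, a minimal sketch of the two schemes it names. It assumes the common variants: He initialization as a normal distribution with standard deviation √(2 / fan_in), and Xavier initialization as a uniform distribution with limit √(6 / (fan_in + fan_out)). The function names and layer sizes are illustrative.

```python
import numpy as np


def he_init(fan_in, fan_out, rng):
    """He (Kaiming) normal initialization, commonly paired with ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


def xavier_init(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization, commonly paired with tanh or sigmoid."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
W_he = he_init(512, 256, rng)
W_xavier = xavier_init(512, 256, rng)
print("He std:    ", W_he.std())
print("Xavier std:", W_xavier.std())
```

Both schemes scale the weights by the layer width so that activations and gradients keep a similar magnitude from layer to layer, which helps avoid vanishing or exploding gradients at the start of training.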
During Training Checklist
- [ ] Monitor training and validation loss
- [ ] Check for overfitting (validation loss increasing)
- [ ] Watch for vanishing/exploding gradients
- [ ] Save best model (lowest validation loss)
- [ ] Use early stopping if validation loss is not improving
- [ ] Adjust learning rate if loss is not decreasing
Post-Training Checklist
- [ ] Evaluate on test set (only once!)
- [ ] Compare train/val/test performance
- [ ] Check for overfitting or underfitting
- [ ] Document hyperparameters and results