Chapter 8: Training Tips & Best Practices

Practical strategies for successful neural network training

Learning Objectives

  • Understand learning rate selection and scheduling
  • Master regularization techniques (dropout, L2, early stopping)
  • Learn different optimization algorithms
  • Understand how to monitor training effectively
  • Develop a systematic approach to training neural networks
  • Recognize and fix common training problems

Training Neural Networks Successfully

The Art and Science of Training

Training neural networks requires balancing many hyperparameters and techniques. This chapter covers practical strategies that make the difference between a network that learns and one that doesn't.

Key Areas:

  • Learning Rate: Most critical hyperparameter
  • Regularization: Prevent overfitting
  • Optimization: Better algorithms than basic gradient descent
  • Monitoring: Know when to stop, what to adjust

Learning Rate Selection

The Most Important Hyperparameter

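The learning rate controls the size of each step taken during optimization. If it is too large, updates overshoot the minimum and the loss can diverge. If it is too small, training takes very long or stalls on plateaus and in poor local minima.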

Learning Rate in Gradient Descent

W ← W - η × (∂L/∂W)

Where η (eta) is the learning rate

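Typical Values (rough guides; the right range depends on the optimizer, model, and batch size):
  • Too Large (often > 0.1): Loss explodes, training unstable
  • Good Range: 0.001 to 0.01 (common starting point)
  • Too Small (often < 0.0001): Training very slow, may not converge in the available time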
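
The effect of the step size can be seen on a toy problem. The sketch below applies the update rule above to f(w) = w², a function chosen here purely for illustration. The three rates are picked to show the three regimes on this particular function; they are not the thresholds listed above, which refer to real networks.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # df/dw for f(w) = w**2
        w = w - lr * grad    # W <- W - eta * dL/dW
    return w

# Too large (diverges), reasonable (converges), too small (barely moves)
for lr in (1.1, 0.1, 0.0001):
    w_final = gradient_descent(lr)
    print(f"lr={lr:<7} final |w| = {abs(w_final):.6g}")
```

Each update multiplies w by (1 - 2η), so the iterate shrinks toward the minimum only when that factor has magnitude below 1, and shrinks very slowly when η is tiny.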

Learning Rate Schedules

Common strategies:

  • Fixed: Same rate throughout (simple, but often suboptimal)
  • Step Decay: Reduce by a constant factor every N epochs
  • Exponential Decay: η_t = η₀ × decay^t, where t is the epoch and 0 < decay < 1
  • Cosine Annealing: Smooth decrease from η₀ toward a minimum rate, following half a cosine curve
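
The schedules listed above can be sketched as small functions of the epoch number. The function names and the default values (`drop`, `epochs_per_drop`, `decay`, `lr_min`) are illustrative assumptions, not values prescribed by this chapter.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, decay=0.95):
    """eta_t = eta_0 * decay**t, with t counted in epochs."""
    return lr0 * decay ** epoch

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Decrease smoothly from lr0 to lr_min along half a cosine period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)

# Compare the three schedules at a few epochs of a 100-epoch run
for epoch in (0, 10, 50, 100):
    print(epoch,
          step_decay(0.01, epoch),
          exponential_decay(0.01, epoch),
          cosine_annealing(0.01, epoch, total_epochs=100))
```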

Learning Rate Finder

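import numpy as np
import matplotlib.pyplot as plt

def find_learning_rate(model, train_data, start_lr=1e-8, end_lr=1.0, num_iterations=100):
    """
    Learning rate range test

    Strategy: Train with exponentially increasing learning rates,
    plot loss vs learning rate to find optimal range

    Assumes a user-supplied helper train_with_lr(model, data, lr, iterations)
    that runs a few updates at the given rate and returns the resulting loss.
    The model's weights are modified by the test, so restore a saved copy
    before starting real training.
    """
    learning_rates = np.logspace(np.log10(start_lr), np.log10(end_lr), num_iterations)
    losses = []

    for lr in learning_rates:
        # Train for a few iterations with this learning rate
        loss = train_with_lr(model, train_data, lr, iterations=10)
        losses.append(loss)
        # Stop once the loss is NaN or infinite; larger rates will not recover
        if not np.isfinite(loss):
            break

    # Keep only the rates that were actually tried
    learning_rates = learning_rates[:len(losses)]

    # Plot loss against learning rate on a log axis to find the useful range
    # Look for the region where the loss falls most steeply
    plt.semilogx(learning_rates, losses)
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
    plt.show()
    return learning_rates, losses

# Best practice: start with the learning rate finder, then use a schedule
# Typical: pick a rate about 10x lower than the point where the loss starts increasing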
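
The "10x lower" rule of thumb can be turned into a small helper that reads the curve returned by the range test. This is only a sketch: `suggest_learning_rate`, its smoothing window, and the `safety_factor` default are assumptions made for illustration, and the helper expects all losses to be finite.

```python
import numpy as np

def suggest_learning_rate(learning_rates, losses, smooth_window=5, safety_factor=10.0):
    """Pick a rate `safety_factor` times below the point where the loss bottoms out."""
    lrs = np.asarray(learning_rates, dtype=float)
    losses = np.asarray(losses, dtype=float)

    # A moving average suppresses batch-to-batch noise in the loss curve
    kernel = np.ones(smooth_window) / smooth_window
    smoothed = np.convolve(losses, kernel, mode="valid")

    # mode="valid" drops (smooth_window - 1) points, so shift the index
    # back to the center of the window that produced the minimum
    offset = (smooth_window - 1) // 2
    turning_point = int(np.argmin(smoothed)) + offset

    return lrs[turning_point] / safety_factor

# Synthetic curve whose loss is lowest at lr = 1e-2 and rises afterwards
lrs = np.logspace(-6, 0, 61)
losses = (np.log10(lrs) + 2.0) ** 2
print(suggest_learning_rate(lrs, losses))
```

The loss after the minimum is where it "starts increasing", so dividing that rate by ten lands in the region where the loss was still falling quickly.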

Regularization Techniques

🛡️ Preventing Overfitting

Regularization techniques prevent neural networks from memorizing training data and help them generalize to new data.

1. Dropout

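Dropout Formula

During Training:
h_drop = (h ⊙ mask) / (1 - p)

During Inference:
h_drop = h

Where p is the dropout probability and mask is a random binary vector whose entries are 1 with probability (1 - p)

How It Works:
  • Randomly set a fraction p of neurons to zero during training
  • Forces the network not to rely on specific neurons
  • Dividing by (1 - p) during training keeps the expected activation unchanged, so no scaling is needed at test time (inverted dropout). The original formulation instead skips the training-time division and scales outputs by (1 - p) at test time; use one convention, never both
  • Common values: p = 0.5 for hidden layers, p = 0.2 for input layer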
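
A minimal sketch of the inverted dropout convention, in which activations are divided by (1 - p) during training and passed through unchanged at inference. The function name `dropout_forward` and its arguments are illustrative assumptions.

```python
import numpy as np

def dropout_forward(h, p, training, rng=None):
    """Inverted dropout: drop with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h  # inference: activations are used unchanged
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p   # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((1000, 100))
train_out = dropout_forward(h, p=0.5, training=True, rng=np.random.default_rng(0))
test_out = dropout_forward(h, p=0.5, training=False)

# Survivors are scaled to 2.0, so the mean stays close to the original 1.0
print("train mean:", train_out.mean())
print("test mean: ", test_out.mean())
```

Because the rescaling happens during training, the expected value of each activation is the same in both modes, which is why the inference path needs no correction.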

2. L2 Regularization (Weight Decay)

L2 Regularization

L_total = L_data + λ × Σ||W||²

Where λ (lambda) is the regularization strength

Effect:
  • Penalizes large weights
  • Encourages simpler models
  • Reduces overfitting by discouraging the network from fitting noise with large weights
  • Typical λ: 0.0001 to 0.01
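
In code, the penalty adds λ × Σ||W||² to the loss and, by differentiating that term, 2λW to each weight gradient. The sketch below assumes weights and gradients are stored in dictionaries of NumPy arrays keyed by layer name; the function names and that layout are assumptions for illustration.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda * sum of squared entries over all weight arrays."""
    return lam * sum(np.sum(W ** 2) for W in weights.values())

def add_l2_gradients(grads, weights, lam):
    """Add d/dW [lambda * ||W||^2] = 2 * lambda * W to each weight gradient."""
    out = dict(grads)  # entries without a matching weight (e.g. biases) pass through
    for key, W in weights.items():
        out[key] = grads[key] + 2.0 * lam * W
    return out

weights = {"W1": np.array([[1.0, 2.0], [3.0, 4.0]])}
grads = {"W1": np.zeros((2, 2)), "b1": np.zeros(2)}

data_loss = 0.25  # placeholder value standing in for L_data
total_loss = data_loss + l2_penalty(weights, lam=0.01)
print("L_total:", total_loss)
print(add_l2_gradients(grads, weights, lam=0.01)["W1"])
```

Passing only the weight matrices in `weights` leaves biases unpenalized, which is a common convention but a design choice rather than a requirement.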

3. Early Stopping

Stop training when validation loss stops improving.

  • Monitor validation loss during training
  • If no improvement for N epochs (patience), stop
  • Restore the best model (the checkpoint with the lowest validation loss), not the final one
  • Prevents overfitting by stopping before memorization
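
The patience logic above can be sketched as a small helper class. `EarlyStopping`, its `min_delta` argument, and the validation losses in the demo are illustrative assumptions; saving and restoring model weights is left to the caller.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch     # the caller should save a checkpoint here
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Made-up validation losses: improvement stops after epoch 2
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        print(f"stop at epoch {epoch}, best epoch was {stopper.best_epoch}")
        break
```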

Optimization Algorithms

🚀 Beyond Basic Gradient Descent

Modern optimizers use adaptive learning rates and momentum to train faster and more reliably.

1. Momentum

Momentum Update

v_t = β × v_{t-1} + (1 - β) × ∇L
W ← W - η × v_t

Where β is the momentum coefficient (typically 0.9) and v_t is an exponential moving average of past gradients, starting from v_0 = 0

Benefits:
  • Accumulates gradient over time (like momentum in physics)
  • Smooths out noisy gradients
  • Faster convergence, especially in narrow valleys
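
The smoothing effect can be checked numerically. The sketch below implements the update above and feeds it synthetic gradients, a constant 1.0 plus Gaussian noise; the function name and the data are assumptions for illustration.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * grad, then W <- W - lr * v_t."""
    v = beta * v + (1.0 - beta) * grad
    w = w - lr * v
    return w, v

# Synthetic gradients: true value 1.0 buried in unit-variance noise
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(0.0, 1.0, size=500)

w, v = 0.0, 0.0
velocities = []
for g in noisy_grads:
    w, v = momentum_step(w, v, g)
    velocities.append(v)

# Skip the first 50 steps while the running average warms up from zero
print("std of raw gradients:", np.std(noisy_grads[50:]))
print("std of velocity:     ", np.std(velocities[50:]))
```

The velocity fluctuates much less than the raw gradient while keeping the same average direction, which is what lets momentum take steadier steps along narrow valleys.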

2. Adam (Adaptive Moment Estimation)

Adam Algorithm

m_t = β₁ × m_{t-1} + (1 - β₁) × ∇L (first moment)
v_t = β₂ × v_{t-1} + (1 - β₂) × (∇L)² (second moment)
m̂_t = m_t / (1 - β₁^t) (bias correction)
v̂_t = v_t / (1 - β₂^t) (bias correction)
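W ← W - η × m̂_t / (√v̂_t + ε)

Squares, square roots, and divisions are element-wise, so each parameter gets its own effective step size.

Default Parameters:
  • β₁ = 0.9: Decay rate of the first moment (momentum)
  • β₂ = 0.999: Decay rate of the second moment (uncentered variance)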
  • ε = 1e-8: Small constant (prevents division by zero)
  • η = 0.001: Learning rate (often works well as-is)
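
A short worked computation shows why the bias-correction terms matter. Because m and v start at zero, their first values are shrunk by (1 - β₁) and (1 - β₂). The sketch below compares the first update with and without correction for an arbitrary example gradient; the function name is an assumption for illustration.

```python
import math

def first_adam_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (uncorrected, corrected) size of the first Adam update, t = 1."""
    m = (1.0 - beta1) * g           # m_1, starting from m_0 = 0
    v = (1.0 - beta2) * g ** 2      # v_1, starting from v_0 = 0

    m_hat = m / (1.0 - beta1 ** 1)  # bias correction recovers g
    v_hat = v / (1.0 - beta2 ** 1)  # bias correction recovers g**2

    uncorrected = lr * m / (math.sqrt(v) + eps)
    corrected = lr * m_hat / (math.sqrt(v_hat) + eps)
    return uncorrected, corrected

uncorrected, corrected = first_adam_step(g=4.0)
print("without bias correction:", uncorrected)
print("with bias correction:   ", corrected)
```

With the defaults, the corrected first step has magnitude close to η regardless of the gradient's scale, while the uncorrected one is about 3.16 times larger, since (1 - β₁) / √(1 - β₂) = 0.1 / √0.001.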

Adam Optimizer Implementation

import numpy as np

class AdamOptimizer:
    """Adam (Adaptive Moment Estimation) Optimizer"""
    
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.t = 0  # Time step
        
        # Per-parameter moments
        self.m = {}  # First moment
        self.v = {}  # Second moment
    
    def update(self, params, grads):
        """Update parameters using Adam"""
        self.t += 1
        
        for key in params.keys():
            # Initialize moments if needed
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            
            # Update biased first moment
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            
            # Update biased second moment
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return params

# Usage
optimizer = AdamOptimizer(learning_rate=0.001)
# Use in training loop: params = optimizer.update(params, grads)

Monitoring Training

What to Watch

Effective monitoring helps you understand what's happening during training and when to make adjustments.

📈 Key Metrics to Monitor

Metric          | What It Tells You                 | Good Sign                          | Bad Sign
Training Loss   | How well model fits training data | Decreasing smoothly                | Not decreasing, or NaN
Validation Loss | Generalization ability            | Decreasing, close to training loss | Increasing while training decreases (overfitting)
Accuracy        | Classification performance        | Increasing                         | Stuck or decreasing
Gradient Norm   | Training health                   | Reasonable values (0.1-10)         | Very small (vanishing) or very large (exploding)
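
The gradient norm in the last row can be computed once per training step. The sketch below assumes gradients are stored as a dictionary of NumPy arrays; the function names are hypothetical, and the 0.1 and 10 thresholds are the rough range quoted in the table, not universal limits.

```python
import numpy as np

def global_grad_norm(grads):
    """L2 norm of all gradient arrays treated as one long vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads.values())))

def gradient_health(grads, low=0.1, high=10.0):
    """Classify the global gradient norm using rough, adjustable thresholds."""
    norm = global_grad_norm(grads)
    if not np.isfinite(norm):
        return "non-finite"
    if norm < low:
        return "possibly vanishing"
    if norm > high:
        return "possibly exploding"
    return "ok"

print(gradient_health({"W1": np.array([3.0, 4.0])}))      # norm 5.0
print(gradient_health({"W1": np.array([3e-4, 4e-4])}))    # norm 5e-4
print(gradient_health({"W1": np.array([300.0, 400.0])}))  # norm 500.0
```

Logging this value alongside the training and validation loss makes it easier to tell a stalled optimizer from an unstable one.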

Training Checklist

✅ Systematic Approach

Follow this checklist for successful training:

Pre-Training Checklist

  • [ ] Data is properly preprocessed and normalized
  • [ ] Train/validation/test splits are appropriate
  • [ ] Network architecture is suitable for the task
  • [ ] Weights are properly initialized (He/Xavier)
  • [ ] Learning rate is reasonable (start with 0.001)
  • [ ] Loss function is appropriate for the task

During Training Checklist

  • [ ] Monitor training and validation loss
  • [ ] Check for overfitting (validation loss increasing)
  • [ ] Watch for vanishing/exploding gradients
  • [ ] Save best model (lowest validation loss)
  • [ ] Use early stopping if validation loss is not improving
  • [ ] Adjust learning rate if loss is not decreasing

Post-Training Checklist

  • [ ] Evaluate on test set (only once!)
  • [ ] Compare train/val/test performance
  • [ ] Check for overfitting or underfitting
  • [ ] Document hyperparameters and results

Test Your Understanding

Question 1: What is the most critical hyperparameter in neural network training?

A) Learning rate
B) Number of layers
C) Batch size
D) Activation function

Question 2: What does dropout do during training?

A) Randomly sets some neurons to zero to prevent overfitting
B) Removes layers from the network
C) Increases learning rate
D) Stops training early

Question 3: What is early stopping?

A) Stopping training when loss reaches zero
B) Stopping training when validation loss stops improving
C) Using a smaller learning rate
D) Removing regularization

Question 4: How do you choose the right learning rate?

A) Start with a reasonable value (like 0.001) and monitor the loss curve: if the loss decreases slowly, increase the rate; if the loss oscillates or increases, decrease it. Use learning rate scheduling or adaptive optimizers
B) Always use 0.1
C) Use the largest possible
D) Random value

Question 5: What is the purpose of batch normalization?

A) Batch normalization normalizes layer inputs by subtracting mean and dividing by standard deviation, stabilizing training, allowing higher learning rates, and reducing internal covariate shift
B) To increase batch size
C) To decrease computation
D) To add noise

Question 6: How does dropout prevent overfitting?

A) Dropout randomly sets some neurons to zero during training, preventing the network from relying too heavily on specific neurons and forcing it to learn more robust features
B) By removing layers
C) By using less data
D) By increasing parameters

Question 7: What is the difference between training loss and validation loss?

A) Training loss measures error on the data used for training; validation loss measures error on held-out data. A large gap indicates overfitting, and both being high indicates underfitting
B) They're always the same
C) Validation is always lower
D) No difference

Question 8: How do you debug a network that's not learning?

A) Check gradient flow (should not be zero), verify data preprocessing, check loss function, inspect weight initialization, verify learning rate, check for bugs in forward/backward pass, ensure data is being fed correctly
B) Just wait longer
C) Add more layers
D) Use more data

Question 9: What is gradient clipping and when do you use it?

A) Gradient clipping caps gradient magnitude to prevent exploding gradients, especially useful in RNNs and deep networks where gradients can grow exponentially
B) To increase gradients
C) To remove gradients
D) Not needed

Question 10: How do you choose the right optimizer?

A) Adam is a good default (adaptive learning rate, works well for most cases). SGD with momentum for fine-tuning. RMSprop for RNNs. Try different optimizers and compare validation performance
B) Always use SGD
C) Random choice
D) They're all the same

Question 11: What is the purpose of a validation set?

A) Validation set is used to tune hyperparameters and monitor training progress without touching the test set, helping detect overfitting and guide model selection
B) For final testing
C) For training
D) Not needed

Question 12: How would you improve a network that's overfitting?

A) Add dropout, use more data or data augmentation, reduce model capacity, add regularization (L1/L2), use early stopping, simplify architecture, reduce training time
B) Add more layers
C) Increase learning rate
D) Use less data