Chapter 8: Training Tips & Best Practices

Practical strategies for successful neural network training

Learning Objectives

Understand learning rate selection and scheduling
Master regularization techniques (dropout, L2, early stopping)
Learn different optimization algorithms
Understand how to monitor training effectively
Develop a systematic approach to training neural networks
Recognize and fix common training problems

Training Neural Networks Successfully

The Art and Science of Training

Training neural networks requires balancing many hyperparameters and techniques. This chapter covers practical strategies that make the difference between a network that learns and one that doesn't.

Key Areas:

Learning Rate: Most critical hyperparameter
Regularization: Prevent overfitting
Optimization: Better algorithms than basic gradient descent
Monitoring: Know when to stop, what to adjust

Learning Rate Selection

The Most Important Hyperparameter

The learning rate controls how large a step each update takes during optimization. If it is too large, updates overshoot the minimum and the loss can diverge. If it is too small, training is very slow or stalls.

Learning Rate in Gradient Descent

\[W \leftarrow W - \eta \times \frac{\partial L}{\partial W}\]

Where η (eta) is the learning rate

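Typical Values (rough guides; the best value depends on the optimizer, architecture, and data):

Too Large (often > 0.1): Loss explodes or oscillates, training unstable
Good Range: 0.001 to 0.01 (common starting point)
Too Small (often < 0.0001): Training very slow, may stall before converging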
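To make the three regimes concrete, here is a minimal sketch on a toy problem. The loss L(w) = 10·w² and the three rates are assumptions chosen for illustration; real networks will have different thresholds.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize the toy loss L(w) = 10 * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 20.0 * w       # dL/dw for the toy loss
        w = w - lr * grad     # W <- W - eta * dL/dW
    return w

# Compare a rate that is too large, one in a workable range, and one that is too small
for lr in (0.15, 0.01, 0.0001):
    print(f"lr={lr}: |w| after 50 steps = {abs(gradient_descent(lr)):.3e}")
```

With the large rate each update overshoots and |w| grows; with the tiny rate w barely moves from its starting value in 50 steps.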

Learning Rate Schedules

Common strategies:

Fixed: Same rate throughout (simple but suboptimal)
Step Decay: Reduce by factor every N epochs
Exponential Decay: η_t = η₀ × decay^t
Cosine Annealing: Smooth decrease following cosine curve
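The schedules above can be sketched as plain functions of the epoch number. The default factors (`drop`, `every`, `decay`, `lr_min`) are illustrative assumptions, not recommended settings.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, decay=0.95):
    """eta_t = eta_0 * decay^t."""
    return lr0 * decay ** epoch

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Decrease smoothly from lr0 to lr_min along half a cosine period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)

for epoch in (0, 10, 50, 100):
    print(epoch,
          step_decay(0.01, epoch),
          exponential_decay(0.01, epoch),
          cosine_annealing(0.01, epoch, total_epochs=100))
```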

Learning Rate Finder

import numpy as np
import matplotlib.pyplot as plt

def find_learning_rate(model, train_data, train_with_lr,
                       start_lr=1e-8, end_lr=1.0, num_iterations=100):
    """
    Learning rate range test

    Strategy: Train with exponentially increasing learning rates,
    plot loss vs learning rate to find optimal range

    train_with_lr is a user-supplied function (not defined in this chapter)
    called as train_with_lr(model, train_data, lr, iterations=...) and
    expected to return the loss after those iterations.
    """
    learning_rates = np.logspace(np.log10(start_lr), np.log10(end_lr), num_iterations)
    losses = []

    for lr in learning_rates:
        # Train for a few iterations with this learning rate
        loss = train_with_lr(model, train_data, lr, iterations=10)
        losses.append(loss)

    # Plot on a log x-axis to find the optimal range
    # Optimal: steepest downward slope in loss curve
    plt.plot(learning_rates, losses)
    plt.xscale("log")
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
    plt.show()
    return learning_rates, losses

# Best practice: Start with learning rate finder, then use schedule
# Typical: Start at 10x lower than where loss starts increasing
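Reading the steepest downward slope off the curve can also be done numerically. The helper below is a hypothetical sketch (its name and the synthetic curve are made up for illustration); real range-test curves are noisy and usually need smoothing first.

```python
import numpy as np

def steepest_descent_lr(learning_rates, losses):
    """Return the learning rate where the loss falls fastest per decade of lr."""
    log_lrs = np.log10(np.asarray(learning_rates, dtype=float))
    losses = np.asarray(losses, dtype=float)
    slopes = np.gradient(losses, log_lrs)   # d(loss) / d(log10 lr)
    return float(learning_rates[int(np.argmin(slopes))])

# Synthetic curve for illustration only: flat, then falling, then diverging
lrs = np.logspace(-6, 0, 61)
x = np.log10(lrs)
fake_losses = 1.0 / (1.0 + np.exp(4.0 * (x + 3.0))) + 10.0 ** (2.0 * (x + 0.5))

print(steepest_descent_lr(lrs, fake_losses))
```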

Regularization Techniques

🛡️ Preventing Overfitting

Regularization techniques prevent neural networks from memorizing training data and help them generalize to new data.

1. Dropout

Dropout Formula

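\[\text{During training: } h_{\text{drop}} = \frac{h \odot \text{mask}}{1 - p} \qquad \text{During inference: } h_{\text{drop}} = h\]

Where p is the dropout probability and mask is a random binary vector whose entries are 1 with probability 1 - p

How It Works:

Randomly set some neurons to zero during training
Forces network to not rely on specific neurons
Scaling by 1/(1 - p) during training keeps the expected activation unchanged, so no rescaling is needed at test time (inverted dropout). The alternative is to skip the training-time scaling and multiply outputs by (1 - p) at test time; use one or the other, not both
Common values: p = 0.5 for hidden layers, p = 0.2 for input layer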
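A minimal NumPy sketch of the inverted-dropout variant, in which the 1/(1 - p) scaling is applied during training so that inference needs no further rescaling. The function name and the `rng` argument are assumptions made for this example.

```python
import numpy as np

def dropout_forward(h, p, training, rng=None):
    """Inverted dropout: drop units with probability p while training."""
    if not training or p == 0.0:
        return h                              # inference: activations pass through
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p           # each unit kept with probability 1 - p
    return h * mask / (1.0 - p)               # rescale so the expected value is unchanged

h = np.ones((1000, 100))
train_out = dropout_forward(h, p=0.5, training=True, rng=np.random.default_rng(0))
test_out = dropout_forward(h, p=0.5, training=False)
print(train_out.mean(), test_out.mean())
```

Because surviving activations are scaled up, the mean activation during training stays close to the mean at inference.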

2. L2 Regularization (Weight Decay)

L2 Regularization

\[L_{\text{total}} = L_{\text{data}} + \lambda \times \sum ||W||^2\]

Where λ (lambda) is the regularization strength

Effect:

Penalizes large weights
Encourages simpler models
Prevents overfitting
Typical λ: 0.0001 to 0.01
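As a sketch of how the penalty enters training: the extra loss term is λ times the sum of squared weights, and its gradient with respect to each weight matrix is 2λW. The function names below are assumptions for this example.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda * sum of squared entries over all weight arrays."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam):
    """Gradient of lam * ||W||^2 with respect to W; add this to the data gradient."""
    return 2.0 * lam * W

weights = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([1.0, -1.0])]
data_loss = 0.5                                  # placeholder value for illustration
total_loss = data_loss + l2_penalty(weights, lam=0.01)
print(total_loss)
```

Biases are often left out of the penalty; whether to include them is a design choice.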

3. Early Stopping

Stop training when validation loss stops improving.

Monitor validation loss during training
If no improvement for N epochs (patience), stop
Use best model (lowest validation loss)
Prevents overfitting by stopping before memorization
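The patience rule above can be sketched as a small helper class. The class name, the `min_delta` parameter, and the validation losses are hypothetical; in practice you would also save the model weights whenever a new best loss is recorded.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def step(self, val_loss, epoch):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch           # this is where the best model would be saved
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.75]   # made-up values
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        break
print(f"stopped at epoch {epoch}, best epoch {stopper.best_epoch}")
```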

Optimization Algorithms

🚀 Beyond Basic Gradient Descent

Modern optimizers use adaptive learning rates and momentum to train faster and more reliably.

1. Momentum

Momentum Update

\[\begin{aligned} v_t &= \beta \times v_{t-1} + (1 - \beta) \times \nabla L \\ W &\leftarrow W - \eta \times v_t \end{aligned}\]

Where β is momentum coefficient (typically 0.9)

Benefits:

Accumulates gradient over time (like momentum in physics)
Smooths out noisy gradients
Faster convergence, especially in narrow valleys
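A minimal sketch of one momentum update, using the exponential-moving-average form with the (1 - β) factor. The function name and the example values are assumptions for illustration.

```python
import numpy as np

def momentum_step(W, v, grad, lr=0.01, beta=0.9):
    """One momentum update: v is a running average of past gradients."""
    v = beta * v + (1.0 - beta) * grad    # v_t = beta * v_{t-1} + (1 - beta) * grad
    W = W - lr * v                        # W <- W - eta * v_t
    return W, v

W = np.array([1.0])
v = np.zeros_like(W)
W, v = momentum_step(W, v, grad=np.array([2.0]))
print(W, v)
```

The velocity `v` must be kept between steps, one array per parameter, and starts at zero.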

2. Adam (Adaptive Moment Estimation)

Adam Algorithm

\[\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \nabla L \quad (\text{first moment}) \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L)^2 \quad (\text{second moment}) \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \quad (\text{bias correction}) \\ \hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \quad (\text{bias correction}) \\ W &\leftarrow W - \eta \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned}\]

Default Parameters:

β₁ = 0.9: Decay rate for the first moment (momentum)
β₂ = 0.999: Decay rate for the second moment (squared gradients)
ε = 1e-8: Small constant (prevents division by zero)
η = 0.001: Learning rate (often works well as-is)

Adam Optimizer Implementation

import numpy as np

class AdamOptimizer:
    """Adam (Adaptive Moment Estimation) Optimizer"""
    
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.t = 0  # Time step
        
        # Per-parameter moments
        self.m = {}  # First moment
        self.v = {}  # Second moment
    
    def update(self, params, grads):
        """Update parameters using Adam"""
        self.t += 1
        
        for key in params.keys():
            # Initialize moments if needed
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            
            # Update biased first moment
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            
            # Update biased second moment
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
            
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        
        return params

# Usage
optimizer = AdamOptimizer(learning_rate=0.001)
# Use in training loop: params = optimizer.update(params, grads)

Monitoring Training

What to Watch

Effective monitoring helps you understand what's happening during training and when to make adjustments.

📈 Key Metrics to Monitor

Metric	What It Tells You	Good Sign	Bad Sign
Training Loss	How well model fits training data	Decreasing smoothly	Not decreasing, or NaN
Validation Loss	Generalization ability	Decreasing, close to training loss	Increasing while training decreases (overfitting)
Accuracy	Classification performance	Increasing	Stuck or decreasing
Gradient Norm	Training health	Reasonable values (0.1-10)	Very small (vanishing) or very large (exploding)
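The gradient norm in the table can be computed from a dictionary of gradient arrays, such as the `grads` passed to the Adam optimizer above. This is a sketch only: the function names and the thresholds are illustrative assumptions, and useful ranges vary by model.

```python
import numpy as np

def global_grad_norm(grads):
    """L2 norm over all gradient arrays in a {name: array} dict."""
    return float(np.sqrt(sum(np.sum(np.square(g)) for g in grads.values())))

def gradient_health(norm, vanishing_below=1e-6, exploding_above=1e3):
    """Classify a gradient norm with rough, adjustable thresholds."""
    if not np.isfinite(norm):
        return "non-finite"
    if norm < vanishing_below:
        return "possibly vanishing"
    if norm > exploding_above:
        return "possibly exploding"
    return "ok"

grads = {"W1": np.array([3.0, 0.0]), "b1": np.array([4.0])}
norm = global_grad_norm(grads)
print(norm, gradient_health(norm))
```

Logging this value every few iterations alongside the losses makes vanishing or exploding gradients visible early.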

Training Checklist

✅ Systematic Approach

Follow this checklist for successful training:

Pre-Training Checklist

[ ] Data is properly preprocessed and normalized
[ ] Train/validation/test splits are appropriate
[ ] Network architecture is suitable for the task
[ ] Weights are properly initialized (He/Xavier)
[ ] Learning rate is reasonable (start with 0.001)
[ ] Loss function is appropriate for the task

During Training Checklist

[ ] Monitor training and validation loss
[ ] Check for overfitting (validation loss increasing)
[ ] Watch for vanishing/exploding gradients
[ ] Save best model (lowest validation loss)
[ ] Use early stopping if validation not improving
[ ] Adjust learning rate if loss not decreasing

Post-Training Checklist

[ ] Evaluate on test set (only once!)
[ ] Compare train/val/test performance
[ ] Check for overfitting or underfitting
[ ] Document hyperparameters and results

Test Your Understanding

Question 1: What is the most critical hyperparameter in neural network training?

A) Learning rate

B) Number of layers

C) Batch size

D) Activation function

Question 2: What does dropout do during training?

A) Randomly sets some neurons to zero to prevent overfitting

B) Removes layers from the network

C) Increases learning rate

D) Stops training early

Question 3: What is early stopping?

A) Stopping training when loss reaches zero

B) Stopping training when validation loss stops improving

C) Using a smaller learning rate

D) Removing regularization

Question 4: How do you choose the right learning rate?

A) Start with a reasonable value (like 0.001) and monitor the loss curve: if the loss decreases slowly, increase the rate; if it oscillates or increases, decrease it. Use learning rate scheduling or adaptive optimizers

B) Always use 0.1

C) Use the largest possible

D) Random value

Question 5: What is the purpose of batch normalization?

A) Batch normalization normalizes layer inputs by subtracting mean and dividing by standard deviation, stabilizing training, allowing higher learning rates, and reducing internal covariate shift

B) To increase batch size

C) To decrease computation

D) To add noise

Question 6: How does dropout prevent overfitting?

A) Dropout randomly sets some neurons to zero during training, preventing the network from relying too heavily on specific neurons and forcing it to learn more robust features

B) By removing layers

C) By using less data

D) By increasing parameters

Question 7: What is the difference between training loss and validation loss?

A) Training loss measures error on the data used for training; validation loss measures error on held-out data. A large gap indicates overfitting; both being high indicates underfitting

B) They're always the same

C) Validation is always lower

D) No difference

Question 8: How do you debug a network that's not learning?

A) Check gradient flow (should not be zero), verify data preprocessing, check loss function, inspect weight initialization, verify learning rate, check for bugs in forward/backward pass, ensure data is being fed correctly

B) Just wait longer

C) Add more layers

D) Use more data

Question 9: What is gradient clipping and when do you use it?

A) Gradient clipping caps gradient magnitude to prevent exploding gradients, especially useful in RNNs and deep networks where gradients can grow exponentially

B) To increase gradients

C) To remove gradients

D) Not needed

Question 10: How do you choose the right optimizer?

A) Adam is a good default (adaptive learning rate, works well for most cases). SGD with momentum for fine-tuning. RMSprop for RNNs. Try different optimizers and compare validation performance

B) Always use SGD

C) Random choice

D) They're all the same

Question 11: What is the purpose of a validation set?

A) Validation set is used to tune hyperparameters and monitor training progress without touching the test set, helping detect overfitting and guide model selection

B) For final testing

C) For training

D) Not needed

Question 12: How would you improve a network that's overfitting?

A) Add dropout, use more data or data augmentation, reduce model capacity, add regularization (L1/L2), use early stopping, simplify architecture, reduce training time

B) Add more layers

C) Increase learning rate

D) Use less data