Chapter 4: Backpropagation Algorithm

The Learning Mechanism - Understanding how neural networks learn by computing gradients and updating weights

Learning Objectives

  • Understand the chain rule and its role in backpropagation
  • Master loss functions and error computation
  • Learn the backward pass step-by-step
  • Understand gradient computation for each layer
  • Implement backpropagation from scratch
  • Understand vanishing and exploding gradients

What is Backpropagation?

The Learning Algorithm

Backpropagation (backward propagation of errors) is the algorithm that enables neural networks to learn. It computes gradients of the loss function with respect to each weight and bias, allowing the network to update its parameters and improve predictions.

The Process:

  1. Forward Pass: Compute predictions using current weights
  2. Compute Loss: Measure how wrong predictions are
  3. Backward Pass: Compute gradients (how to change weights)
  4. Update Weights: Adjust weights to reduce loss
  5. Repeat: Continue until the loss is minimized (see the minimal loop sketched below)
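
In code, the five steps above form one loop. Below is a minimal runnable sketch that fits a single weight w in the toy model y = w·x with a (1/2)·MSE loss; the model and numbers are purely illustrative, not this chapter's network.

import numpy as np

# Minimal instance of the five-step loop: fit y = w*x by gradient descent
X = np.array([1.0, 2.0, 3.0])
y_true = np.array([2.0, 4.0, 6.0])                  # true relationship: w = 2
w, lr = 0.0, 0.1

for step in range(50):
    y_pred = w * X                                  # 1. Forward pass
    loss = 0.5 * np.mean((y_pred - y_true)**2)      # 2. Compute loss
    grad = np.mean((y_pred - y_true) * X)           # 3. Backward pass: dL/dw
    w -= lr * grad                                  # 4. Update weight
    # 5. Repeat

print(f"learned w = {w:.3f}")  # approaches 2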

📚 Real-World Analogy: Learning to Throw

Think of learning to throw a ball into a basket:

  • Forward Pass: You throw the ball (make a prediction)
  • Loss: You see how far off you were (measure error)
  • Backpropagation: You figure out what to adjust (angle, force, etc.)
  • Update: You adjust your throwing technique
  • Repeat: Keep practicing until you get it right!

Key Insight: Backpropagation tells you exactly how to adjust each "muscle" (weight) to improve your "throw" (prediction)!

Why "Backward"?

Backpropagation works backward through the network because:

  • Error starts at output: We know the error at the final layer
  • Propagate backward: We compute how much each previous layer contributed to the error
  • Chain rule: Errors flow backward through the network using calculus
  • Efficient: One backward pass computes all gradients at once

The Chain Rule: Foundation of Backpropagation

🔗 Mathematical Foundation

The chain rule from calculus is the mathematical foundation of backpropagation. It allows us to compute derivatives of composite functions, which is exactly what neural networks are!

Chain Rule for Single Variable

If \(y = f(g(x))\), then:

\[\frac{dy}{dx} = \frac{df}{dg} \times \frac{dg}{dx}\]
Intuition:

To find how y changes with x, we need to know:

  • How y changes with g (df/dg)
  • How g changes with x (dg/dx)

Multiply them together!

Simple Example

Example: y = (2x + 1)³

Let g = 2x + 1, so y = g³

Using chain rule:

  • dy/dg = 3g² = 3(2x + 1)²
  • dg/dx = 2
  • dy/dx = (dy/dg) × (dg/dx) = 3(2x + 1)² × 2 = 6(2x + 1)²

Verification: Direct differentiation gives the same result!
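
As a quick numerical check, the sketch below compares the chain-rule result with a central finite-difference approximation of the derivative:

import numpy as np

# Verify dy/dx for y = (2x + 1)^3 at x = 1 using finite differences
f = lambda x: (2*x + 1)**3
x, eps = 1.0, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
analytic = 6 * (2*x + 1)**2  # the chain-rule result derived above
print(numeric, analytic)     # both ≈ 54.0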

Chain Rule for Multiple Variables

For neural networks, we need the multivariable chain rule:

If \(z = f(x, y)\) where \(x = g(t)\) and \(y = h(t)\), then:

\[\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}\]
In Neural Networks:
  • Each layer's output depends on previous layers
  • Error propagates backward through all paths
  • We sum contributions from all paths (sum rule; a two-path example follows below)
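
A small worked instance of the sum rule, as a sketch: let z = x·y with x = t² and y = t³, so z = t⁵ and dz/dt = 5t⁴. The contributions from the two paths (through x and through y) must add up to the same answer:

t = 2.0
x, y = t**2, t**3           # x = g(t), y = h(t)
dz_dx, dz_dy = y, x         # partial derivatives of z = x*y
dx_dt, dy_dt = 2*t, 3*t**2

# Sum the contributions from both paths
dz_dt = dz_dx * dx_dt + dz_dy * dy_dt
print(dz_dt, 5 * t**4)      # both = 80.0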

Chain Rule Visualization

import numpy as np

# Example: Computing gradient through a simple network
# y = f(g(h(x)))
# where h(x) = x², g(u) = 2u + 1, f(v) = v³

def h(x):
    return x**2

def g(u):
    return 2*u + 1

def f(v):
    return v**3

# Forward pass
x = 3
u = h(x)  # u = 9
v = g(u)  # v = 19
y = f(v)  # y = 6859

# Backward pass (chain rule)
df_dv = 3 * v**2  # = 3 * 19² = 1083
dg_du = 2
dh_dx = 2 * x     # = 2 * 3 = 6

# Chain rule: dy/dx = (df/dv) × (dg/du) × (dh/dx)
dy_dx = df_dv * dg_du * dh_dx
print(f"dy/dx = {dy_dx}")

# Verification: y = (2x² + 1)³
# dy/dx = 3(2x² + 1)² × 4x = 12x(2x² + 1)²
# At x=3: 12×3×(2×9+1)² = 36×19² = 12996 ✓

Loss Functions: Measuring Error

📏 What is a Loss Function?

A loss function measures how wrong our predictions are. It quantifies the difference between predicted and actual values. The goal of training is to minimize this loss.

Common Loss Functions

1. Mean Squared Error (MSE) - For Regression

\[L = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred},i} - y_{\text{true},i})^2\]
Properties:
  • Always positive: Squared terms ensure L ≥ 0
  • Penalizes large errors: Squaring amplifies big mistakes
  • Differentiable: Smooth curve, easy to optimize
  • Best for: Regression tasks (predicting continuous values); a quick numeric check follows below
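
A quick illustration of the "penalizes large errors" property: squaring means doubling an error quadruples its contribution, so the largest error dominates the loss.

import numpy as np

# The error of 4 contributes 16 of the total, dwarfing the smaller errors
errors = np.array([0.5, 1.0, 2.0, 4.0])
print(errors**2)             # [ 0.25  1.    4.   16.  ]
print(np.mean(errors**2))    # MSE ≈ 5.31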

2. Cross-Entropy Loss - For Classification

Binary Cross-Entropy:

\[L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{\text{true},i} \log(y_{\text{pred},i}) + (1-y_{\text{true},i}) \log(1-y_{\text{pred},i}) \right]\]

Multi-class Cross-Entropy:

\[L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{\text{true},i,c} \log(y_{\text{pred},i,c})\]
Why Cross-Entropy?
  • Works with probabilities: Designed for classification
  • Large penalty for confident wrong predictions: If y_pred ≈ 0 but y_true = 1, loss → ∞
  • Small penalty for correct predictions: If y_pred ≈ y_true, loss → 0
  • Best for: Classification tasks (a quick demonstration follows below)
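
A short sketch of the asymmetric penalty: for a positive example (y_true = 1), the per-sample loss is -log(y_pred), which is tiny when the model is confidently right and explodes when it is confidently wrong.

import numpy as np

# Per-sample BCE for y_true = 1 at different predicted probabilities
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_pred = {p:.2f} -> loss = {-np.log(p):.3f}")
# Confidently right (0.99) costs ~0.010; confidently wrong (0.01) costs ~4.605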

Loss Function Implementation

import numpy as np

def mse_loss(y_pred, y_true):
    """
    Mean Squared Error
    
    Parameters:
    y_pred: Predicted values (n_samples,)
    y_true: True values (n_samples,)
    """
    return np.mean((y_pred - y_true)**2)

def mse_loss_derivative(y_pred, y_true):
    """Derivative of MSE with respect to y_pred"""
    return 2 * (y_pred - y_true) / len(y_pred)

def binary_cross_entropy(y_pred, y_true, epsilon=1e-15):
    """
    Binary Cross-Entropy Loss
    
    Parameters:
    y_pred: Predicted probabilities (n_samples,)
    y_true: True labels (0 or 1) (n_samples,)
    epsilon: Small value to prevent log(0)
    """
    # Clip to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_derivative(y_pred, y_true, epsilon=1e-15):
    """Derivative of BCE with respect to y_pred"""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_pred)

def cross_entropy(y_pred, y_true, epsilon=1e-15):
    """
    Multi-class Cross-Entropy Loss
    
    Parameters:
    y_pred: Predicted probabilities (n_samples, n_classes)
    y_true: True labels (one-hot encoded) (n_samples, n_classes)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example usage
y_pred_reg = np.array([2.1, 3.5, 1.8])
y_true_reg = np.array([2.0, 3.0, 2.0])
print(f"MSE: {mse_loss(y_pred_reg, y_true_reg):.4f}")

y_pred_cls = np.array([0.9, 0.1, 0.8])
y_true_cls = np.array([1, 0, 1])
print(f"BCE: {binary_cross_entropy(y_pred_cls, y_true_cls):.4f}")

The Backward Pass: Computing Gradients

⬅️ Working Backward Through Layers

The backward pass computes gradients starting from the output layer and moving backward to the input layer. We use the chain rule to propagate errors through each layer.

Step-by-Step Backward Pass

1. Output Layer Gradient

Start with the loss gradient at the output:

\[\frac{\partial L}{\partial a^{(L)}} = \text{derivative of loss with respect to output}\]
For MSE Loss (using the 1/2 convention, chosen so the factor of 2 from squaring cancels when differentiating):
  • L = (1/2)(y_pred - y_true)²
  • ∂L/∂y_pred = y_pred - y_true (error signal!)

2. Gradient Through Activation

Apply chain rule through activation function:

\[\frac{\partial L}{\partial z^{(L)}} = \frac{\partial L}{\partial a^{(L)}} \times f'(z^{(L)})\]
Example with Sigmoid:
  • If a = σ(z), then ∂a/∂z = σ(z)(1 - σ(z))
  • ∂L/∂z = (∂L/∂a) × σ(z)(1 - σ(z)), as checked numerically below
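
A quick numerical sanity check of the sigmoid derivative identity σ'(z) = σ(z)(1 - σ(z)), as a sketch:

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
z, eps = 0.7, 1e-6

# Central finite difference vs. the closed-form derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)  # both ≈ 0.2217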

3. Gradient for Weights and Biases

Compute gradients for parameters:

\[\frac{\partial L}{\partial W^{(L)}} = \frac{\partial L}{\partial z^{(L)}} \times (a^{(L-1)})^T\] \[\frac{\partial L}{\partial b^{(L)}} = \sum \frac{\partial L}{\partial z^{(L)}}\]
Intuition:
  • Weight gradient: How much did each input contribute to the error?
  • Bias gradient: Sum of errors (bias affects all outputs equally)

4. Propagate to Previous Layer

Compute error for previous layer:

\[\frac{\partial L}{\partial a^{(L-1)}} = (W^{(L)})^T \times \frac{\partial L}{\partial z^{(L)}}\]
This becomes the error signal for layer L-1:
  • Weights transpose: "distribute" error back to previous layer
  • This error signal is used to compute gradients for layer L-1
  • Process repeats for all layers

Complete Example: 2-Layer Network

Network: 2 inputs → 3 hidden → 1 output

Forward Pass (already computed):

  • a⁽⁰⁾ = [0.5, 0.8] (input)
  • z⁽¹⁾ = [0.39, 0.62, 0.85] (hidden pre-activation)
  • a⁽¹⁾ = [0.39, 0.62, 0.85] (hidden activation, using ReLU)
  • z⁽²⁾ = 1.076 (output pre-activation)
  • a⁽²⁾ = 0.746 (output, using sigmoid)

True value: y_true = 1.0

Backward Pass:

  1. Output layer error: ∂L/∂a⁽²⁾ = a⁽²⁾ - y_true = 0.746 - 1.0 = -0.254
  2. Through sigmoid: ∂L/∂z⁽²⁾ = -0.254 × 0.746(1-0.746) = -0.254 × 0.190 = -0.048
  3. Weight gradient: ∂L/∂W⁽²⁾ = -0.048 × a⁽¹⁾ = -0.048 × [0.39, 0.62, 0.85]
  4. Bias gradient: ∂L/∂b⁽²⁾ = -0.048
  5. Propagate to hidden: ∂L/∂a⁽¹⁾ = W⁽²⁾ᵀ × (-0.048); assuming example output weights W⁽²⁾ = [0.4, 0.5, 0.6], this gives [-0.019, -0.024, -0.029]
  6. Continue for layer 1... (the sketch below reproduces these numbers)
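
The sketch below reproduces these numbers, assuming the output weights W⁽²⁾ = [0.4, 0.5, 0.6] used in step 5 (the hidden-layer weights are not needed for these steps):

import numpy as np

a1 = np.array([0.39, 0.62, 0.85])    # hidden activations (ReLU, all positive)
a2, y_true = 0.746, 1.0              # output activation and target
W2 = np.array([0.4, 0.5, 0.6])       # assumed output-layer weights

dL_da2 = a2 - y_true                 # step 1: -0.254
dL_dz2 = dL_da2 * a2 * (1 - a2)      # step 2: ≈ -0.048 (through sigmoid)
grad_W2 = dL_dz2 * a1                # step 3: weight gradients
grad_b2 = dL_dz2                     # step 4: bias gradient
dL_da1 = W2 * dL_dz2                 # step 5: error for the hidden layer
dL_dz1 = dL_da1 * (a1 > 0)           # through ReLU (z1 = a1 here, all positive)
print(grad_W2, grad_b2, dL_dz1)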

Gradient Computation: The Math

Complete Gradient Formulas

For a network with L layers:

Output Layer (L):

\[\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot f'^{(L)}(z^{(L)})\]

Where ⊙ is element-wise multiplication

Hidden Layers (l = L-1, L-2, ..., 1):

\[\delta^{(l)} = ((W^{(l+1)})^T \times \delta^{(l+1)}) \odot f'^{(l)}(z^{(l)})\]

Weight Gradients:

\[\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \times (a^{(l-1)})^T\]

Bias Gradients:

\[\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} \quad \text{(sum over batch dimension)}\]
Notation:
  • δ⁽ˡ⁾: Error signal (gradient w.r.t. pre-activation) for layer l
  • f'⁽ˡ⁾: Derivative of activation function for layer l
  • ⊙: Element-wise (Hadamard) product

Backpropagation Implementation

import numpy as np

class NeuralNetwork:
    """Neural Network with Backpropagation"""
    
    def __init__(self, layer_sizes, activation='relu'):
        self.layer_sizes = layer_sizes
        self.activation = activation
        self.weights = []
        self.biases = []
        
        # Initialize weights (He initialization for ReLU)
        for i in range(len(layer_sizes) - 1):
            W = np.random.randn(layer_sizes[i+1], layer_sizes[i]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((layer_sizes[i+1], 1))
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z):
        """Activation function"""
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            z = np.clip(z, -250, 250)
            return 1 / (1 + np.exp(-z))
        return z
    
    def _activate_derivative(self, z):
        """Derivative of activation function"""
        if self.activation == 'relu':
            return (z > 0).astype(float)
        elif self.activation == 'sigmoid':
            s = self._activate(z)
            return s * (1 - s)
        return np.ones_like(z)
    
    def forward(self, X):
        """Forward propagation"""
        activations = [X]
        z_values = []
        current = X
        
        for W, b in zip(self.weights, self.biases):
            z = np.dot(W, current) + b
            z_values.append(z)
            a = self._activate(z)
            activations.append(a)
            current = a
        
        return activations, z_values
    
    def backward(self, X, y_true, activations, z_values):
        """
        Backpropagation
        
        Returns gradients for weights and biases
        """
        m = X.shape[1]  # batch size
        grads_W = []
        grads_b = []
        
        # Output layer error: dL/da for the (1/2)·MSE convention,
        # so the factor of 2 from squaring cancels
        a_L = activations[-1]
        delta = a_L - y_true
        
        # Backpropagate through layers
        for l in range(len(self.weights) - 1, -1, -1):
            # Gradient through activation
            delta = delta * self._activate_derivative(z_values[l])
            
            # Weight and bias gradients
            grad_W = np.dot(delta, activations[l].T) / m
            grad_b = np.sum(delta, axis=1, keepdims=True) / m
            
            grads_W.insert(0, grad_W)
            grads_b.insert(0, grad_b)
            
            # Propagate error to previous layer (if not first layer)
            if l > 0:
                delta = np.dot(self.weights[l].T, delta)
        
        return grads_W, grads_b
    
    def update_weights(self, grads_W, grads_b, learning_rate=0.01):
        """Update weights using gradients"""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]
    
    def train_step(self, X, y_true, learning_rate=0.01):
        """One training step: forward + backward + update"""
        activations, z_values = self.forward(X)
        grads_W, grads_b = self.backward(X, y_true, activations, z_values)
        self.update_weights(grads_W, grads_b, learning_rate)
        
        # Return loss
        loss = np.mean((activations[-1] - y_true)**2)
        return loss

# Example usage
network = NeuralNetwork([2, 3, 1], activation='sigmoid')
X = np.array([[0.5, 0.8]]).T
y = np.array([[1.0]])

# Training loop
for epoch in range(100):
    loss = network.train_step(X, y, learning_rate=0.1)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

Weight Updates: Gradient Descent

⬇️ Minimizing the Loss

Once we have gradients, we update weights to reduce the loss. We move in the direction opposite to the gradient (gradient descent).

Gradient Descent Update Rule

\[W \leftarrow W - \eta \times \frac{\partial L}{\partial W}\] \[b \leftarrow b - \eta \times \frac{\partial L}{\partial b}\]

Where η (eta) is the learning rate

Why Subtract?
  • Gradient points uphill: Direction of steepest increase
  • We want to go downhill: Direction of steepest decrease
  • Solution: Move in opposite direction (subtract gradient)

📚 Analogy: Finding the Bottom of a Valley

Imagine you're blindfolded in a valley and want to reach the bottom:

  • Loss function: Height of the valley (we want to minimize it)
  • Gradient: Direction of steepest uphill (we feel the slope)
  • Gradient descent: Walk in opposite direction (downhill)
  • Learning rate: Step size (how big steps you take)

Too small steps: Takes forever to reach bottom

Too large steps: Might overshoot and miss the bottom
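
The step-size trade-off is easy to see on a one-dimensional loss. The sketch below minimizes the illustrative function f(w) = (w - 3)², whose gradient is 2(w - 3), with two different learning rates:

# Gradient descent on f(w) = (w - 3)^2 with a small and a large step size
for lr in [0.1, 1.1]:
    w = 0.0
    for _ in range(10):
        w -= lr * 2 * (w - 3)
    print(f"lr = {lr}: w = {w:.3f}")
# lr = 0.1 converges toward the minimum at w = 3;
# lr = 1.1 overshoots further each step and diverges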

Complete Training Loop

import numpy as np

def train_network(network, X_train, y_train, epochs=1000, learning_rate=0.01):
    """
    Complete training loop
    
    Parameters:
    network: NeuralNetwork instance
    X_train: Training inputs (n_features, n_samples)
    y_train: Training targets (n_outputs, n_samples)
    epochs: Number of training iterations
    learning_rate: Step size for gradient descent
    """
    losses = []
    
    for epoch in range(epochs):
        # Forward pass
        activations, z_values = network.forward(X_train)
        
        # Compute loss
        loss = np.mean((activations[-1] - y_train)**2)
        losses.append(loss)
        
        # Backward pass
        grads_W, grads_b = network.backward(X_train, y_train, activations, z_values)
        
        # Update weights
        network.update_weights(grads_W, grads_b, learning_rate)
        
        # Print progress
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.6f}")
    
    return losses

# Example: Learn XOR function
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])  # shape (2, 4): each column is one sample
y = np.array([[0, 1, 1, 0]])  # (1, 4) - XOR output

network = NeuralNetwork([2, 4, 1], activation='sigmoid')
losses = train_network(network, X, y, epochs=1000, learning_rate=0.5)

# Test
predictions = network.forward(X)[0][-1]
print("\nPredictions:", predictions.flatten())
print("True values:", y.flatten())

Complete Implementation

Full Backpropagation Implementation

import numpy as np

class FeedforwardNetwork:
    """Complete Feedforward Network with Backpropagation"""
    
    def __init__(self, layer_sizes, activation='relu', init_method='he'):
        self.layer_sizes = layer_sizes
        self.activation = activation
        self.weights = []
        self.biases = []
        
        # Initialize
        for i in range(len(layer_sizes) - 1):
            n_in, n_out = layer_sizes[i], layer_sizes[i+1]
            if init_method == 'he':
                W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
            else:
                W = np.random.randn(n_out, n_in) * 0.01
            b = np.zeros((n_out, 1))
            self.weights.append(W)
            self.biases.append(b)
    
    def _activate(self, z):
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            z = np.clip(z, -250, 250)
            return 1 / (1 + np.exp(-z))
        return z
    
    def _activate_derivative(self, z):
        if self.activation == 'relu':
            return (z > 0).astype(float)
        elif self.activation == 'sigmoid':
            s = self._activate(z)
            return s * (1 - s)
        return np.ones_like(z)
    
    def forward(self, X):
        """Forward propagation"""
        activations = [X]
        z_values = []
        current = X
        
        for W, b in zip(self.weights, self.biases):
            z = np.dot(W, current) + b
            z_values.append(z)
            a = self._activate(z)
            activations.append(a)
            current = a
        
        return activations, z_values
    
    def backward(self, X, y_true, activations, z_values, loss='mse'):
        """Backpropagation"""
        m = X.shape[1]  # batch size
        grads_W = []
        grads_b = []
        
        # Output layer error (depends on loss function)
        a_L = activations[-1]
        if loss == 'mse':
            # dL/da, then through the output activation
            delta = (a_L - y_true) / m
            delta = delta * self._activate_derivative(z_values[-1])
        elif loss == 'cross_entropy':
            # For cross-entropy with a sigmoid output, the sigmoid
            # derivative cancels: (a_L - y_true) is already dL/dz
            delta = (a_L - y_true) / m
        
        # Backpropagate (delta holds dL/dz for the current layer)
        for l in range(len(self.weights) - 1, -1, -1):
            # Compute gradients
            grad_W = np.dot(delta, activations[l].T)
            grad_b = np.sum(delta, axis=1, keepdims=True)
            
            grads_W.insert(0, grad_W)
            grads_b.insert(0, grad_b)
            
            # Propagate to previous layer and through its activation
            if l > 0:
                delta = np.dot(self.weights[l].T, delta)
                delta = delta * self._activate_derivative(z_values[l - 1])
        
        return grads_W, grads_b
    
    def update(self, grads_W, grads_b, learning_rate):
        """Update weights"""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * grads_W[i]
            self.biases[i] -= learning_rate * grads_b[i]
    
    def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
        """Training loop"""
        losses = []
        for epoch in range(epochs):
            activations, z_values = self.forward(X)
            loss = np.mean((activations[-1] - y)**2)
            losses.append(loss)
            
            grads_W, grads_b = self.backward(X, y, activations, z_values)
            self.update(grads_W, grads_b, learning_rate)
            
            if verbose and epoch % max(1, epochs // 10) == 0:
                print(f"Epoch {epoch}: Loss = {loss:.6f}")
        
        return losses

# Usage example
np.random.seed(42)
network = FeedforwardNetwork([2, 4, 1], activation='sigmoid')
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 1, 0]])
losses = network.train(X, y, epochs=1000, learning_rate=0.5)

Test Your Understanding

Question 1: Why is backpropagation called "backward" propagation?

A) Because it's slower than forward propagation
B) Because it computes gradients starting from the output layer and moving backward to the input
C) Because it reverses the network architecture
D) Because it's done after forward propagation

Question 2: What mathematical principle is backpropagation based on?

A) Chain rule from calculus
B) Matrix multiplication
C) Linear algebra
D) Probability theory

Question 3: In gradient descent, why do we subtract the gradient from weights?

A) To increase the loss
B) To move in the direction that decreases the loss
C) To normalize the weights
D) To prevent overfitting

Question 4: Explain how backpropagation uses the chain rule.

A) The chain rule lets us compute gradients layer by layer. We start with the output layer gradient, then multiply by each layer's derivative going backward, chaining the gradients together to get the final weight updates
B) It multiplies all gradients at once
C) It only uses the output gradient
D) It doesn't use chain rule

Question 5: What happens if the learning rate is too large during gradient descent?

A) The network overshoots the optimal weights, causing loss to increase or oscillate, and may never converge
B) Training becomes faster
C) Nothing changes
D) It prevents overfitting

Question 6: How do you compute the gradient for a weight in the hidden layer?

A) Multiply the error from the next layer by the activation derivative and the input to that weight, then propagate backward using chain rule
B) Use only the output error
C) Random value
D) Same as output layer

Question 7: What is the formula for updating weights using gradient descent?

A) \(W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}\) where alpha is learning rate and L is loss
B) \(W_{new} = W_{old} + \alpha\)
C) \(W_{new} = \alpha \times W_{old}\)
D) \(W_{new} = W_{old} \times \frac{\partial L}{\partial W}\)

Question 8: Why do we need to store activations during forward propagation for backpropagation?

A) We need the activations to compute the derivatives of the activation functions and to multiply with gradients when applying the chain rule
B) They're not needed
C) Only for debugging
D) To save memory

Question 9: What is the difference between batch gradient descent and stochastic gradient descent?

A) Batch uses all training examples to compute gradient before updating weights, while stochastic updates weights after each single example, making it faster but noisier
B) They're the same
C) Batch is always better
D) Stochastic uses all data

Question 10: How does backpropagation handle bias gradients?

A) Bias gradients are computed similarly to weight gradients, but since bias is added (not multiplied), its gradient is just the error signal from the next layer
B) Biases don't have gradients
C) Same as weight gradients
D) Biases are not updated

Question 11: What causes vanishing gradients in deep networks?

A) When gradients become extremely small as they propagate backward through many layers, often due to activation functions with small derivatives like sigmoid, causing early layers to learn very slowly
B) Too many neurons
C) Large learning rate
D) Too much data

Question 12: How would you debug a network that's not learning during backpropagation?

A) Check gradient magnitudes (should not be zero or extremely small), verify loss is decreasing, check learning rate, inspect weight updates, ensure activations are in reasonable range, verify backprop implementation is correct
B) Increase learning rate
C) Add more layers
D) Use more data