Chapter 1: Introduction to Neural Networks
From Biological Neurons to Artificial Networks - Understanding the foundation of deep learning
Learning Objectives
- Understand the biological inspiration behind neural networks
- Master the perceptron model and its limitations
- Learn the architecture of multi-layer perceptrons (MLPs)
- Understand the universal approximation theorem
- Implement a simple neural network from scratch
- Recognize when to use neural networks vs other ML methods
Biological Inspiration: The Human Brain
🧠 The Biological Neuron
Neural networks are inspired by how the human brain works. Your brain contains approximately 86 billion neurons, each connected to thousands of other neurons through structures called synapses. When a neuron receives enough input signals, it "fires" and sends signals to connected neurons.
Key Components of a Biological Neuron:
- Dendrites: Receive input signals from other neurons
- Cell Body (Soma): Processes the incoming signals
- Axon: Transmits output signals to other neurons
- Synapses: Connections between neurons that can strengthen or weaken
From Biology to Mathematics
Artificial neural networks mimic this biological process using mathematical operations:
Biological Process → Mathematical Model
| Biological Component | Mathematical Equivalent |
|---|---|
| Input signals (dendrites) | Input features x₁, x₂, ..., xₙ |
| Synaptic strength | Weights w₁, w₂, ..., wₙ |
| Neuron activation threshold | Bias term b |
| Neuron firing | Activation function f(·) |
| Output signal (axon) | Output y = f(Σwᵢxᵢ + b) |
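To make this mapping concrete, here is a minimal sketch (hypothetical inputs, weights, and bias; NumPy assumed) of a single artificial neuron computing y = f(Σwᵢxᵢ + b) with a step activation:

import numpy as np

# Hypothetical values for one artificial neuron
x = np.array([0.5, 0.2, 0.9])    # input signals (dendrites)
w = np.array([0.4, 0.7, 0.1])    # synaptic strengths (weights)
b = -0.3                         # activation threshold (bias)

z = np.dot(w, x) + b             # weighted sum of the inputs
y = 1 if z >= 0 else 0           # the neuron "fires" if the sum crosses the threshold
print(z, y)                      # z ≈ 0.13 ≥ 0, so the neuron fires (y = 1)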
📚 Real-World Analogy
Think of a neuron like a voting committee:
- Each committee member (input feature) has a different influence (weight)
- Some members' votes count more than others (higher weights)
- The committee needs a minimum number of "yes" votes to make a decision (threshold/bias)
- Once the threshold is reached, the committee makes a decision (activation)
Example: Deciding if you should go to a movie:
- Input 1: "Is it a good movie?" (weight: 0.8 - very important)
- Input 2: "Do I have time?" (weight: 0.6 - important)
- Input 3: "Is it expensive?" (weight: 0.3 - less important)
- Bias: -0.5 (you need enough positive signals to overcome laziness)
- If weighted sum > threshold → Go to movie!
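Plugging in illustrative answers (1 = yes, 0 = no; the values below are hypothetical) makes the arithmetic concrete:

# Hypothetical answers: good movie? yes, have time? yes, expensive? no
inputs  = [1, 1, 0]
weights = [0.8, 0.6, 0.3]
bias    = -0.5

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
decision = "Go to movie!" if weighted_sum > 0 else "Stay home"
print(weighted_sum, decision)    # 0.9 -> Go to movie!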
The Perceptron: The Simplest Neural Network
What is a Perceptron?
The perceptron is the simplest form of a neural network. Invented by Frank Rosenblatt in 1957, it's a single-layer neural network that can learn to classify linearly separable data.
Key Characteristics:
- Takes multiple inputs (features)
- Applies weights to each input
- Sums the weighted inputs
- Applies an activation function (typically step function)
- Produces a binary output (0 or 1)
Mathematical Formulation
Perceptron Formula
Given inputs x = [x₁, x₂, ..., xₙ] and weights w = [w₁, w₂, ..., wₙ], the perceptron computes:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = Σᵢ wᵢxᵢ + b
Where b is the bias term. Then the output is:
y = f(z) = 1 if z ≥ 0, otherwise 0
Formula Breakdown:
- z: The weighted sum (also called the "net input" or "pre-activation")
- wᵢ: Weight for the i-th input feature
- xᵢ: The i-th input feature value
- b: Bias term (allows shifting the decision boundary)
- f(·): Step function (also called Heaviside function)
- y: Binary output (0 or 1)
Vectorized Form
Using linear algebra, we can write this more compactly:
z = wᵀx + b,  y = f(z)
Where:
- wᵀ: Transpose of weight vector (row vector)
- x: Input vector (column vector)
- wᵀx: Dot product (sum of element-wise multiplication)
Why Vectorization Matters:
Vectorized operations are:
- Faster: Can use optimized linear algebra libraries
- Cleaner: Less code, easier to read
- Parallelizable: Modern CPUs/GPUs can process vectors efficiently
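A quick sketch of the two styles side by side (hypothetical values; NumPy assumed):

import numpy as np

w = np.array([0.5, -0.2, 0.8])
x = np.array([1.0, 2.0, 3.0])
b = 0.1

# Explicit loop over features
z_loop = sum(w[i] * x[i] for i in range(len(x))) + b

# Vectorized dot product -- same result, one optimized call
z_vec = np.dot(w, x) + b          # equivalently: w @ x + b
print(z_loop, z_vec)              # both ≈ 2.6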
Concrete Example: AND Gate
Let's build a perceptron that implements an AND logic gate:
Truth Table:
| x₁ | x₂ | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Solution: We need weights w₁ = 1, w₂ = 1, and bias b = -1.5
Verification:
- x₁=0, x₂=0: z = 1×0 + 1×0 - 1.5 = -1.5 → y = 0 ✓
- x₁=0, x₂=1: z = 1×0 + 1×1 - 1.5 = -0.5 → y = 0 ✓
- x₁=1, x₂=0: z = 1×1 + 1×0 - 1.5 = -0.5 → y = 0 ✓
- x₁=1, x₂=1: z = 1×1 + 1×1 - 1.5 = 0.5 → y = 1 ✓
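The same verification can be done in a few lines of NumPy using the hand-picked weights above:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = np.array([1.0, 1.0])
b = -1.5

z = X @ w + b                     # weighted sums for all four input pairs
y = np.where(z >= 0, 1, 0)        # step function
print(y)                          # [0 0 0 1] -- matches the AND truth table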
Python Implementation
import numpy as np

class Perceptron:
    """Simple Perceptron Implementation"""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        """
        Train the perceptron

        Parameters:
            X: Input features (n_samples, n_features)
            y: Target labels (n_samples,)
        """
        n_samples, n_features = X.shape

        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Training loop
        for _ in range(self.n_iterations):
            for idx, x_i in enumerate(X):
                # Compute linear output
                linear_output = np.dot(x_i, self.weights) + self.bias
                # Apply step function
                y_predicted = self.activation(linear_output)
                # Update rule (Perceptron Learning Rule)
                update = self.learning_rate * (y[idx] - y_predicted)
                self.weights += update * x_i
                self.bias += update

    def activation(self, x):
        """Step activation function"""
        return np.where(x >= 0, 1, 0)

    def predict(self, X):
        """Make predictions"""
        linear_output = np.dot(X, self.weights) + self.bias
        return self.activation(linear_output)

# Example usage
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND gate

perceptron = Perceptron()
perceptron.fit(X, y)

# Test
predictions = perceptron.predict(X)
print("Predictions:", predictions)
print("Weights:", perceptron.weights)
print("Bias:", perceptron.bias)
Code Explanation:
- __init__: Initializes learning rate and number of iterations
- fit: Trains the perceptron using the Perceptron Learning Rule
- activation: Step function that outputs 1 if input ≥ 0, else 0
- predict: Makes predictions on new data
- Update Rule: w ← w + η(y - ŷ)x and b ← b + η(y - ŷ), where η is the learning rate
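To trace the update rule once by hand, here is a single hypothetical training step (learning rate η = 0.1, starting from zero weights):

import numpy as np

# Hypothetical state: zero weights, one misclassified sample
w, b, lr = np.array([0.0, 0.0]), 0.0, 0.1
x_i, target = np.array([1.0, 0.0]), 0

y_pred = 1 if np.dot(w, x_i) + b >= 0 else 0    # z = 0 -> predicts 1 (wrong)
update = lr * (target - y_pred)                 # 0.1 * (0 - 1) = -0.1
w, b = w + update * x_i, b + update
print(w, b)                                     # [-0.1  0. ] -0.1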
⚠️ Limitations of Perceptron
The perceptron has a critical limitation: It can only learn linearly separable patterns. This was famously demonstrated by Marvin Minsky and Seymour Papert in 1969 with the XOR problem.
XOR Problem: The XOR (exclusive OR) function cannot be learned by a single perceptron because it's not linearly separable:
| x₁ | x₂ | XOR Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Why it fails: You cannot draw a single straight line to separate the 0s from the 1s. This limitation led to the development of multi-layer perceptrons (MLPs).
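You can watch this failure directly by reusing the Perceptron class above on the XOR labels; this is only a sketch, but however long it trains, its four predictions can never all be correct:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

p = Perceptron(learning_rate=0.1, n_iterations=1000)
p.fit(X, y_xor)
print(p.predict(X))               # never matches [0 1 1 0] exactly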
Multi-Layer Perceptron (MLP): Solving Complex Problems
🏗️ What is an MLP?
A Multi-Layer Perceptron (MLP) is a feedforward neural network with one or more hidden layers. Unlike the single-layer perceptron, MLPs can learn non-linear patterns and solve complex problems like the XOR problem.
Key Components:
- Input Layer: Receives the input features
- Hidden Layer(s): One or more layers between input and output
- Output Layer: Produces the final predictions
- Fully Connected: Every neuron in one layer connects to every neuron in the next
MLP Architecture
Forward Propagation in MLP
For an MLP with L layers, the forward propagation is computed as follows:
For each layer l = 1, 2, ..., L:
z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
a⁽ˡ⁾ = f⁽ˡ⁾(z⁽ˡ⁾)
Notation:
- z⁽ˡ⁾: Pre-activation (weighted sum) at layer l
- a⁽ˡ⁾: Activation (output) at layer l
- W⁽ˡ⁾: Weight matrix for layer l
- b⁽ˡ⁾: Bias vector for layer l
- f⁽ˡ⁾: Activation function for layer l
- a⁽⁰⁾: Input features x
Example: 2-Layer MLP for XOR
Architecture: 2 inputs → 2 hidden neurons → 1 output
Layer 1 (Hidden):
- h₁ = f(w₁₁x₁ + w₁₂x₂ + b₁)
- h₂ = f(w₂₁x₁ + w₂₂x₂ + b₂)
Layer 2 (Output):
- y = f(w₃₁h₁ + w₃₂h₂ + b₃)
With appropriate weights, this MLP can solve XOR! The hidden layer creates non-linear combinations of inputs that make the problem linearly separable in the output layer.
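One hand-picked set of weights that works (a sketch using step activations; many other choices exist): the first hidden neuron computes OR, the second computes AND, and the output fires only when OR is true and AND is false.

import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X @ W1 + b1)

# Output layer: y = 1 only when h1 = 1 and h2 = 0
w2 = np.array([1.0, -2.0])
b2 = -0.5
y = step(H @ w2 + b2)
print(y)                          # [0 1 1 0] -- XOR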
MLP Implementation
import numpy as np

class MLP:
    """Multi-Layer Perceptron Implementation"""

    def __init__(self, layers, activation='relu'):
        """
        Initialize MLP

        Parameters:
            layers: List of layer sizes, e.g., [2, 4, 1] for 2 inputs, 4 hidden, 1 output
            activation: Activation function ('relu', 'sigmoid', 'tanh')
        """
        self.layers = layers
        self.activation = activation
        self.weights = []
        self.biases = []

        # Initialize weights and biases
        for i in range(len(layers) - 1):
            # He initialization (scaled by sqrt(2 / fan_in), well suited to ReLU)
            w = np.random.randn(layers[i], layers[i+1]) * np.sqrt(2.0 / layers[i])
            b = np.zeros((1, layers[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def _activate(self, x):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, x)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(x, -250, 250)))
        elif self.activation == 'tanh':
            return np.tanh(x)
        return x

    def forward(self, X):
        """Forward propagation"""
        a = X
        activations = [a]
        for w, b in zip(self.weights, self.biases):
            z = np.dot(a, w) + b
            a = self._activate(z)
            activations.append(a)
        return activations

    def predict(self, X):
        """Make predictions"""
        activations = self.forward(X)
        return activations[-1]

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR

mlp = MLP([2, 4, 1], activation='sigmoid')
# Note: This is a simplified version. Full training requires backpropagation (covered in Chapter 4)
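Even without training, the forward pass already runs; as a quick check, the untrained network's outputs for the XOR inputs are essentially arbitrary (typically near 0.5 for a sigmoid output):

# Untrained forward pass: random initial weights, so the outputs are not yet meaningful
print(mlp.predict(X))             # shape (4, 1); values differ on every run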
Neural Network Architecture Basics
🏗️ Understanding Network Structure
Neural network architecture refers to the overall design and organization of the network. This includes the number of layers, number of neurons per layer, how layers are connected, and the types of operations performed.
Key Architectural Components:
- Depth: Number of layers (shallow vs deep networks)
- Width: Number of neurons per layer
- Connections: How neurons connect (fully connected, sparse, etc.)
- Activation Functions: Non-linear transformations at each layer
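Depth and width together determine how many trainable parameters a fully connected network has. A quick sketch of the count for a hypothetical [2, 4, 1] architecture:

# Each fully connected layer has (inputs x outputs) weights plus one bias per output
layers = [2, 4, 1]                # hypothetical: 2 inputs -> 4 hidden -> 1 output

n_params = sum(layers[i] * layers[i + 1] + layers[i + 1]
               for i in range(len(layers) - 1))
print(n_params)                   # (2*4 + 4) + (4*1 + 1) = 17 parameters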
Layer Types
Common Layer Types
| Layer Type | Purpose | Example Use |
|---|---|---|
| Dense/Fully Connected | Every neuron connects to all neurons in next layer | Standard MLPs, classification |
| Convolutional | Sparse connections, shared weights | Image processing, CNNs |
| Recurrent | Connections form cycles, maintain state | Sequences, RNNs, LSTMs |
Universal Approximation Theorem
The Power of Neural Networks
The Universal Approximation Theorem is a fundamental result that explains why neural networks are so powerful. It states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy, given a suitable non-linear activation function and appropriate weights.
Mathematical Statement
For any continuous function f: [0,1]ⁿ → ℝ and any ε > 0, there exists a feedforward neural network with:
- A single hidden layer containing sufficiently many neurons
- A suitable non-linear activation function (e.g., sigmoid, ReLU)
such that the network's output differs from f by less than ε everywhere on [0,1]ⁿ.
What This Means:
- Any continuous function: No matter how complex the mapping, some network can represent it (approximately)
- Arbitrary accuracy: The approximation error can be made as small as you want, given enough neurons
- Single hidden layer: Even shallow networks have this representational power
- Practical limitation: The theorem only says such weights exist - it doesn't tell us how to find them or how many neurons are needed!
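Because the theorem is about representation rather than training, it can be illustrated without any learning: hand-placing many steep sigmoid "steps" in a single hidden layer traces out a continuous curve. The construction below is a hypothetical sketch that approximates f(x) = sin(2πx) on [0, 1]:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -250, 250)))

# Target function to approximate on [0, 1]
f = lambda x: np.sin(2 * np.pi * x)

# One steep sigmoid per break point; its outgoing weight is the jump in f
# between consecutive break points (a smoothed staircase approximation)
n_hidden = 50
breaks = np.linspace(0, 1, n_hidden)
steepness = 500.0
jumps = np.diff(f(breaks), prepend=f(0.0))

x = np.linspace(0, 1, 200).reshape(-1, 1)
hidden = sigmoid(steepness * (x - breaks))    # hidden activations, shape (200, n_hidden)
approx = hidden @ jumps                       # output layer: weighted sum of hidden units

print(np.max(np.abs(approx - f(x).ravel())))  # modest (~0.1); more, steeper units drive it lower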
📚 Real-World Implication
This theorem helps explain why neural networks work so well in practice:
- They can represent an enormous range of patterns (given enough capacity)
- There is less need to hand-design features - the network can learn useful representations
- Deep networks (multiple layers) can often represent the same functions far more efficiently than shallow ones
- This representational power is a key reason deep learning has been so successful
Complete Code Example
Simple Neural Network from Scratch
import numpy as np

class SimpleNeuralNetwork:
    """A simple feedforward neural network implementation"""

    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights randomly
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(x, -250, 250)))

    def forward(self, X):
        """Forward propagation"""
        # Layer 1
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        # Layer 2 (output)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def predict(self, X):
        """Make predictions"""
        return self.forward(X)

# Example usage
# Create network: 2 inputs → 3 hidden → 1 output
network = SimpleNeuralNetwork(input_size=2, hidden_size=3, output_size=1)

# Test input
X = np.array([[0.5, 0.8]])
output = network.predict(X)

print(f"Input: {X}")
print(f"Output: {output}")
print(f"Prediction: {'Positive' if output[0, 0] > 0.5 else 'Negative'}")
Code Breakdown:
- __init__: Initializes weights and biases for 2-layer network
- sigmoid: Activation function (clips to prevent overflow)
- forward: Computes output through both layers
- predict: Wrapper for making predictions