Chapter 1: Introduction to Neural Networks

From Biological Neurons to Artificial Networks - Understanding the foundation of deep learning

Learning Objectives

  • Understand the biological inspiration behind neural networks
  • Master the perceptron model and its limitations
  • Learn the architecture of multi-layer perceptrons (MLPs)
  • Understand the universal approximation theorem
  • Implement a simple neural network from scratch
  • Recognize when to use neural networks vs other ML methods

Biological Inspiration: The Human Brain

🧠 The Biological Neuron

Neural networks are inspired by how the human brain works. Your brain contains approximately 86 billion neurons, each connected to thousands of other neurons through structures called synapses. When a neuron receives enough input signals, it "fires" and sends signals to connected neurons.

Key Components of a Biological Neuron:

  • Dendrites: Receive input signals from other neurons
  • Cell Body (Soma): Processes the incoming signals
  • Axon: Transmits output signals to other neurons
  • Synapses: Connections between neurons that can strengthen or weaken

From Biology to Mathematics

Artificial neural networks mimic this biological process using mathematical operations:

Biological Process → Mathematical Model

Biological Component          | Mathematical Equivalent
------------------------------|----------------------------------------
Input signals (dendrites)     | Input features x₁, x₂, ..., xₙ
Synaptic strength             | Weights w₁, w₂, ..., wₙ
Neuron activation threshold   | Bias term b
Neuron firing                 | Activation function f(·)
Output signal (axon)          | Output y = f(Σwᵢxᵢ + b)
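
To make this mapping concrete, here is a minimal sketch of a single artificial neuron in NumPy, computing y = f(Σwᵢxᵢ + b) with a step function; all numbers are illustrative.

import numpy as np

# Illustrative inputs, weights, and bias (arbitrary values)
x = np.array([0.2, 0.7, 0.1])    # input signals (dendrites)
w = np.array([0.5, -0.3, 0.8])   # synaptic strengths (weights)
b = 0.1                          # threshold shift (bias)

z = np.dot(w, x) + b             # weighted sum of inputs
y = 1 if z >= 0 else 0           # neuron "fires" if z crosses zero (step activation)

print(f"z = {z:.2f}, y = {y}")   # z = 0.07, y = 1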

📚 Real-World Analogy

Think of a neuron like a voting committee:

  • Each committee member (input feature) has a different influence (weight)
  • Some members' votes count more than others (higher weights)
  • The committee needs a minimum number of "yes" votes to make a decision (threshold/bias)
  • Once the threshold is reached, the committee makes a decision (activation)

Example: Deciding if you should go to a movie:

  • Input 1: "Is it a good movie?" (weight: 0.8 - very important)
  • Input 2: "Do I have time?" (weight: 0.6 - important)
  • Input 3: "Is it expensive?" (weight: -0.3 - expense counts against going, but matters less)
  • Bias: -0.5 (you need enough positive signal to overcome laziness)
  • If the weighted sum plus bias is ≥ 0 → Go to the movie! (see the sketch below)
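
A minimal sketch of this decision as a single neuron, using the illustrative weights above (every value here is made up for the analogy):

import numpy as np

# "Good movie?", "Have time?", "Expensive?" - 1 means yes, 0 means no
inputs = np.array([1, 1, 1])
weights = np.array([0.8, 0.6, -0.3])   # expense pushes against going
bias = -0.5                            # the laziness to overcome

z = np.dot(weights, inputs) + bias     # weighted evidence plus bias
decision = "Go to the movie!" if z >= 0 else "Stay home."
print(f"z = {z:.2f} -> {decision}")    # z = 0.60 -> Go to the movie!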

The Perceptron: The Simplest Neural Network

What is a Perceptron?

The perceptron is the simplest form of a neural network. Invented by Frank Rosenblatt in 1957, it's a single-layer neural network that can learn to classify linearly separable data.

Key Characteristics:

  • Takes multiple inputs (features)
  • Applies weights to each input
  • Sums the weighted inputs
  • Applies an activation function (typically step function)
  • Produces a binary output (0 or 1)

Mathematical Formulation

Perceptron Formula

Given inputs x = [x₁, x₂, ..., xₙ] and weights w = [w₁, w₂, ..., wₙ], the perceptron computes:

\[z = \sum_{i=1}^{n} w_i x_i + b\]

Where b is the bias term. Then the output is:

\[y = f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}\]
Formula Breakdown:
  • z: The weighted sum (also called the "net input" or "pre-activation")
  • wᵢ: Weight for the i-th input feature
  • xᵢ: The i-th input feature value
  • b: Bias term (allows shifting the decision boundary)
  • f(·): Step function (also called Heaviside function)
  • y: Binary output (0 or 1)

Vectorized Form

Using linear algebra, we can write this more compactly:

z = wᵀx + b

Where:

  • wᵀ: Transpose of weight vector (row vector)
  • x: Input vector (column vector)
  • wᵀx: Dot product (sum of element-wise multiplication)
Why Vectorization Matters:

Vectorized operations are:

  • Faster: Can use optimized linear algebra libraries
  • Cleaner: Less code, easier to read
  • Parallelizable: Modern CPUs/GPUs can process vectors efficiently
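
A small illustrative comparison of the two styles (arbitrary values); both compute the same z = wᵀx + b:

import numpy as np

w = np.array([0.4, -0.2, 0.1, 0.7])
x = np.array([1.0, 2.0, 3.0, 4.0])
b = 0.5

# Loop version: explicit sum of element-wise products
z_loop = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# Vectorized version: one optimized dot product (equivalently w @ x + b)
z_vec = np.dot(w, x) + b

print(z_loop, z_vec)   # both give 3.6 (up to floating-point rounding)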

Concrete Example: AND Gate

Let's build a perceptron that implements an AND logic gate:

Truth Table:

x₁ x₂ Output
0 0 0
0 1 0
1 0 0
1 1 1

Solution: We need weights w₁ = 1, w₂ = 1, and bias b = -1.5

Verification:

  • x₁=0, x₂=0: z = 1×0 + 1×0 - 1.5 = -1.5 → y = 0 ✓
  • x₁=0, x₂=1: z = 1×0 + 1×1 - 1.5 = -0.5 → y = 0 ✓
  • x₁=1, x₂=0: z = 1×1 + 1×0 - 1.5 = -0.5 → y = 0 ✓
  • x₁=1, x₂=1: z = 1×1 + 1×1 - 1.5 = 0.5 → y = 1 ✓
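
The same check in a few lines of NumPy, using the hand-picked weights (no training involved yet):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # all AND-gate inputs
w = np.array([1.0, 1.0])                         # hand-picked weights
b = -1.5                                         # hand-picked bias

z = X @ w + b                  # weighted sums for all four rows at once
y = np.where(z >= 0, 1, 0)     # step activation
print(y)                       # [0 0 0 1], matching the truth table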

Python Implementation

import numpy as np

class Perceptron:
    """Simple Perceptron Implementation"""
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
    
    def fit(self, X, y):
        """
        Train the perceptron
        
        Parameters:
        X: Input features (n_samples, n_features)
        y: Target labels (n_samples,)
        """
        n_samples, n_features = X.shape
        
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Training loop
        for _ in range(self.n_iterations):
            for idx, x_i in enumerate(X):
                # Compute linear output
                linear_output = np.dot(x_i, self.weights) + self.bias
                
                # Apply step function
                y_predicted = self.activation(linear_output)
                
                # Update rule (Perceptron Learning Rule)
                update = self.learning_rate * (y[idx] - y_predicted)
                self.weights += update * x_i
                self.bias += update
    
    def activation(self, x):
        """Step activation function"""
        return np.where(x >= 0, 1, 0)
    
    def predict(self, X):
        """Make predictions"""
        linear_output = np.dot(X, self.weights) + self.bias
        return self.activation(linear_output)

# Example usage
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND gate

perceptron = Perceptron()
perceptron.fit(X, y)

# Test
predictions = perceptron.predict(X)
print("Predictions:", predictions)
print("Weights:", perceptron.weights)
print("Bias:", perceptron.bias)
Code Explanation:
  • __init__: Initializes learning rate and number of iterations
  • fit: Trains the perceptron using the Perceptron Learning Rule
  • activation: Step function that outputs 1 if input ≥ 0, else 0
  • predict: Makes predictions on new data
  • Update Rule: w ← w + η(y - ŷ)x and b ← b + η(y - ŷ), where η is the learning rate (see the worked update below)
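
A worked application of this rule with made-up numbers, assuming a learning rate of 0.1 and one misclassified example:

import numpy as np

eta = 0.1                           # learning rate
w = np.array([0.0, 0.0])            # current weights
b = 0.0                             # current bias

x_i = np.array([1, 1])              # a training example ...
y_true, y_pred = 1, 0               # ... that the perceptron currently gets wrong

update = eta * (y_true - y_pred)    # = 0.1
w = w + update * x_i                # w becomes [0.1, 0.1]
b = b + update                      # b becomes 0.1
print(w, b)                         # [0.1 0.1] 0.1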

⚠️ Limitations of Perceptron

The perceptron has a critical limitation: It can only learn linearly separable patterns. This was famously demonstrated by Marvin Minsky and Seymour Papert in 1969 with the XOR problem.

XOR Problem: The XOR (exclusive OR) function cannot be learned by a single perceptron because it's not linearly separable:

x₁ x₂ XOR Output
0 0 0
0 1 1
1 0 1
1 1 0

Why it fails: You cannot draw a single straight line to separate the 0s from the 1s. This limitation led to the development of multi-layer perceptrons (MLPs).
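
You can see this empirically with the Perceptron class defined above: trained on the XOR truth table, it never classifies all four rows correctly, no matter how many passes it makes (a quick illustrative check, not a proof).

import numpy as np

# Assumes the Perceptron class defined earlier in this chapter
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

p = Perceptron(learning_rate=0.1, n_iterations=1000)
p.fit(X, y_xor)

print("Predictions:", p.predict(X))                  # never equals [0 1 1 0]
print("Accuracy:", np.mean(p.predict(X) == y_xor))   # at most 0.75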

Multi-Layer Perceptron (MLP): Solving Complex Problems

🏗️ What is an MLP?

A Multi-Layer Perceptron (MLP) is a feedforward neural network with one or more hidden layers. Unlike the single-layer perceptron, MLPs can learn non-linear patterns and solve complex problems like the XOR problem.

Key Components:

  • Input Layer: Receives the input features
  • Hidden Layer(s): One or more layers between input and output
  • Output Layer: Produces the final predictions
  • Fully Connected: Every neuron in one layer connects to every neuron in the next

MLP Architecture

Forward Propagation in MLP

For an MLP with L layers, the forward propagation is computed as follows:

For each layer l = 1, 2, ..., L:

\[z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}\]
\[a^{(l)} = f^{(l)}(z^{(l)})\]
Notation:
  • z⁽ˡ⁾: Pre-activation (weighted sum) at layer l
  • a⁽ˡ⁾: Activation (output) at layer l
  • W⁽ˡ⁾: Weight matrix for layer l
  • b⁽ˡ⁾: Bias vector for layer l
  • f⁽ˡ⁾: Activation function for layer l
  • a⁽⁰⁾: Input features x

Example: 2-Layer MLP for XOR

Architecture: 2 inputs → 2 hidden neurons → 1 output

Layer 1 (Hidden):

  • h₁ = f(w₁₁x₁ + w₁₂x₂ + b₁)
  • h₂ = f(w₂₁x₁ + w₂₂x₂ + b₂)

Layer 2 (Output):

  • y = f(w₃₁h₁ + w₃₂h₂ + b₃)

With appropriate weights, this MLP can solve XOR! The hidden layer creates non-linear combinations of inputs that make the problem linearly separable in the output layer.
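
For example, one well-known hand-crafted solution (sketched here with step activations, an illustrative choice) makes the hidden units compute OR and NAND of the inputs, and the output unit ANDs them together:

import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 1 computes OR(x1, x2), column 2 computes NAND(x1, x2)
W1 = np.array([[1.0, -1.0],
               [1.0, -1.0]])
b1 = np.array([-0.5, 1.5])

# Output layer: AND(h1, h2), which equals XOR(x1, x2)
W2 = np.array([[1.0],
               [1.0]])
b2 = np.array([-1.5])

H = step(X @ W1 + b1)      # hidden activations
Y = step(H @ W2 + b2)      # network output
print(Y.ravel())           # [0 1 1 0]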

MLP Implementation

import numpy as np

class MLP:
    """Multi-Layer Perceptron Implementation"""
    
    def __init__(self, layers, activation='relu'):
        """
        Initialize MLP
        
        Parameters:
        layers: List of layer sizes, e.g., [2, 4, 1] for 2 inputs, 4 hidden, 1 output
        activation: Activation function ('relu', 'sigmoid', 'tanh')
        """
        self.layers = layers
        self.activation = activation
        self.weights = []
        self.biases = []
        
        # Initialize weights and biases
        for i in range(len(layers) - 1):
            # He initialization: scale by sqrt(2 / fan_in), well suited to ReLU
            w = np.random.randn(layers[i], layers[i+1]) * np.sqrt(2.0 / layers[i])
            b = np.zeros((1, layers[i+1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def _activate(self, x):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, x)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(x, -250, 250)))
        elif self.activation == 'tanh':
            return np.tanh(x)
        return x
    
    def forward(self, X):
        """Forward propagation"""
        a = X
        activations = [a]
        
        for w, b in zip(self.weights, self.biases):
            z = np.dot(a, w) + b
            a = self._activate(z)
            activations.append(a)
        
        return activations
    
    def predict(self, X):
        """Make predictions"""
        activations = self.forward(X)
        return activations[-1]

# Example: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR

mlp = MLP([2, 4, 1], activation='sigmoid')
# Note: This is a simplified version. Full training requires backpropagation (covered in Chapter 4)
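# Even without training we can run a forward pass; with random weights the
# outputs below are arbitrary - learning them is the job of backpropagation
untrained_outputs = mlp.predict(X)
print(untrained_outputs.round(3))   # shape (4, 1), values not yet meaningful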

Neural Network Architecture Basics

🏗️ Understanding Network Structure

Neural network architecture refers to the overall design and organization of the network. This includes the number of layers, number of neurons per layer, how layers are connected, and the types of operations performed.

Key Architectural Components:

  • Depth: Number of layers (shallow vs deep networks)
  • Width: Number of neurons per layer
  • Connections: How neurons connect (fully connected, sparse, etc.)
  • Activation Functions: Non-linear transformations at each layer
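
In terms of the MLP class sketched earlier in this chapter, depth and width map directly onto the layer-size list (the sizes below are illustrative only):

# Depth = number of layers, width = neurons per layer (illustrative sizes)
shallow_wide = MLP([2, 64, 1])         # one hidden layer, 64 neurons wide
deep_narrow  = MLP([2, 8, 8, 8, 1])    # three hidden layers, 8 neurons each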

Layer Types

Common Layer Types

Layer Type              | Purpose                                              | Example Use
------------------------|------------------------------------------------------|------------------------------
Dense / Fully Connected | Every neuron connects to all neurons in next layer   | Standard MLPs, classification
Convolutional           | Sparse, local connections with shared weights        | Image processing, CNNs
Recurrent               | Connections form cycles, maintain state              | Sequences, RNNs, LSTMs

Universal Approximation Theorem

The Power of Neural Networks

The Universal Approximation Theorem is a fundamental result that explains why neural networks are so expressive. It states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy, given a suitable non-linear activation function, enough hidden neurons, and appropriate weights.

Mathematical Statement

For any continuous function f: [0,1]ⁿ → ℝ and any ε > 0, there exists a feedforward neural network with:

• One hidden layer
• Sufficiently many neurons
• Appropriate activation function (e.g., sigmoid, ReLU)

Such that the network approximates f with error less than ε.

What This Means:
  • Any continuous function: No matter how complex, a sufficiently large network can represent it
  • Arbitrary accuracy: You can get as close as you want (given enough neurons)
  • Single hidden layer: Even shallow networks are expressive in principle
  • Practical limitation: The theorem only says such weights exist - it doesn't tell us how to find them or how many neurons are needed! (The sketch below illustrates the idea empirically.)
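
As a rough empirical illustration (assuming scikit-learn is installed), a single-hidden-layer regressor can be fit to a smooth 1-D function such as sin(x); in principle, adding hidden neurons lets the approximation error shrink further:

import numpy as np
from sklearn.neural_network import MLPRegressor

# Target: a continuous function on [0, 2*pi]
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# One hidden layer of 50 tanh neurons (illustrative hyperparameters)
model = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                     max_iter=5000, random_state=0)
model.fit(X, y)

pred = model.predict(X)
print("Max absolute error:", np.abs(pred - y).max())   # small, and shrinks with more neurons (in principle)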

📚 Real-World Implication

This theorem explains why neural networks work so well:

  • They can, in principle, represent almost any input-output pattern (given enough capacity)
  • Less need to hand-design features - the network can learn useful representations from data
  • Deep networks (multiple layers) can often represent the same functions far more compactly
  • This expressive power is one key reason deep learning has been so successful

Complete Code Example

Simple Neural Network from Scratch

import numpy as np

class SimpleNeuralNetwork:
    """A simple feedforward neural network implementation"""
    
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights randomly
        self.W1 = np.random.randn(input_size, hidden_size) * 0.1
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.1
        self.b2 = np.zeros((1, output_size))
    
    def sigmoid(self, x):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(x, -250, 250)))
    
    def forward(self, X):
        """Forward propagation"""
        # Layer 1
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        
        # Layer 2 (output)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        
        return self.a2
    
    def predict(self, X):
        """Make predictions"""
        return self.forward(X)

# Example usage
# Create network: 2 inputs → 3 hidden → 1 output
network = SimpleNeuralNetwork(input_size=2, hidden_size=3, output_size=1)

# Test input
X = np.array([[0.5, 0.8]])
output = network.predict(X)

print(f"Input: {X}")
print(f"Output: {output}")
print(f"Prediction: {'Positive' if output > 0.5 else 'Negative'}")
Code Breakdown:
  • __init__: Initializes weights and biases for 2-layer network
  • sigmoid: Activation function (clips to prevent overflow)
  • forward: Computes output through both layers
  • predict: Wrapper for making predictions
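
Because the forward pass is written with matrix operations, the same network handles a whole batch of inputs at once (rows are samples); a quick illustrative check:

# One output row per input row - no code changes needed for batches
X_batch = np.array([[0.1, 0.2],
                    [0.9, 0.4],
                    [0.3, 0.7]])
print(network.predict(X_batch))   # shape (3, 1)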

Test Your Understanding

Question 1: What is the main limitation of a single-layer perceptron?

A) It can only learn linearly separable patterns
B) It's too slow
C) It requires too much memory
D) It can't handle numerical data

Question 2: What does the Universal Approximation Theorem tell us?

A) Neural networks always find the best solution
B) A neural network can approximate any continuous function with sufficient neurons
C) Neural networks are faster than other methods
D) Neural networks don't need training

Question 3: What is the purpose of activation functions in neural networks?

A) To make computation faster
B) To introduce non-linearity and enable learning complex patterns
C) To reduce memory usage
D) To prevent overfitting

Question 4: Interview question: "Explain the difference between a perceptron and a multi-layer perceptron (MLP)."

A) A perceptron is a single-layer network that can only learn linearly separable patterns, while an MLP has multiple layers with activation functions that can learn non-linear, complex patterns
B) They are the same thing
C) Perceptron is faster
D) MLP uses less memory

Question 5: What is the mathematical representation of a neuron's output?

A) \(y = f(\sum_{i=1}^{n} w_i x_i + b)\) where f is the activation function, w_i are weights, x_i are inputs, and b is bias
B) \(y = \sum w_i x_i\)
C) \(y = w \times x\)
D) \(y = x + b\)

Question 6: Interview question: "How would you initialize weights in a neural network and why?"

A) Use small random values (e.g., Xavier/Glorot or He initialization) to break symmetry, prevent vanishing/exploding gradients, and ensure different neurons learn different features
B) Initialize all weights to zero
C) Initialize all weights to one
D) Use large random values

Question 7: What does the Universal Approximation Theorem guarantee?

A) A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n, given appropriate activation functions
B) Neural networks always converge to the global optimum
C) Neural networks are always better than other methods
D) Neural networks don't need training

Question 8: Interview question: "What happens if you don't use an activation function in a neural network?"

A) The network becomes a linear model, regardless of depth, because the composition of linear transformations is still linear, losing the ability to learn non-linear patterns
B) The network becomes faster
C) The network uses less memory
D) The network becomes more accurate

Question 9: What is the role of bias in a neural network?

A) Bias allows the activation function to shift, enabling the network to fit data that doesn't pass through the origin and learn more flexible decision boundaries
B) Bias makes computation faster
C) Bias prevents overfitting
D) Bias is optional and not needed

Question 10: Interview question: "How would you choose the number of neurons in a hidden layer?"

A) Start with a rule of thumb (e.g., between input and output size, or 2/3 of input size), then use validation set to tune, balancing model capacity (too few = underfitting, too many = overfitting)
B) Always use the same as input size
C) Use as many as possible
D) Use exactly 10 neurons

Question 11: What is forward propagation in a neural network?

A) The process of passing input data through the network layers, computing weighted sums and applying activation functions, to produce an output prediction
B) The process of updating weights
C) The process of calculating loss
D) The process of initializing weights

Question 12: Interview question: "What are the key components of a neural network and how do they work together?"

A) Input layer receives data, hidden layers transform data through weighted connections and activation functions, output layer produces predictions. Weights store learned patterns, biases shift activations, activation functions introduce non-linearity. Forward pass computes predictions, backpropagation updates weights based on error
B) Just weights and inputs
C) Only activation functions
D) Just the output layer