Chapter 2: Regression Analysis Mastery

From linear relationships to complex polynomial modeling with mathematical foundations

Learning Objectives

  • Master linear regression theory and mathematical foundations
  • Understand polynomial regression and feature engineering
  • Learn multiple regression with feature importance analysis
  • Evaluate models using proper metrics (MSE, MAE, R²)
  • Recognize and prevent overfitting in regression models
  • Apply regularization techniques (Ridge, Lasso)

What is Regression?

Core Concept and Mathematical Foundation

Regression Analysis is a supervised learning technique used to predict continuous numerical values by modeling the relationship between input features and target variables.

🎯 The Fundamental Equation:

y = f(X) + ε
  • y: Target variable (what we want to predict)
  • f(X): The function we want to learn
  • X: Input features (independent variables)
  • ε: Error term (noise and unmeasured factors)
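To make the equation concrete, here is a minimal NumPy sketch (with entirely hypothetical numbers) that generates observations as a true function plus random noise:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical truth: f(x) = 3x + 10, with Gaussian noise playing the role of ε
X = rng.uniform(0, 10, size=100)        # input feature
epsilon = rng.normal(0, 2, size=100)    # error term (noise we never observe directly)
y = 3 * X + 10 + epsilon                # y = f(X) + ε

print(X[:3], y[:3])

Regression tries to recover f(X) from these noisy observations of y.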

Real-World Regression Examples:

Real Estate Pricing

Predict: House price

Features: Size, location, bedrooms, age

Why Linear: Generally, larger houses cost more

Stock Market Analysis

Predict: Stock price movement

Features: Trading volume, market indicators

Challenge: Non-linear, highly volatile

Weather Forecasting

Predict: Tomorrow's temperature

Features: Today's weather, pressure, humidity

Complexity: Seasonal patterns, non-linear trends

Linear Regression: The Foundation

Mathematical Deep Dive

Simple Linear Regression Formula:

y = β₀ + β₁x + ε
  • β₀ (Beta Zero): Y-intercept - value when x = 0
  • β₁ (Beta One): Slope - change in y per unit change in x
  • x: Independent variable (feature)
  • y: Dependent variable (target)
  • ε: Random error term

Key Assumptions of Linear Regression:

1️⃣ Linearity

The relationship between X and y is linear

Check: Scatter plots, residual plots

2️⃣ Independence

Observations are independent of each other

Important for time series and spatial data

3️⃣ Homoscedasticity

Constant variance of residuals

Check: Residuals vs fitted values plot

4️⃣ Normality

Residuals are normally distributed

Check: Q-Q plots, Shapiro-Wilk test (see the diagnostic sketch below)
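A quick diagnostic sketch for the last two assumptions using Matplotlib and SciPy. The names fitted and residuals below are placeholders for ŷ and y - ŷ from a model you have already trained:

import matplotlib.pyplot as plt
from scipy import stats

# 'fitted' and 'residuals' are assumed to come from an already-fitted model:
# fitted = model.predict(X), residuals = y - fitted

# Homoscedasticity / linearity check: residuals vs. fitted values (look for a shapeless cloud)
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Normality check: Q-Q plot and Shapiro-Wilk test
stats.probplot(residuals, dist='norm', plot=plt)
plt.show()
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")  # a small p-value suggests non-normal residuals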

How Linear Regression Works - The Math Behind the Magic:

Ordinary Least Squares (OLS) Method:

Linear regression finds the best line by minimizing the sum of squared residuals:

Objective Function:

Minimize: Σ(yᵢ - ŷᵢ)²

Where ŷᵢ = β₀ + β₁xᵢ (predicted value)

The Solution (for simple linear regression):

Slope (β₁):

β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ((xᵢ - x̄)²)

Intercept (β₀):

β₀ = ȳ - β₁x̄
Intuition: The slope equals the correlation between x and y scaled by the ratio of their standard deviations (β₁ = r·s_y/s_x); the intercept forces the fitted line to pass through the mean point (x̄, ȳ).
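As a minimal sketch, the closed-form solution can be computed directly with NumPy (the x and y values below are hypothetical):

import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta_0 = y_bar - beta_1 * x_bar                                        # intercept

print(f"slope = {beta_1:.3f}, intercept = {beta_0:.3f}")
# Predictions are then y_hat = beta_0 + beta_1 * x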

Multiple Linear Regression

Extending to Multiple Features

Multiple Regression Formula:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

Or in matrix form: y = Xβ + ε

  • p: Number of features
  • βⱼ: Coefficient for feature xⱼ
  • X: Design matrix (n × p matrix)
  • β: Parameter vector
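In this matrix form, the OLS estimate is β̂ = (XᵀX)⁻¹Xᵀy (the normal equations). A minimal NumPy sketch with a small hypothetical design matrix:

import numpy as np

# Hypothetical design matrix: a column of ones (intercept) plus two features
X = np.array([[1, 2.0, 3.0],
              [1, 1.0, 5.0],
              [1, 4.0, 2.0],
              [1, 3.0, 4.0]])
y = np.array([10.0, 12.0, 9.0, 11.0])

# Normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, beta_1, beta_2]

# In practice, prefer np.linalg.lstsq or sklearn's LinearRegression for numerical stability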

Feature Importance and Interpretation:

Coefficient Interpretation:
  • Magnitude: Larger |βⱼ| means more influence on prediction
  • Sign: Positive β increases y, negative β decreases y
  • Units: βⱼ represents change in y per unit change in xⱼ
⚠️ Important Caveat: Coefficients represent the effect of changing one feature while holding all others constant. In practice, features are often correlated!

Multicollinearity: When Features are Too Similar

❌ Problems with Highly Correlated Features:
  • Unstable coefficient estimates
  • Difficult to interpret individual feature importance
  • High variance in predictions
  • Numerical instability in matrix inversion
Detection Methods:
  • Correlation Matrix: Look for correlations > 0.8
  • Variance Inflation Factor (VIF): VIF > 10 indicates problems (a VIF sketch follows this list)
  • Condition Number: > 30 suggests multicollinearity
Solutions:
  • Remove highly correlated features
  • Use Principal Component Analysis (PCA)
  • Apply regularization (Ridge, Lasso)
  • Collect more data if possible
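A minimal VIF sketch using only scikit-learn: each feature is regressed on all the others, and VIFⱼ = 1 / (1 - R²ⱼ). The feature matrix X and the feature_names list are assumed to already exist:

import numpy as np
from sklearn.linear_model import LinearRegression

def vif_scores(X):
    """Variance Inflation Factor for each column of a 2-D feature matrix X."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # every feature except column j
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return vifs

# Example usage (X and feature_names are assumed to be your data):
# for name, v in zip(feature_names, vif_scores(X)):
#     print(f"{name}: VIF = {v:.2f}")  # VIF > 10 flags potential multicollinearity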

Polynomial Regression: Capturing Non-Linear Relationships

Beyond Straight Lines

🔄 Polynomial Transformation:

Polynomial regression extends linear regression by adding polynomial features:

Degree 2 (Quadratic):

y = β₀ + β₁x + β₂x² + ε

Degree 3 (Cubic):

y = β₀ + β₁x + β₂x² + β₃x³ + ε

General Form:

y = β₀ + β₁x + β₂x² + ... + β_d xᵈ + ε, where d is the polynomial degree
Key Insight: Polynomial regression is still linear in the parameters β! We just transform the features.
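A small scikit-learn sketch of the transformation itself, on a hypothetical single feature:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])   # one hypothetical feature

poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

print(poly.get_feature_names_out())   # ['x0' 'x0^2']
print(x_poly)                         # each row is [x, x²]
# Fitting LinearRegression on x_poly is still a linear model in the coefficients β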

The Bias-Variance Tradeoff

Underfitting (High Bias)
  • Model too simple
  • Cannot capture underlying pattern
  • Poor performance on both training and test data
  • Solution: Increase model complexity
Overfitting (High Variance)
  • Model too complex
  • Memorizes training data noise
  • Good training, poor test performance
  • Solution: Reduce complexity or add data
Sweet Spot: Find the optimal degree that minimizes total error = bias² + variance + noise, as the sketch below illustrates
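A minimal sketch on synthetic data (all names and numbers are hypothetical) that contrasts an underfit, a well-matched, and an overfit polynomial:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(120, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.5, size=120)   # true quadratic relationship + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in [1, 2, 12]:   # underfit, about right, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree {degree:2d}: train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")

Watch the gap between training and test error: an overfit model drives training error down while test error stays flat or grows.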

Choosing the Right Polynomial Degree

Practical Guidelines:
  • Degree 1: Linear relationship
  • Degree 2: One curve (parabola) - good for many real-world phenomena
  • Degree 3-4: More complex curves with multiple turns
  • Degree >5: Usually overfitting unless you have lots of data
Selection Methods:
  1. Cross-Validation: Test different degrees and pick the one with the best CV score (see the sketch after this list)
  2. Learning Curves: Plot training vs validation error
  3. Information Criteria: AIC, BIC balance fit and complexity
  4. Domain Knowledge: Physics/business understanding of relationship
Pro Tip: Start simple (degree 1-2) and increase complexity only if validation performance improves!
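A minimal cross-validation sketch for picking the degree with a scikit-learn pipeline (X and y are assumed to be your training features and target):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# X, y are assumed to already exist
best_degree, best_score = None, -np.inf
for degree in [1, 2, 3, 4]:
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(),
                          LinearRegression())
    score = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(f"degree {degree}: mean CV R² = {score:.3f}")
    if score > best_score:
        best_degree, best_score = degree, score

print("Selected degree:", best_degree)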

Regression Evaluation Metrics

Measuring Model Performance

Essential Regression Metrics:

1️⃣ Mean Squared Error (MSE)
MSE = (1/n) Σ(yᵢ - ŷᵢ)²

Pros: Heavily penalizes large errors

Cons: Units are the square of y's units, hard to interpret

Use when: Large errors are especially bad

2️⃣ Root Mean Squared Error (RMSE)
RMSE = √MSE

Pros: Same units as y, interpretable

Cons: Still penalizes large errors heavily

Use when: You want MSE benefits with interpretability

3️⃣ Mean Absolute Error (MAE)
MAE = (1/n) Σ|yᵢ - ŷᵢ|

Pros: Robust to outliers, easy to interpret

Cons: Doesn't distinguish small vs large errors

Use when: You have outliers or all errors are equally bad

4️⃣ R-squared (R²)
R² = 1 - (SS_res / SS_tot)

Range: Typically 0 to 1 (higher is better); can be negative on held-out data when the model is worse than predicting the mean

Interpretation: % of variance explained

Caveat: Can be misleading with non-linear relationships
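The same metrics computed directly from their formulas, as a quick NumPy sketch with hypothetical actual and predicted values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_true - y_pred))

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
# These match sklearn.metrics.mean_squared_error, mean_absolute_error, and r2_score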

Which Metric to Use?

  • RMSE: Most common, good for normally distributed errors
  • MAE: When you have outliers or skewed error distribution
  • R²: For understanding model explanatory power
  • Multiple metrics: Always use several metrics for complete picture!

Regularization: Preventing Overfitting

Ridge and Lasso Regression

Why Regularization?

When we have many features or polynomial terms, the model can become too complex and overfit. Regularization adds a penalty term to prevent this.

General Regularized Objective:

Minimize: MSE + λ × Penalty(β)

Where λ (lambda) controls the strength of regularization

Ridge Regression (L2)

Penalty = Σβⱼ²
Characteristics:
  • Shrinks coefficients toward zero
  • Keeps all features (no feature selection)
  • Good when all features are somewhat relevant
  • Handles multicollinearity well
Best for: Many relevant features

Lasso Regression (L1)

Penalty = Σ|βⱼ|
Characteristics:
  • Can set coefficients exactly to zero
  • Automatic feature selection
  • Produces sparse models
  • Good when only some features are relevant
Best for: Feature selection needed

Choosing λ (Regularization Strength):

  • λ = 0: No regularization (standard regression)
  • Small λ: Light penalty, close to unregularized
  • Large λ: Heavy penalty, coefficients shrink toward zero
  • λ → ∞: All coefficients approach zero (underfitting)
Selection Method: Use cross-validation to find the optimal λ, i.e. the value that minimizes validation error (see the sketch below).
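A minimal sketch using scikit-learn's built-in cross-validated estimators, RidgeCV and LassoCV (X_train_scaled and y_train are assumed to be your standardized features and target, as in the hands-on section below):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

alphas = np.logspace(-3, 3, 25)   # candidate λ values (scikit-learn calls them alpha)

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_train_scaled, y_train)

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)
print("Lasso features kept:", np.sum(lasso_cv.coef_ != 0))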

Key Takeaways and Best Practices

✅ Chapter 2 Mastery:

• Linear regression mathematical foundations and assumptions

• Multiple regression with feature importance interpretation

• Polynomial regression for non-linear relationships

• Comprehensive evaluation metrics (MSE, RMSE, MAE, R²)

• Overfitting detection and regularization techniques

• Practical model selection and validation strategies

🎓 Practical Guidelines for Regression Success:

  1. Always start simple: Begin with linear regression before trying polynomial
  2. Check assumptions: Plot residuals to verify linearity and homoscedasticity
  3. Handle multicollinearity: Use correlation matrices and VIF
  4. Use multiple metrics: Don't rely on R² alone
  5. Validate properly: Use cross-validation for model selection
  6. Consider regularization: Especially with many features or limited data
  7. Understand your domain: Let business knowledge guide feature engineering

Hands-On Python Implementation

Linear Regression with scikit-learn

Complete Boston Housing Example

# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # NOTE: removed in scikit-learn 1.2, so this requires scikit-learn < 1.2
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Load the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['PRICE'] = boston.target
print("Boston Housing Dataset Shape:", df.shape)
print("\nFeatures:", list(boston.feature_names))
print("\nFirst 5 rows:")
print(df.head())
# Exploratory Data Analysis
print("Dataset Info:")
print(df.info())
print("\nMissing values:")
print(df.isnull().sum())
# Statistical summary
print("\nStatistical Summary:")
print(df.describe())
# Correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Boston Housing: Feature Correlation Matrix')
plt.tight_layout()
plt.show()
# Features most correlated with price
price_corr = correlation_matrix['PRICE'].abs().sort_values(ascending=False)
print("\nFeatures most correlated with PRICE:")
print(price_corr[1:6])
# Data Preparation
X = boston.data
y = boston.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Linear Regression Implementation

# 1. Simple Linear Regression (using one feature)
simple_model = LinearRegression()
X_simple = X_train[:, 5].reshape(-1, 1) # RM (average rooms)
X_simple_test = X_test[:, 5].reshape(-1, 1)
simple_model.fit(X_simple, y_train)
y_pred_simple = simple_model.predict(X_simple_test)
print("Simple Linear Regression (RM vs PRICE):")
print(f"Coefficient (slope): {simple_model.coef_[0]:.3f}")
print(f"Intercept: {simple_model.intercept_:.3f}")
print(f"R² Score: {r2_score(y_test, y_pred_simple):.3f}")
# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_simple_test, y_test, alpha=0.7, label='Actual')
plt.plot(X_simple_test, y_pred_simple, color='red', linewidth=2, label='Predicted')
plt.xlabel('Average Rooms (RM)')
plt.ylabel('House Price ($1000s)')
plt.title('Simple Linear Regression: Rooms vs Price')
plt.legend()
plt.show()

Multiple Linear Regression

# Multiple Linear Regression
mlr_model = LinearRegression()
mlr_model.fit(X_train_scaled, y_train)
y_pred_mlr = mlr_model.predict(X_test_scaled)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred_mlr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_mlr)
r2 = r2_score(y_test, y_pred_mlr)
print("Multiple Linear Regression Results:")
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE: {mae:.3f}")
print(f"R² Score: {r2:.3f}")
# Feature importance (coefficient analysis)
feature_importance = pd.DataFrame({
    'feature': boston.feature_names,
    'coefficient': mlr_model.coef_,
    'abs_coefficient': np.abs(mlr_model.coef_)
}).sort_values('abs_coefficient', ascending=False)
print("\nFeature Importance (by coefficient magnitude):")
print(feature_importance.head(10))

Polynomial Regression

# Polynomial Regression with different degrees
degrees = [1, 2, 3, 4]
poly_results = {}
for degree in degrees:
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_poly = poly_features.fit_transform(X_train_scaled)
    X_test_poly = poly_features.transform(X_test_scaled)

    # Train model
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly, y_train)

    # Predictions
    y_train_pred = poly_model.predict(X_train_poly)
    y_test_pred = poly_model.predict(X_test_poly)

    # Calculate scores
    train_score = r2_score(y_train, y_train_pred)
    test_score = r2_score(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

    poly_results[degree] = {
        'train_r2': train_score,
        'test_r2': test_score,
        'test_rmse': test_rmse,
        'features': X_train_poly.shape[1]
    }

# Display results
print("Polynomial Regression Results:")
print("Degree | Features | Train R² | Test R² | Test RMSE")
print("-" * 50)
for degree, results in poly_results.items():
    print(f" {degree} | {results['features']:3d} | {results['train_r2']:.3f} | {results['test_r2']:.3f} | {results['test_rmse']:.3f}")

Regularization: Ridge and Lasso

# Ridge Regression with different alpha values
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
ridge_results = {}
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train_scaled, y_train)
    y_pred_ridge = ridge_model.predict(X_test_scaled)
    ridge_results[alpha] = {
        'r2': r2_score(y_test, y_pred_ridge),
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred_ridge))
    }
# Lasso Regression
lasso_results = {}
for alpha in alphas:
    lasso_model = Lasso(alpha=alpha, max_iter=1000)
    lasso_model.fit(X_train_scaled, y_train)
    y_pred_lasso = lasso_model.predict(X_test_scaled)
    # Count non-zero coefficients (features kept by Lasso)
    non_zero_coefs = np.sum(lasso_model.coef_ != 0)
    lasso_results[alpha] = {
        'r2': r2_score(y_test, y_pred_lasso),
        'rmse': np.sqrt(mean_squared_error(y_test, y_pred_lasso)),
        'features_selected': non_zero_coefs
    }
# Display regularization results
print("Ridge Regression Results:")
print("Alpha | R² | RMSE")
print("-" * 20)
for alpha, results in ridge_results.items():
    print(f"{alpha:6.1f} | {results['r2']:.3f} | {results['rmse']:.3f}")
print("\nLasso Regression Results:")
print("Alpha | R² | RMSE | Features Selected")
print("-" * 35)
for alpha, results in lasso_results.items():
    print(f"{alpha:6.1f} | {results['r2']:.3f} | {results['rmse']:.3f} | {results['features_selected']:8d}")

Model Selection with GridSearchCV

# Grid search for optimal Ridge alpha
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100, 1000]}
ridge_grid = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring='r2', n_jobs=-1
)
ridge_grid.fit(X_train_scaled, y_train)
print("Best Ridge parameters:", ridge_grid.best_params_)
print("Best cross-validation score:", ridge_grid.best_score_.round(3))
# Final model evaluation
best_ridge = ridge_grid.best_estimator_
final_predictions = best_ridge.predict(X_test_scaled)
final_r2 = r2_score(y_test, final_predictions)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(f"\nFinal Model Performance:")
print(f"Test R²: {final_r2:.3f}")
print(f"Test RMSE: {final_rmse:.3f}")
# Residual analysis
residuals = y_test - final_predictions
plt.figure(figsize=(12, 4))
# Residual plot
plt.subplot(1, 2, 1)
plt.scatter(final_predictions, residuals, alpha=0.7)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
# Actual vs Predicted
plt.subplot(1, 2, 2)
plt.scatter(y_test, final_predictions, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'red', linewidth=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted')
plt.tight_layout()
plt.show()
Expected Results:
  • Linear Regression R² ≈ 0.67
  • Polynomial features improve performance but risk overfitting
  • Ridge/Lasso regularization prevents overfitting
  • GridSearchCV finds optimal hyperparameters
  • Final RMSE around 4-5 (thousands of dollars)

Congratulations!

You've completed Chapter 2 and built a solid foundation in Regression!
