Chapter 1: Introduction to Machine Learning
Part of Machine Learning Fundamentals.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain what machine learning is and how its core concepts (data, features, labels, training, inference) fit together.
- Connect these concepts to a practical model-building workflow, from data collection to deployment.
- Recognize common assumptions, pitfalls, and evaluation choices.
Understanding the fundamentals of machine learning, types of ML, and setting up your development environment.
What is Machine Learning?
Machine Learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task. Instead of following rigid rules, ML algorithms identify patterns in data and use them to make predictions or decisions.
Simple Example: Predicting House Prices
Let's see how ML works with a simple example: given a house's features, such as its square footage and number of bedrooms, a trained model outputs a predicted price.
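A minimal sketch of the house-price idea, assuming scikit-learn is installed. The five training rows and the pricing rule behind them are made up for illustration, so the numbers carry no real-world meaning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [square_feet, bedrooms] -> price in dollars.
# Prices follow price = 100 * square_feet + 10_000 * bedrooms exactly,
# so the fitted model's behavior is easy to check by hand.
X = np.array([
    [1000, 2],
    [1500, 3],
    [2000, 3],
    [2500, 4],
    [3000, 5],
])
y = 100 * X[:, 0] + 10_000 * X[:, 1]

model = LinearRegression()
model.fit(X, y)                      # training: learn the pattern from examples

new_house = np.array([[1800, 3]])    # inference: a house the model has not seen
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price:,.0f}")
```

Because the toy prices were generated by a linear rule, the model recovers it and predicts about $210,000 for the new house. Real data is noisier, so real predictions come with error.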
Key Concepts
- Data: The foundation of ML - without quality data, ML cannot work effectively
- Features: The input variables that the model uses to make predictions
- Labels/Targets: The output we want to predict
- Training: The process of teaching the model to recognize patterns
- Inference: Using the trained model to make predictions on new data
Types of Machine Learning
1. Supervised Learning
Learning from labeled data where we know the correct answers.
Regression
Predicting continuous values (e.g., house prices, temperature)
Classification
Predicting categories (e.g., spam/not spam, cat/dog)
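The regression case predicts a number; classification predicts a category. A small sketch of the spam example, assuming two hypothetical features per email (link count and exclamation-mark count) and logistic regression as one possible classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data: [num_links, num_exclamation_marks] per email.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [6, 4], [7, 5], [8, 3], [9, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = not spam, 1 = spam

clf = LogisticRegression()
clf.fit(X, y)

new_emails = np.array([[0, 1], [8, 5]])
labels = clf.predict(new_emails)          # predicted category for each row
probs = clf.predict_proba(new_emails)     # estimated probability of each class
print(labels)
print(probs.round(2))
```

The first new email resembles the "not spam" examples and the second resembles the "spam" ones, so the predicted labels follow that split.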
2. Unsupervised Learning
Finding patterns in data without labeled examples.
Clustering
Grouping similar data points together
Dimensionality Reduction
Reducing the number of features while preserving information
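Both unsupervised tasks can be sketched on the same unlabeled data. The example below assumes two synthetic, well-separated blobs in three dimensions, with k-means for clustering and PCA for dimensionality reduction as illustrative choices of algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two hypothetical blobs in 3-D, centered far apart; no labels are used.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each point to one of two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project 3 features down to 2.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(np.bincount(cluster_ids))             # points per cluster
print(X_2d.shape)                           # (100, 2)
print(pca.explained_variance_ratio_.round(3))
```

The explained variance ratio indicates how much of the original spread each retained component keeps, which is one way to judge how much information the reduction preserved.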
3. Reinforcement Learning
Learning through trial and error with rewards and penalties.
Simple RL Example: Grid World
An agent learns to navigate to a goal by receiving rewards.
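One way to make this concrete is tabular Q-learning on a tiny world. The sketch below is an assumption-laden simplification: the grid is a one-dimensional corridor of five cells, the agent explores with purely random actions, and the learning rate and discount factor are illustrative values.

```python
import numpy as np

N_STATES = 5             # corridor cells 0..4; the goal is cell 4
GOAL = N_STATES - 1
ACTIONS = (-1, +1)       # index 0 = move left, index 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (illustrative)

def step(state, action_idx):
    """Move one cell, staying inside the corridor; reward 1 only at the goal."""
    next_state = min(max(state + ACTIONS[action_idx], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))   # Q[state, action] value estimates

for episode in range(500):
    state = 0
    for _ in range(200):                          # cap the episode length
        action = rng.integers(len(ACTIONS))       # explore with random actions
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward
        # reward + discounted value of the best next action.
        target = reward + (0.0 if done else GAMMA * Q[next_state].max())
        Q[state, action] += ALPHA * (target - Q[state, action])
        state = next_state
        if done:
            break

greedy_policy = Q[:GOAL].argmax(axis=1)   # best action in each non-goal cell
print(greedy_policy)
print(Q.round(2))
```

In the printed policy, 1 means "move right". After training, the greedy choice in every non-goal cell points toward the goal, even though the agent was never told where the goal is; it only saw rewards.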
Machine Learning Workflow
1. Data Collection
Gathering relevant data from various sources (databases, APIs, files)
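import pandas as pd
import requests
# Load data from a CSV file
data = pd.read_csv('house_prices.csv')
# Load data from an API (placeholder URL)
response = requests.get('https://api.example.com/data', timeout=10)
response.raise_for_status()  # stop early on HTTP errors
api_data = response.json()   # separate name so the CSV data is not overwritten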
2. Data Preprocessing
Raw data is rarely ready for modeling. Preprocessing turns messy tables into consistent numeric inputs, and doing it carefully avoids mistakes that only show up at test time.
What you are trying to fix
- Missing values: optional fields, sensor gaps, export errors
- Wrong types: numbers as strings, dates as text
- Duplicates & invalid rows: repeated IDs, negative prices
- Outliers: extremes that dominate distance-based models
- Scale mismatch: features on very different numeric ranges
- Unencoded categories: models need numbers, not labels like "Seattle"
Recommended pipeline (in order)
- Explore & profile: .info(), missing counts, distributions
- Clean structure: duplicates, parse dates, coerce types
- Handle missingness: drop, median, mode, or model imputation per column
- Treat outliers: IQR/z-score, cap, log transform, or keep with justification
- Encode categoricals: one-hot (nominal) or ordinal mapping
- Scale numerics: for k-NN, SVM, neural nets, regularized linear models
- Train/test split: then fit transformers on train only
1. Explore before you transform
import pandas as pd
df = pd.read_csv('house_prices.csv')
print(df.shape)
print(df.dtypes)
print(df.isnull().sum().sort_values(ascending=False))
print(df.describe(include='all'))
If a column is 40% missing, blind fillna(mean) can distort the signal—consider median fill, dropping, or model-based imputation.
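To see why the choice of fill value matters, here is a small sketch on a made-up, right-skewed price column where one luxury listing dominates the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed prices with two missing entries.
prices = pd.Series([200_000, 210_000, 220_000, 230_000, 2_500_000, np.nan, np.nan])

mean_fill = prices.mean()       # pulled upward by the single luxury listing
median_fill = prices.median()   # robust to that extreme value

print(f"missing share: {prices.isna().mean():.0%}")
print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")

filled_mean = prices.fillna(mean_fill)
filled_median = prices.fillna(median_fill)
```

Here the mean fill is 672,000 while the median fill is 220,000. Filling with the mean would insert values that look like none of the typical houses in the table.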
2. Handle missing values
- Drop: for tiny missingness or unusable columns, e.g. df = df.dropna(subset=['price'])
- Impute: median for numeric columns, mode for categorical ones, e.g. df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
- sklearn imputer: reusable inside Pipelines, e.g. SimpleImputer(strategy='median')
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_train_clean = imputer.fit_transform(X_train)
X_test_clean = imputer.transform(X_test)
3. Types, duplicates, invalid values
df['listed_date'] = pd.to_datetime(df['listed_date'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.drop_duplicates(subset=['listing_id'])
df = df[df['price'] > 0]
4. Outliers
Not every extreme is wrong—luxury homes exist. Combine domain knowledge with IQR/z-score or RobustScaler.
Q1, Q3 = df['price'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[df['price'].between(Q1 - 1.5*IQR, Q3 + 1.5*IQR)]
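Filtering as above removes rows. When the extremes are valid, capping keeps every row and only limits the influence of the extreme values. A sketch of capping at the same IQR fences, using a hypothetical six-row table:

```python
import pandas as pd

# Hypothetical prices; the last value is an extreme but plausible listing.
df = pd.DataFrame(
    {'price': [200_000, 220_000, 240_000, 260_000, 280_000, 1_500_000]},
    dtype=float,
)

q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap instead of drop: every row is kept, extremes are pulled to the fence.
df['price_capped'] = df['price'].clip(lower=lower, upper=upper)
print(df)
```

With these numbers the upper fence is 350,000, so the 1,500,000 listing is capped to that value and all six rows remain. Whether capping or dropping is appropriate depends on the domain.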
5. Encode categoricals
One-hot for nominal fields; ordinal when order matters (low < medium < high).
df = pd.get_dummies(df, columns=['neighborhood'], drop_first=True)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
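The sketch below illustrates both encodings on hypothetical data, including what handle_unknown='ignore' does when the test set contains a category that never appeared in training. The 'condition' column and the 'Tacoma' value are assumptions made for the example; the dense conversion uses .toarray() so the snippet does not depend on the sparse_output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test frames; 'Tacoma' never appears in training.
train = pd.DataFrame({'neighborhood': ['Seattle', 'Bellevue', 'Seattle'],
                      'condition': ['low', 'high', 'medium']})
test = pd.DataFrame({'neighborhood': ['Bellevue', 'Tacoma'],
                     'condition': ['medium', 'low']})

# Nominal field: one-hot columns are learned from the training rows only.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['neighborhood']])
test_onehot = encoder.transform(test[['neighborhood']]).toarray()
print(encoder.categories_[0])   # categories learned from train
print(test_onehot)              # the unseen 'Tacoma' row is all zeros

# Ordinal field: an explicit mapping preserves low < medium < high.
condition_order = {'low': 0, 'medium': 1, 'high': 2}
test_condition = test['condition'].map(condition_order)
print(test_condition.tolist())
```

Because the encoder is fitted once and reused, the test set always gets the same columns as the training set, which pd.get_dummies applied separately to each split does not guarantee.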
6. Feature scaling
Tree models (Random Forest, XGBoost) usually do not need scaling, because splits depend only on the ordering of feature values. Distance- and gradient-based models usually do.
- StandardScaler: zero mean, unit variance.
- MinMaxScaler: bounds features to [0, 1].
- RobustScaler: centers on the median and scales by the IQR; resists outliers.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and variance from train
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
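To compare the three scalers side by side, the following sketch applies each one to a single hypothetical feature that contains one extreme value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(X).ravel()
    print(f"{name:<15}", results[name].round(2))
```

With MinMaxScaler the four ordinary values are squeezed close to 0 because the extreme value defines the top of the range. RobustScaler keeps them spread around 0 and leaves the extreme value visibly extreme, which is the behavior the summary above describes.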
7. Avoid data leakage
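Never fit imputers, encoders, or scalers on the full dataset before train_test_split. Split first, fit them on the training split only, and reuse the fitted transformers on the test split; a Pipeline does this automatically.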
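from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Split first so test rows never influence the fitted transformers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# preprocess can be any transformer; a StandardScaler is used here as an example
preprocess = StandardScaler()
pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])
pipe.fit(X_train, y_train)  # fits the scaler and the model on training data only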
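For a table that mixes numeric and categorical columns, the preprocessing step is often a ColumnTransformer. The sketch below is one possible arrangement, not a prescribed one; the ten-row table, its column names, and the prices are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical house table with a missing value and a categorical column.
df = pd.DataFrame({
    'square_feet': [1000, 1500, np.nan, 2000, 2500, 1200, 1800, 3000, 2200, 1600],
    'bedrooms': [2, 3, 3, 4, 4, 2, 3, 5, 4, 3],
    'neighborhood': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'price': [200, 310, 280, 420, 500, 230, 370, 610, 440, 330],  # in $1000s
})
X, y = df.drop(columns='price'), df['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
preprocess = ColumnTransformer([
    ('num', numeric, ['square_feet', 'bedrooms']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['neighborhood']),
])

pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])
pipe.fit(X_train, y_train)          # all statistics come from X_train only
predictions = pipe.predict(X_test)  # the same fitted transforms are reused
print(predictions.round(1))
```

Because the median, the scaling statistics, and the category list are all learned inside pipe.fit, nothing from the test rows can leak into them.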
Common mistakes
- Imputing the target y
- Encoding on train+test before split (column mismatch at inference)
- Mean imputation on skewed price/income without checking distribution
- Removing rare but valid cases as outliers
Interactive walkthrough: dirty → model-ready
Step through a miniature house-price table; each stage shows what changed and why.
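A scripted version of such a walkthrough might look like the following. The five-row table is hypothetical, and for brevity the median is computed on the whole table; in a real project it would be computed on the training split only.

```python
import pandas as pd

# Hypothetical raw listings with typical problems.
raw = pd.DataFrame({
    'listing_id': [1, 2, 2, 3, 4],
    'price': ['250000', '310000', '310000', '-1', 'n/a'],
    'bedrooms': [3, None, None, 2, 4],
})

def report(stage, frame):
    """Print the row count and total missing cells after each stage."""
    missing = int(frame.isna().sum().sum())
    print(f"{stage}: {len(frame)} rows, {missing} missing values")

report('raw', raw)

# Stage 1: coerce types; unparseable prices become NaN.
df = raw.copy()
df['price'] = pd.to_numeric(df['price'], errors='coerce')
report('typed', df)

# Stage 2: drop repeated listings.
df = df.drop_duplicates(subset=['listing_id'])
report('deduplicated', df)

# Stage 3: keep rows with a valid, positive target (NaN prices fail the test).
df = df[df['price'] > 0].copy()
report('valid price', df)

# Stage 4: impute remaining feature gaps with the median.
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
report('imputed', df)
```

The table shrinks from five rows to two, and the final frame has no missing values. Printing the counts after every stage makes it easy to notice when a step removes more data than intended.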
3. Feature Engineering
Creating new features or transforming existing ones
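# Create new features
# Caution: price_per_sqft is derived from price. Use it for exploration only;
# if price is the prediction target, feeding it to the model leaks the answer.
data['price_per_sqft'] = data['price'] / data['square_feet']
data['total_rooms'] = data['bedrooms'] + data['bathrooms']
# Encode categorical variables
data = pd.get_dummies(data, columns=['neighborhood'])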
4. Model Selection & Training
Choosing appropriate algorithms and training the model
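from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Example feature matrix and target; this choice of columns is an assumption
X = data[['square_feet', 'bedrooms', 'bathrooms']]
y = data['price']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model on the training split only
model = LinearRegression()
model.fit(X_train, y_train)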
5. Model Evaluation
Assessing model performance using appropriate metrics
Evaluation Metrics Demo
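As a small illustration of common metrics, the sketch below computes MAE, MSE, RMSE, and R² for a regression example and accuracy for a classification example. All true values and predictions are hypothetical and chosen so the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Regression: hypothetical true prices and predictions, in $1000s.
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_pred = np.array([210.0, 290.0, 420.0, 480.0])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # share of variance explained

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.2f}  R^2={r2:.2f}")

# Classification: hypothetical labels (1 = spam, 0 = not spam).
labels_true = [1, 0, 1, 1]
labels_pred = [1, 0, 0, 1]
accuracy = accuracy_score(labels_true, labels_pred)
print(f"accuracy={accuracy:.2f}")
```

The errors here are 10, 10, 20, and 20, giving MAE = 15 and MSE = 250, and three of the four labels match, giving accuracy = 0.75. Which metric is appropriate depends on the task: squared-error metrics penalize large misses more heavily than MAE does.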
6. Model Deployment
Making the model available for real-world use
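import pickle
# Save model
with open('house_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Load and use model (only unpickle files from a trusted source)
with open('house_price_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
# Feature order must match training; assumed here: [square_feet, bedrooms, bathrooms]
prediction = loaded_model.predict([[1500, 3, 2]])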
Setting Up Your Development Environment
1. Python Installation
Make sure you have Python 3.8+ installed:
# Check Python version
python --version
# Should show Python 3.8.x or higher
2. Virtual Environment
Create a virtual environment to manage dependencies:
# Create virtual environment
python -m venv ml_env
# Activate on Windows
ml_env\Scripts\activate
# Activate on macOS/Linux
source ml_env/bin/activate
3. Install Required Packages
Install the essential ML libraries:
# Install core packages
pip install numpy pandas scikit-learn matplotlib seaborn
# Install additional useful packages
pip install jupyter notebook plotly
4. Verify Installation
Test Your Setup
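One way to check the installation is a short script that tries to import each package and prints its version. This is a sketch; note that scikit-learn is installed under the name scikit-learn but imported as sklearn.

```python
import importlib
import sys

# Import names of the packages installed in the previous step.
packages = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]

versions = {}
for name in packages:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None   # not installed in this environment

print("Python", sys.version.split()[0])
for name, version in versions.items():
    status = version if version else "MISSING (install it with pip)"
    print(f"{name:<12} {status}")
```

If any line reports MISSING, activate the virtual environment and rerun the pip install command from the previous step.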
5. Jupyter Notebook Setup
Jupyter Notebook is excellent for ML development:
# Install Jupyter
pip install jupyter
# Start Jupyter Notebook
jupyter notebook
# This opens the Jupyter interface in a browser window
Essential Python Libraries for ML
NumPy
Fundamental package for numerical computing
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
mean = np.mean(arr)
std = np.std(arr)
Pandas
Data manipulation and analysis
import pandas as pd
# Read data
df = pd.read_csv('data.csv')
# Data exploration
print(df.head())
print(df.describe())
# Data filtering
filtered = df[df['price'] > 100000]
Scikit-learn
Machine learning algorithms and tools
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
Matplotlib & Seaborn
Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Create plots
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.show()
# Seaborn for statistical plots
sns.regplot(x='feature', y='target', data=df)
Library Comparison Demo
See how different libraries work together:
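A compact sketch of the libraries cooperating: NumPy generates data, pandas holds and summarizes it, and scikit-learn fits and scores a model. The data is synthetic, with an assumed relationship of price = 150 * square_feet plus noise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# NumPy: generate hypothetical data (price = 150 * square_feet + noise).
rng = np.random.default_rng(42)
square_feet = rng.uniform(800, 3000, size=100)
price = 150 * square_feet + rng.normal(0, 10_000, size=100)

# pandas: hold the data in a labeled table and summarize it.
df = pd.DataFrame({'square_feet': square_feet, 'price': price})
print(df.describe().round(0))

# scikit-learn: split, fit, and score on held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    df[['square_feet']], df['price'], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
slope = model.coef_[0]
r2 = model.score(X_test, y_test)
print(f"learned slope: {slope:.1f}, test R^2: {r2:.3f}")
```

The learned slope should land close to the 150 used to generate the data. A natural next step is to pass df to Matplotlib or Seaborn to plot the points and the fitted line.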
Chapter 1 Quiz
Test your understanding of the concepts covered in this chapter.