Chapter 3: Python Implementation

Learn how to implement decision trees in Python using scikit-learn and build your own from scratch.

Learning Objectives

  • Learn to use scikit-learn's DecisionTreeClassifier
  • Understand key parameters and how to tune them
  • Build a custom decision tree from scratch
  • Visualize decision trees effectively
  • Handle real-world datasets with decision trees

Using scikit-learn

🐍 scikit-learn: Your Decision Tree Toolkit

scikit-learn provides powerful, optimized implementations of decision trees that handle all the complex math for you. You just need to understand how to use them effectively!

Basic Implementation

Simple Decision Tree Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load sample data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train decision tree
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)

print(f"Accuracy: {accuracy:.2f}")

Key Parameters

🎯 criterion

How to measure split quality

Options: 'gini', 'entropy', 'log_loss'

📏 max_depth

Maximum depth of the tree

Default: None (unlimited)

🍃 min_samples_split

Minimum samples needed to split

Default: 2

🍂 min_samples_leaf

Minimum samples in leaf nodes

Default: 1
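Putting the Parameters Together

As a starting point, you can set these parameters explicitly when constructing the classifier. The values below are illustrative choices to experiment with, not universally optimal settings:

clf = DecisionTreeClassifier(
    criterion='gini',       # measure split quality with Gini impurity
    max_depth=4,            # cap how deep the tree can grow
    min_samples_split=10,   # require 10 samples before splitting a node
    min_samples_leaf=5,     # keep at least 5 samples in each leaf
    random_state=42,
)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")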

Custom Implementation

🔨 Building Your Own Decision Tree

While scikit-learn is powerful, understanding how to build a decision tree from scratch helps you truly understand the algorithm and customize it for specific needs.

Core Components

Node Class

class TreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature to split on
        self.threshold = threshold  # Threshold value
        self.left = left           # Left child
        self.right = right         # Right child
        self.value = value         # Prediction (for leaf nodes)
        
    def is_leaf(self):
        return self.value is not None
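Growing the Tree

With TreeNode in place, the remaining pieces are an impurity measure, a search for the best split, and a recursive builder. Below is a minimal sketch of that logic, assuming NumPy arrays of numeric features and non-negative integer class labels; the class and helper names (SimpleDecisionTree, _best_split, and so on) are illustrative, not part of any library:

import numpy as np

def gini(y):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

class SimpleDecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def fit(self, X, y):
        self.root = self._grow(X, y, depth=0)
        return self

    def _grow(self, X, y, depth):
        # Stop if the node is pure, too small, or too deep
        if (depth >= self.max_depth
                or len(y) < self.min_samples_split
                or len(np.unique(y)) == 1):
            return TreeNode(value=np.bincount(y).argmax())
        feature, threshold = self._best_split(X, y)
        if feature is None:  # no split reduces impurity
            return TreeNode(value=np.bincount(y).argmax())
        mask = X[:, feature] <= threshold
        return TreeNode(feature=feature, threshold=threshold,
                        left=self._grow(X[mask], y[mask], depth + 1),
                        right=self._grow(X[~mask], y[~mask], depth + 1))

    def _best_split(self, X, y):
        # Try every feature/threshold pair and keep the one
        # with the largest impurity reduction
        best_gain, best = 0.0, (None, None)
        parent = gini(y)
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                mask = X[:, f] <= t
                if mask.all() or not mask.any():
                    continue  # split puts everything on one side
                child = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
                if parent - child > best_gain:
                    best_gain, best = parent - child, (f, t)
        return best

    def predict(self, X):
        return np.array([self._walk(x, self.root) for x in X])

    def _walk(self, x, node):
        if node.is_leaf():
            return node.value
        child = node.left if x[node.feature] <= node.threshold else node.right
        return self._walk(x, child)

You can train and evaluate this on the iris split from earlier, e.g. SimpleDecisionTree(max_depth=3).fit(X_train, y_train).predict(X_test). Note that the brute-force split search visits every feature/threshold pair at every node; this is exactly the cost that optimized libraries like scikit-learn work hard to reduce.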

Tree Visualization

👁️ Seeing Your Decision Tree

Visualizing decision trees helps you understand how they make decisions and debug any issues. There are several ways to visualize trees in Python.

Text Visualization

Using sklearn.tree.export_text

from sklearn.tree import export_text

# Export tree as text
tree_rules = export_text(clf, feature_names=iris.feature_names)
print(tree_rules)

Graph Visualization

Using sklearn.tree.plot_tree

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the decision tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names, 
          class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()
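Using sklearn.tree.export_graphviz

A third option, assuming the optional graphviz package is installed, is to export the tree as DOT source and render it to a file. A short sketch reusing the fitted clf from above:

from sklearn.tree import export_graphviz
import graphviz

# Export the tree as DOT source, then render it with graphviz
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True)
graphviz.Source(dot_data).render("decision_tree")  # writes decision_tree.pdf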

Parameters & Tuning

⚙️ Tuning Your Decision Tree

Decision trees have several parameters that control their behavior. Understanding these parameters is crucial for building effective models.

Preventing Overfitting

📏 Maximum Depth

Limit how deep the tree can grow

Tip: Start with 3-5, increase if underfitting

🍃 Minimum Samples Split

Require minimum samples to create a split

Tip: Use 10-20 for small datasets

🍂 Minimum Samples Leaf

Require minimum samples in leaf nodes

Tip: Use 5-10 to prevent tiny leaves

🔍 Maximum Features

Maximum features to consider at each split

Tip: Use 'sqrt' or 'log2' for high-dimensional data
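
Searching for Good Parameters

Rather than tuning these values by hand, you can let cross-validation pick them. Here is a minimal sketch using scikit-learn's GridSearchCV (the parameter grid below is an illustrative assumption, not a recommended default), reusing the imports and data split from earlier:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.2f}")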

Interactive Python Demo

🐍 Try Python Implementation

Experiment with different parameters and see how they affect the decision tree's performance and structure!

Click "Load Dataset" to start

Python code and results will appear here

Chapter 3 Quiz

🧠 Test Your Python Knowledge

Answer these questions about Python implementation!

Question 1: Which parameter controls the maximum depth of a decision tree?

Question 2: What is the default criterion for DecisionTreeClassifier?