Decision Trees

A decision tree recursively splits the dataset into smaller regions such that the prediction error within each region is minimized.

STEP 1: Objective of Decision Tree

For regression trees, the objective is to minimize 01.00.00 Mean Squared Error:

MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y})^{2}

At each step, the algorithm searches for a split that minimizes the MSE of the resulting partitions.

The goal is to create regions where the target values are as similar as possible.

Code

import pandas as pd

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"

columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df = pd.read_csv(url, names=columns)

# Separate features and target
X = df.drop("class", axis=1)
y = df["class"]
df.head(10)

STEP 2: Prediction in Each Region

In a regression tree, the prediction assigned to each region is the mean of the target values within that region.

For a region $R$ :

{\hat{y}}_{R} = \frac{1}{| R |} \sum_{i \in R} y_{i}

The mean is chosen because it minimizes the squared error within that region.

Code

def predict_region(y_values):
    return y_values.mean()

Explanation of Code

y_values represents all target values in a region.
The function returns the mean, which is the optimal prediction for minimizing MSE.

STEP 3: Finding the Best Split

At each node, the decision tree evaluates all possible splits across all features and selects the split that minimizes the total Mean Squared Error after the split.

For a feature $X_{j}$ , we first sort the values and consider candidate splits:

s = \frac{x_{j}^{(k)} + x_{j}^{(k + 1)}}{2}

For each split $s$ , we divide the data into:

Left region: $R_{L} = x ∣ X_{j} \leq s$
Right region: $R_{R} = x ∣ X_{j} > s$

Prediction in each region

{\hat{y}}_{L} = \frac{1}{n_{L}} \sum_{i \in R_{L}} y_{i}

{\hat{y}}_{R} = \frac{1}{n_{R}} \sum_{i \in R_{R}} y_{i}

Total MSE after split

{MSE}_{s p l i t} = \frac{n_{L}}{n} \sum_{i \in R_{L}} (y_{i} - {\hat{y}}_{L})^{2} + \frac{n_{R}}{n} \sum_{i \in R_{R}} (y_{i} - {\hat{y}}_{R})^{2}

Objective

min_{j, s} {MSE}_{s p l i t}

Key Idea

The best split is the one that creates two regions where the target values are as homogeneous as possible.

Code

import numpy as np

def find_best_split(X, y):
    best_feature = None
    best_split = None
    best_mse = float("inf")

    n, p = X.shape

    for feature in range(p):
        sorted_idx = np.argsort(X[:, feature])
        X_sorted = X[sorted_idx, feature]
        y_sorted = y[sorted_idx]

        for i in range(1, n):
            s = (X_sorted[i-1] + X_sorted[i]) / 2

            left_y = y_sorted[:i]
            right_y = y_sorted[i:]

            y_left_mean = left_y.mean()
            y_right_mean = right_y.mean()

            mse_left = np.mean((left_y - y_left_mean) ** 2)
            mse_right = np.mean((right_y - y_right_mean) ** 2)

            mse_split = (len(left_y)/n) * mse_left + (len(right_y)/n) * mse_right

            if mse_split < best_mse:
                best_mse = mse_split
                best_feature = feature
                best_split = s

    return best_feature, best_split

Explanation of Code

Loop over each feature
Sort data along that feature
Try all midpoint splits
Compute:
- mean of left and right
- MSE of both sides
Compute weighted MSE
Track the split with lowest MSE

This is a greedy algorithm. It chooses the best split at the current node without considering future splits.

STEP 4: Split the Tree

After finding the optimal feature $X_{j}$ and split value $s^{*}$ from Step 3, the dataset is partitioned into two regions.

R_{L} = x ∣ X_{j} \leq s^{*}

R_{R} = x ∣ X_{j} > s^{*}

Prediction in Each Region

For each region, the prediction is the mean of the target values:

{\hat{y}}_{L} = \frac{1}{n_{L}} \sum_{i \in R_{L}} y_{i}

{\hat{y}}_{R} = \frac{1}{n_{R}} \sum_{i \in R_{R}} y_{i}

The split creates two child nodes, each with its own prediction and its own data subset.

Code

def split_data(X, y, feature, threshold):
    left_mask = X[:, feature] <= threshold
    right_mask = X[:, feature] > threshold

    X_left = X[left_mask]
    y_left = y[left_mask]

    X_right = X[right_mask]
    y_right = y[right_mask]

    return X_left, y_left, X_right, y_right

STEP 5: Recursive Splitting (Growing the Tree)

After splitting the data into child nodes, the decision tree applies the same process recursively on each node until a stopping condition is met.

Mathematical View

At each node, we again solve:

min_{j, s} {MSE}_{s p l i t}

but now using only the data in that node.

Each node represents a region $R$ , and we further partition it into:

R_{L}, R_{R}

This creates a hierarchical partitioning of the feature space.

Key Idea

Each split reduces the MSE locally, and the tree grows by repeatedly applying:

Find best split
Partition data
Repeat on each child

Code (Recursive Structure)

def build_tree(X, y, depth=0, max_depth=3):
    # stopping condition
    if len(y) <= 1 or depth == max_depth:
        return {"prediction": y.mean()}

    feature, threshold = find_best_split(X, y)

    X_left, y_left, X_right, y_right = split_data(X, y, feature, threshold)

    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X_left, y_left, depth + 1, max_depth),
        "right": build_tree(X_right, y_right, depth + 1, max_depth)
    }

Explanation of Code

At each node:
- Find best split
- Split data
- Call the same function on left and right
This is recursion
Tree grows top-down

Recursion stops when:

Max depth reached
Number of samples is too small
MSE reduction is very small
All target values are the same

STEP 6: Leaf Node (Final Prediction)

A leaf node is a terminal node where no further splitting occurs, and it outputs a prediction.

For a leaf node region $R$ :

\hat{y} = \frac{1}{| R |} \sum_{i \in R} y_{i}

Each leaf stores a constant value which is the mean of the target values in that region.

Code

if len(y) <= 1 or depth == max_depth:
    return {"prediction": y.mean()}

Explanation

When stopping condition is met
We compute mean of $y$
That becomes the prediction

🎯 Interview Questions

→ 02.01.01 Decision Trees - Interview