Chapter 10: Supervised Learning — Interview Deep Review - ML Software Engineering: Interview Concept Review

Learning Objectives

By the end of this chapter, you will be able to:

Relate supervised learning concepts to common ML software engineering interview questions and trade-offs.
Explain when this topic deserves a deeper pass through another tutorial on this site versus staying at recap depth.
Surface assumptions, pitfalls, and follow-up probes an interviewer is likely to use.

Supervised I recap: linear world + regularization

Linear regression models the target as an affine function of the features. It works well when the signal is approximately linear; feature scaling matters mainly once you add penalties or fit by gradient descent. Polynomial features increase expressivity but inflate variance, so pair them with penalties or choose the degree by cross-validation.
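To make the polynomial-degree trade-off concrete, here is a minimal sketch that picks a degree on a held-out split. The synthetic data, the 40/20 split, and the degree range are assumptions chosen for illustration, not something prescribed by this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: a smooth nonlinear signal plus noise (illustrative only).
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.shape)

# Simple hold-out split; full k-fold cross-validation follows the same idea.
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def val_mse(degree):
    """Fit a polynomial of the given degree on train, return validation MSE."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    preds = np.polyval(coefs, x_val)
    return float(np.mean((preds - y_val) ** 2))

degrees = list(range(1, 10))
errors = {d: val_mse(d) for d in degrees}
best_degree = min(errors, key=errors.get)
print(best_degree, errors[best_degree])
```

In an interview, the point to make is that the degree is a hyperparameter chosen on data the fit never saw, not on training error.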

L2 (ridge) shrinks coefficients jointly, stabilizing ill-conditioned designs. L1 (lasso) promotes sparsity, which helps when many features are irrelevant and supports a feature-selection narrative. Watch correlated groups, though: lasso may arbitrarily keep one feature and zero out the rest, whereas ridge tends to spread weight across them.
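The correlated-group behavior can be sketched with NumPy alone. This example assumes an extreme case (an exactly duplicated column) and uses a hand-written coordinate-descent lasso; the helper names `ridge_fit` and `lasso_fit` are hypothetical, and a production setting would normally use a library solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# Columns 0 and 1 are exact duplicates; column 2 is irrelevant noise.
X = np.column_stack([z, z, rng.normal(size=n)])
y = 2.0 * z + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge without intercept: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_fit(X, y, lam, n_iters=100):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1, no intercept."""
    n_rows, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n_rows
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's own contribution added back.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n_rows
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

w_ridge = ridge_fit(X, y, lam=10.0)
w_lasso = lasso_fit(X, y, lam=0.1)
print(w_ridge.round(3))
print(w_lasso.round(3))
```

Ridge splits the weight evenly between the duplicated columns, while this lasso solver keeps whichever column it visits first. That visiting order is exactly the arbitrariness to mention in an answer.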

Logistic regression applies a sigmoid to the linear score, mapping ℝ to (0, 1). Never treat raw linear scores as calibrated probabilities without justification: the link function and the training loss both matter.

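import numpy as np

def sigmoid(z):
    # Map real-valued scores to (0, 1); clip first so np.exp cannot overflow.
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))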
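As a rough illustration of why raw linear outputs are not probabilities, the sketch below fits ordinary least squares and a gradient-descent logistic regression to the same binary labels. The toy data, learning rate, and iteration count are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One feature, noisy binary labels, plus one far-out positive point at x = 10.
x = np.array([-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Design matrix with an intercept column.
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares on the 0/1 labels.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_out = A @ coef

# Logistic regression via plain gradient descent on mean log loss.
w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(A @ w)
    w -= 0.1 * A.T @ (p - y) / len(y)
logistic_out = sigmoid(A @ w)

print(linear_out.round(3))
print(logistic_out.round(3))
```

The least-squares prediction at x = 10 exceeds 1, which cannot be a probability, while every logistic output stays strictly between 0 and 1.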

k-NN and distance thinking

Lazy learner: k-NN stores the training data and predicts by voting among the nearest neighbors. The curse of dimensionality makes distance metrics less meaningful in high-dimensional sparse spaces; interviewers want you to tie this to feature scaling, PCA, or learned embeddings.
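A small sketch of the scaling point: with Euclidean distance, a feature on a large numeric scale dominates the vote. The function name `knn_predict` and the four-point dataset are hypothetical, built so that the class depends only on the first feature.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # integer class labels of the neighbors
    return np.array([np.bincount(row).argmax() for row in votes])

# Class is determined by feature 0; feature 1 is noise on a much larger scale.
X_train = np.array([[1.0, 1000.0], [1.2, 3000.0], [3.0, 1100.0], [3.2, 2900.0]])
y_train = np.array([0, 0, 1, 1])
X_query = np.array([[1.1, 2900.0]])

# Standardize with training statistics only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
pred_raw = knn_predict(X_train, y_train, X_query, k=1)
pred_scaled = knn_predict((X_train - mu) / sigma, y_train, (X_query - mu) / sigma, k=1)
print(pred_raw, pred_scaled)  # [1] [0]
```

Without scaling, the nearest neighbor is chosen almost entirely by the second feature and the query is misclassified; after standardization the first feature regains its influence.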

Metrics: articulate the precision/recall trade-off, ROC vs PR curves (per the prior chapter), and class-weighting strategies for imbalance.
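The precision/recall trade-off is easiest to explain by moving a threshold over a fixed set of scores. The labels and scores below are made up for illustration, and `balanced_class_weights` is a hypothetical helper implementing one common inverse-frequency heuristic (n_samples / (n_classes * class_count)).

```python
import numpy as np

def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

def balanced_class_weights(y):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

for t in (0.3, 0.5, 0.8):
    print(t, precision_recall(y_true, scores, t))
print(balanced_class_weights(y_true))
```

On this toy data, raising the threshold from 0.3 to 0.8 lifts precision from about 0.43 to 1.0 while recall drops from 1.0 to about 0.33.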

Supervised II: margins, Bayes, ensembles

Naive Bayes: a cheap generative baseline built on a conditional independence assumption. Correlated features violate that assumption and distort its probability estimates, yet it remains a strong baseline for text tasks such as spam filtering.
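A minimal multinomial Naive Bayes with add-one smoothing shows how the independence assumption turns classification into a sum of per-word log probabilities. The four-word vocabulary, the counts, and the function names are hypothetical.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: (n_docs, n_words) word counts; y: integer class labels."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # Per-class word counts with Laplace (add-alpha) smoothing.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, X):
    classes, log_prior, log_likelihood = model
    # Conditional independence: log P(doc | class) is a weighted sum over words.
    joint = X @ log_likelihood.T + log_prior
    return classes[np.argmax(joint, axis=1)]

# Hypothetical vocabulary: ["free", "win", "meeting", "report"]
X = np.array([[3, 2, 0, 0], [2, 1, 0, 1], [0, 0, 2, 3], [0, 1, 3, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = fit_multinomial_nb(X, y)
preds = predict_multinomial_nb(model, np.array([[2, 1, 0, 0], [0, 0, 1, 2]]))
print(preds)  # [1 0]
```

Smoothing keeps a word unseen in one class from driving that class's probability to zero, which is worth mentioning when asked why the baseline is robust on sparse text.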

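SVM: maximize the margin; kernels lift the data into an implicit feature space. The cost C trades margin width against violations; γ in the RBF kernel controls locality (larger γ means a narrower kernel). Contrast with logistic regression: the SVM solution depends only on its support vectors, while logistic regression supplies probabilistic outputs, which may still need a calibration check.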
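The roles of C and γ can be sketched without a solver by evaluating the soft-margin objective and the RBF kernel directly. The candidate separators and the three-point dataset are assumptions for illustration; this block compares objective values and does not train an SVM.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq)

def soft_margin_objective(w, b, X, y, C):
    """Primal linear SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margins = y * (X @ w + b)  # labels y are in {-1, +1}
    hinge = np.maximum(0.0, 1.0 - margins)
    return float(0.5 * w @ w + C * hinge.sum())

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])

# Two candidate separators: a narrow margin with no violations,
# and a wider margin (smaller ||w||) that tolerates slack.
narrow = (np.array([1.0, 0.0]), -2.0)
wide = (np.array([0.5, 0.0]), -1.0)
for C in (0.01, 10.0):
    print(C, soft_margin_objective(*narrow, X, y, C), soft_margin_objective(*wide, X, y, C))

# Small gamma: distant points still look similar. Large gamma: nearly identity.
print(rbf_kernel(X, X, gamma=0.1).round(3))
print(rbf_kernel(X, X, gamma=10.0).round(3))
```

With small C the wide-margin candidate has the lower objective; with large C the violation-free candidate wins, which is the trade-off described above.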

Decision trees / RF / boosting: see the prior two chapters for depth. Here, emphasize when an interviewer expects you to pivot from linear models to tree ensembles: nonlinear structure, heterogeneous features, and partial missingness handled with surrogate splits (a high-level answer is fine).

Bias–variance & stacking sound bite

Bagging trims variance; boosting attacks bias; stacking meta-learns over base-model predictions but needs careful out-of-fold (OOF) generation to avoid leakage. Say that explicitly.
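Out-of-fold generation for stacking can be sketched as follows. The ridge base model, the five folds, and the synthetic data are assumptions; the only point is that each row's prediction comes from a model that never trained on that row.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge base model (no intercept), used only as a stand-in."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def oof_predictions(X, y, n_folds=5, seed=0):
    """Return one prediction per row, each made by a model fit on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        w = ridge_fit(X[train], y[train])
        oof[held_out] = X[held_out] @ w
    return oof

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

oof = oof_predictions(X, y)
# A meta-learner would be trained on columns like `oof` (one per base model),
# never on in-sample predictions from models that saw the same rows.
print(np.corrcoef(oof, y)[0, 1])
```

Training the meta-learner on in-sample base predictions instead would let it reward whichever base model overfits most, which is the leakage to call out.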

Example FAANG-style follow-ups

L1 vs L2 effect on correlated predictors?
Why not use raw linear regression outputs as probabilities?
ROC mechanics in words without memorizing every threshold.
When does SVM outperform random forests and vice versa?

1. What is the largest difference between logistic and linear regression on binary labels?

Logistic regression uses a Bernoulli likelihood with a logit link, so after training (and calibration where needed) it outputs a proper probability model rather than raw unbounded scores.
Logistic never works on tabular data.

2. SVM margin maximization mainly improves:

Generalization, by seeking a robust separator with slack controlled by C.
Training loss, by always driving it to zero regardless of C.