Chapter 2: Python Stack for ML & Data Science
Python Stack for ML & Data Science in ML Software Engineering: Interview Concept Review.
Learning Objectives
By the end of this chapter, you will be able to:
- Relate Python Stack for ML & Data Science to common ML software engineering interview questions and trade-offs.
- Explain when this topic deserves a deeper pass through another tutorial on this site versus staying at recap depth.
- Surface assumptions, pitfalls, and follow-up probes an interviewer is likely to use.
Python stack recruiters expect
Interviewers rarely ask PEP 8 trivia; they probe whether you can move from exploratory notebook chaos to repeatable scripts that answer “what changed?” and “does it regress?”
Venv or conda? Pick one for environment isolation and say so. Pin versions with requirements.txt (venv) or conda env export (conda) so that CI installs are reproducible, and do not hedge philosophically between the two.
Notebooks vs. modules. Notebooks accelerate EDA but hide state. Know how you would migrate cells into parameterized Python modules that tests can call.
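As a minimal sketch of that migration, assume a notebook cell that shuffled row indices using globals for the seed and split ratio. The names `SplitConfig` and `train_test_split_indices` are hypothetical, chosen only for illustration; the point is that every input becomes an explicit parameter, so a test can call the function without any hidden cell state.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class SplitConfig:
    """Hypothetical parameters that used to live as notebook globals."""
    test_fraction: float = 0.2
    seed: int = 0


def train_test_split_indices(n_rows: int, cfg: SplitConfig) -> tuple[np.ndarray, np.ndarray]:
    """Return shuffled (train, test) index arrays; depends only on its arguments."""
    rng = np.random.default_rng(cfg.seed)  # local RNG, no global state
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * cfg.test_fraction))
    return order[n_test:], order[:n_test]


def test_split_is_deterministic_and_disjoint() -> None:
    """A pytest-style check that the same config gives the same split."""
    cfg = SplitConfig(test_fraction=0.25, seed=42)
    train_a, test_a = train_test_split_indices(100, cfg)
    train_b, test_b = train_test_split_indices(100, cfg)
    assert np.array_equal(train_a, train_b)
    assert np.array_equal(test_a, test_b)
    assert len(test_a) == 25
    assert set(train_a).isdisjoint(test_a)
```

In an interview, the design choice worth narrating is that the seed travels with the config rather than being set once at the top of a notebook.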
Tooling narration
Use a debugger or logging instead of scattered print calls; mention structuring logs with context (experiment id, data slice, git SHA).
For performance questions, adopt a %timeit mentality: distinguish micro-benchmarks of a single operation from wall-clock timing of the whole pipeline. GPUs change the picture, so account honestly for I/O and Python overhead.
Type hints are optional, but they signal collaboration maturity, especially when paired with lightweight mypy checks.
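One way to attach that context is the standard library's `logging.LoggerAdapter`, sketched below. The helper name `make_run_logger`, the field names, and the sample values are assumptions for illustration, not a prescribed schema.

```python
import io
import logging


def make_run_logger(experiment_id: str, data_slice: str, git_sha: str,
                    stream=None) -> logging.LoggerAdapter:
    """Build a logger whose every record carries the run context (hypothetical helper)."""
    handler = logging.StreamHandler(stream)  # stream=None falls back to stderr
    handler.setFormatter(logging.Formatter(
        "%(levelname)s exp=%(experiment_id)s slice=%(data_slice)s "
        "sha=%(git_sha)s %(message)s"
    ))
    logger = logging.getLogger(f"train.{experiment_id}")
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from re-emitting records
    context = {
        "experiment_id": experiment_id,
        "data_slice": data_slice,
        "git_sha": git_sha,
    }
    return logging.LoggerAdapter(logger, context)


# Illustrative values only; a real run would read the SHA from the repository.
buffer = io.StringIO()
log = make_run_logger("exp-001", "val-2024", "abc1234", stream=buffer)
log.info("epoch=%d loss=%.3f", 1, 0.4213)
print(buffer.getvalue(), end="")
# INFO exp=exp-001 slice=val-2024 sha=abc1234 epoch=1 loss=0.421
```

Compared with print, every line is now filterable by experiment and traceable to a commit.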
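The difference between the two kinds of measurement can be sketched as follows. The `pipeline_step` function is a hypothetical stand-in, and its `time.sleep` call only imitates I/O wait; actual numbers depend entirely on the machine.

```python
import time
import timeit

import numpy as np

x = np.random.default_rng(0).standard_normal((256, 256))

# Micro-benchmark: repeat one small operation many times and keep the best run,
# which is roughly what %timeit does inside a notebook.
micro_seconds = min(timeit.repeat(lambda: x @ x, number=20, repeat=3)) / 20


def pipeline_step(batch: np.ndarray) -> float:
    """Stand-in for load + preprocess + compute (hypothetical)."""
    time.sleep(0.01)  # imitates waiting on disk or network
    normalized = (batch - batch.mean()) / batch.std()
    return float((normalized @ normalized).sum())


# Wall-clock: time the whole step end to end, including the simulated wait.
start = time.perf_counter()
for _ in range(5):
    pipeline_step(x)
wall_seconds = (time.perf_counter() - start) / 5

print(f"matmul only: {micro_seconds * 1e3:.3f} ms")
print(f"whole step:  {wall_seconds * 1e3:.3f} ms")
```

A fast micro-benchmark says little about the pipeline if most of each step is spent waiting.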
NumPy broadcasting & shaping
Many vectorization bugs hinge on unintended broadcasting. Practice describing shapes aloud: weights [d_out, d_in], batches [B, …], attention [B, heads, T, Tk].
import numpy as np

a = np.random.randn(64, 10)  # batch of 64 rows, 10 features each
b = np.random.randn(10, 32)  # matrix mapping 10 features to 32 outputs
print(np.matmul(a, b).shape)  # (64, 32)
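A common instance of unintended broadcasting, sketched here under the assumption that a model returns a column vector while the labels are one-dimensional. The `mse` helper is hypothetical; it shows one defensive option, which is to reject mismatched shapes rather than let NumPy broadcast silently.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.standard_normal(5)        # shape (5,)
y_pred = rng.standard_normal((5, 1))   # shape (5, 1), e.g. a column output

# (5, 1) - (5,) broadcasts to (5, 5): no error, but the result is wrong.
bad_diff = y_pred - y_true
# Flattening the prediction first gives the intended elementwise difference.
good_diff = y_pred.reshape(-1) - y_true

print(bad_diff.shape)   # (5, 5)
print(good_diff.shape)  # (5,)


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error that refuses to broadcast (hypothetical helper)."""
    pred, target = np.asarray(pred), np.asarray(target)
    if pred.shape != target.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {target.shape}")
    return float(np.mean((pred - target) ** 2))
```

The silent version still produces a scalar loss after averaging, which is why the bug tends to surface as poor metrics rather than as an exception.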
Relate reshaping to memory layout (C vs Fortran order) only if the interviewer leads there; otherwise skip it unless the role is performance-focused.
Common slips
- Claiming reproducibility without seeding RNGs or fixing a data sharding strategy.
- Mutating globals across notebook cells without noticing.
- Failing to mention a profiler when diagnosing slow training; I/O often dominates more than softmax.
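The profiler point from the list above can be illustrated with the standard library's `cProfile`. This is a toy sketch: `load_batch`, `softmax`, and `train_steps` are hypothetical, and the `time.sleep` call only simulates I/O wait.

```python
import cProfile
import io
import pstats
import time

import numpy as np


def load_batch() -> np.ndarray:
    """Hypothetical loader; the sleep imitates disk or network wait."""
    time.sleep(0.02)
    return np.ones((64, 128))


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def train_steps(n_steps: int) -> None:
    """Toy loop: load a batch, then run the compute."""
    for _ in range(n_steps):
        softmax(load_batch())


profiler = cProfile.Profile()
profiler.enable()
train_steps(5)
profiler.disable()

# Sort by cumulative time and show the top entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(8)
print(report.getvalue())
```

In this toy setup the cumulative time for `load_batch` should dwarf that of `softmax`, which is the kind of evidence worth gathering before optimizing the math.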
1. Best rationale for pinning dependency versions?