Build an end-to-end ML classification pipeline
Company: Nextdoor
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Technical Screen
Given a tabular dataset in a CSV file, implement an end-to-end pipeline to perform a classification task. Requirements:
(1) load the data;
(2) create stratified train/validation/test splits;
(3) handle missing values and encode categorical features;
(4) standardize numeric features;
(5) train a simple baseline (e.g., logistic regression) and at least one stronger model (e.g., gradient boosting or a small neural network);
(6) tune key hyperparameters with cross-validation;
(7) report accuracy, precision, recall, and ROC-AUC on validation and test;
(8) persist the trained model and preprocessing steps;
(9) implement batch inference via a predict(input_csv_path, output_csv_path) function or CLI.
If using a neural network, write a correct training loop with optimizer initialization, forward pass, loss computation, backward pass, and an explicit optimizer step. Briefly explain design choices and how you would productionize this pipeline.
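One possible way to cover the splitting, preprocessing, training, tuning, and reporting requirements above is sketched below with scikit-learn. It is a minimal illustration, not a reference solution. Several things are assumptions: the target is binary and encoded as 0/1, the 60/20/20 split ratio, the small hyperparameter grids, and the helper names `split_data`, `make_pipeline`, `evaluate`, and `train_and_compare`, none of which come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def split_data(df, target, seed=0):
    """Stratified 60/20/20 train/validation/test split (ratios are an assumption)."""
    X, y = df.drop(columns=[target]), df[target]
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # 0.25 of the remaining 80% gives 20% of the full data for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def make_pipeline(X, model):
    """Keep imputation, encoding, and scaling inside the pipeline so they are
    fit on training folds only and travel with the model when it is saved."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("encode", OneHotEncoder(handle_unknown="ignore"))])
    pre = ColumnTransformer([("num", numeric, num_cols),
                             ("cat", categorical, cat_cols)])
    return Pipeline([("pre", pre), ("model", model)])


def evaluate(model, X, y):
    """Accuracy, precision, recall, and ROC-AUC for a binary 0/1 target."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]
    return {"accuracy": accuracy_score(y, pred),
            "precision": precision_score(y, pred, zero_division=0),
            "recall": recall_score(y, pred, zero_division=0),
            "roc_auc": roc_auc_score(y, proba)}


def train_and_compare(df, target, seed=0):
    """Tune a baseline and a stronger model with 3-fold CV on the training
    split, then report metrics on the validation and test splits."""
    train, val, test = split_data(df, target, seed)
    candidates = {
        "logreg": (LogisticRegression(max_iter=1000),
                   {"model__C": [0.1, 1.0, 10.0]}),
        "gboost": (GradientBoostingClassifier(random_state=seed),
                   {"model__n_estimators": [50, 100],
                    "model__max_depth": [2, 3]}),
    }
    fitted, report = {}, {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(make_pipeline(train[0], model), grid,
                              cv=3, scoring="roc_auc")
        search.fit(*train)
        fitted[name] = search.best_estimator_
        report[name] = {"val": evaluate(search.best_estimator_, *val),
                        "test": evaluate(search.best_estimator_, *test)}
    return fitted, report
```

A call such as `fitted, report = train_and_compare(pd.read_csv("data.csv"), "label")` (file and column names hypothetical) would return the tuned pipelines and a nested dictionary of metrics. Wrapping preprocessing and model in a single `Pipeline` is one common design choice here, since it prevents statistics from the validation folds leaking into imputation and scaling.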
Quick Answer: This question evaluates a candidate's competency in building end-to-end tabular classification pipelines, including data loading and splitting, missing-value handling, categorical encoding, feature scaling, model training and comparison, hyperparameter tuning, metric-based evaluation, model persistence, and batch inference.
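For the persistence and batch-inference parts, a minimal sketch might look like the following. It assumes the fitted object is a scikit-learn pipeline that bundles preprocessing with a binary classifier exposing `predict_proba`, and that the input CSV holds the same feature columns used in training. The `save_pipeline` helper, the optional `model_path` argument, and the `prediction` and `score` output column names are hypothetical additions, not part of the question's `predict(input_csv_path, output_csv_path)` signature.

```python
import joblib
import pandas as pd


def save_pipeline(pipeline, model_path):
    """Persist preprocessing and model together as a single artifact."""
    joblib.dump(pipeline, model_path)


def predict(input_csv_path, output_csv_path, model_path="model.joblib"):
    """Score every row of the input CSV and write the results to the output CSV.

    model_path is a hypothetical extra argument with a default value, so the
    two-argument call from the question still works.
    """
    pipeline = joblib.load(model_path)
    features = pd.read_csv(input_csv_path)
    out = features.copy()
    out["prediction"] = pipeline.predict(features)
    # Probability of the positive class (second column for a binary target).
    out["score"] = pipeline.predict_proba(features)[:, 1]
    out.to_csv(output_csv_path, index=False)
    return out
```

Saving the whole pipeline rather than the bare model means the imputation, encoding, and scaling learned at training time are reapplied unchanged at inference. In a production setting one would likely also version the artifact and validate the input schema before scoring, though neither is shown here.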