Build a Reproducible End-to-End Binary Classification Pipeline for a CSV Dataset
Context
You are given a single CSV file that fits in memory. The dataset contains:
- Mixed numeric and categorical feature columns
- A binary label/target column
- Optional ID columns to drop
Your task is to implement a modular, reproducible, and command-line runnable training pipeline that prevents data leakage and produces a test-set report. Use Python with common ML tooling.
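A minimal sketch of what "modular and command-line runnable" could look like is shown below; the file name `train.py`, the flag names, and the default seed are illustrative assumptions, not part of the task:

```python
# train.py -- illustrative entry-point skeleton; file, flag, and function names are assumptions
import argparse
import random

import numpy as np
import pandas as pd


def set_seed(seed: int) -> None:
    # Seed every random number generator the pipeline touches so runs repeat exactly.
    random.seed(seed)
    np.random.seed(seed)


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a binary classifier on a CSV dataset.")
    parser.add_argument("--csv", required=True, help="Path to the input CSV file")
    parser.add_argument("--label", required=True, help="Name of the binary label column")
    parser.add_argument("--drop-cols", nargs="*", default=[], help="Optional ID columns to drop")
    parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
    args = parser.parse_args()

    set_seed(args.seed)
    df = pd.read_csv(args.csv)
    df = df.drop(columns=args.drop_cols, errors="ignore")
    # Schema validation, splitting, training, tuning, and reporting would follow here.


if __name__ == "__main__":
    main()
```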
Requirements
- Load data from CSV
- Validate basic schema (label present and binary, dtypes make sense, optional ID columns dropped)
- Handle missing values appropriately for numeric and categorical features
- Split into train/validation/test with stratification on the label (see the split sketch after this list)
- Encode categorical variables and scale/normalize numeric features
- Train a reasonable baseline classifier
- Choose an appropriate loss and optimizer (justify the selection)
- Track metrics (at least AUC and accuracy)
- Prevent leakage by ensuring preprocessing is fit only on training folds within a pipeline (see the pipeline sketch below)
- Evaluate on the validation set (see the metrics sketch below)
- Perform simple hyperparameter tuning (e.g., grid or random search)
- Retrain the best model on train+validation and generate a test-set report (see the tuning sketch below)
- Make the pipeline reproducible (random seeds, configuration)
- Structure the code to be modular and runnable from the command line
- Explain how you would detect and fix common bugs (e.g., an optimizer that is not stepping) and how to monitor training and overfitting
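The split requirement can be met with two chained `train_test_split` calls. A minimal sketch, assuming `df` from the loading step above, an assumed label column name, and an illustrative 60/20/20 ratio:

```python
from sklearn.model_selection import train_test_split

seed = 42                 # would come from the CLI/config in practice
label_col = "target"      # assumed label column name; supplied via --label in practice
X = df.drop(columns=[label_col])
y = df[label_col]

# Carve off the test set first, then split the remainder into train/validation.
# stratify=y keeps the class ratio roughly constant across all three sets.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=seed
)
# 0.25 of the remaining 80% is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=seed
)
```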
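For missing values, encoding, scaling, the baseline model, and leakage prevention, one common shape is a scikit-learn `Pipeline` wrapping a `ColumnTransformer`, so every preprocessing step is fit only on the data the pipeline itself is fit on. Selecting columns by dtype and using logistic regression (whose log-loss objective and default LBFGS solver are an easily justified pairing for a probabilistic binary baseline) are choices assumed here for illustration, not mandated by the task:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = X_train.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X_train.columns if c not in numeric_cols]

# Numeric features: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: most-frequent imputation, then one-hot encoding.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Because preprocessing lives inside the pipeline, cross-validation and grid
# search refit the imputers/scaler/encoder on each training fold only,
# which is exactly what prevents leakage.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
])
model.fit(X_train, y_train)
```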
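Metric tracking, validation-set evaluation, and a basic overfitting check can share one sketch: compute AUC and accuracy on both training and validation data and watch the gap. With a gradient-based framework you would additionally confirm that the loss decreases and that parameters actually change between steps (the classic symptom of an optimizer that is not stepping); with scikit-learn the train/validation gap is the main monitor. Assumes the fitted `model` and the splits from the sketches above:

```python
from sklearn.metrics import accuracy_score, roc_auc_score


def report_metrics(name, X_part, y_part):
    # AUC uses probability scores; accuracy uses hard class predictions.
    proba = model.predict_proba(X_part)[:, 1]
    pred = model.predict(X_part)
    auc = roc_auc_score(y_part, proba)
    acc = accuracy_score(y_part, pred)
    print(f"{name:<10s} AUC={auc:.4f}  accuracy={acc:.4f}")
    return auc


train_auc = report_metrics("train", X_train, y_train)
val_auc = report_metrics("validation", X_val, y_val)

# A large gap (train AUC near 1.0, validation AUC much lower) suggests
# overfitting; consider stronger regularization or simpler features.
print(f"train/validation AUC gap: {train_auc - val_auc:.4f}")
```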
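Hyperparameter tuning, retraining on train+validation, and the test-set report could then look like the sketch below. The grid over the regularization strength `C` and the use of `GridSearchCV` (which refits the whole pipeline inside each fold) are assumptions; random search would work the same way:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Parameters are addressed through the pipeline step name ("clf").
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# Refit the best configuration on train + validation, then touch the test set
# exactly once for the final report.
best = search.best_estimator_
best.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

test_proba = best.predict_proba(X_test)[:, 1]
test_pred = best.predict(X_test)
print(f"test AUC:      {roc_auc_score(y_test, test_proba):.4f}")
print(f"test accuracy: {accuracy_score(y_test, test_pred):.4f}")
print(classification_report(y_test, test_pred))
```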
Assumptions
- The CSV fits in memory
- The label column name is provided via the CLI
- If the label is not numeric 0/1, you may encode it to 0/1 (see the sketch after this list)
- Use Python, pandas, and scikit-learn; keep dependencies minimal
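If the label arrives as something other than 0/1 (e.g., "yes"/"no"), a small validation-and-encoding helper is enough. This sketch, including the function name and the sorted-order mapping, is one assumed way to satisfy the requirement, not the only one:

```python
import pandas as pd


def validate_and_encode_label(df: pd.DataFrame, label_col: str):
    # Fail fast if the label column is missing or not binary.
    if label_col not in df.columns:
        raise ValueError(f"label column {label_col!r} not found in CSV")
    values = df[label_col].dropna().unique()
    if len(values) != 2:
        raise ValueError(f"label column must be binary, found values: {list(values)}")
    # Map the two observed values to 0/1 deterministically (sorted order)
    # so repeated runs produce the same encoding.
    mapping = {v: i for i, v in enumerate(sorted(values))}
    df[label_col] = df[label_col].map(mapping)
    return df, mapping
```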