Build an end-to-end ML pipeline
Company: Amazon
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
Given a CSV with shipment events (order_id, origin, destination, ship_date, promised_date, carrier, weight, item_count, scan_events[], delivered_date), build from scratch a Python pipeline that:
(1) loads and validates data; handles missing values, outliers, and time zones;
(2) creates features (e.g., day-of-week, route, carrier stats, dwell times from scan_events);
(3) labels examples as delayed if delivered_date − promised_date > 48 hours (justify how you handle undelivered items and censoring);
(4) trains a baseline model (logistic regression or gradient-boosted trees) with cross-validation; reports ROC-AUC and PR-AUC; addresses class imbalance;
(5) calibrates probabilities and explains top features;
(6) outputs a CSV of top-K at-risk shipments with calibrated probabilities and reason codes.
Optimize for runtime < 5 minutes on 1M rows and memory < 4 GB, and discuss strategies to speed up training/inference and ensure reproducibility.
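The labeling rule in step 3 is where undelivered shipments and censoring have to be handled explicitly. The following is a minimal sketch of one reasonable policy, assuming pandas and ISO 8601 timestamps with UTC offsets. The names label_delays, as_of, and DELAY_THRESHOLD are hypothetical and do not come from the prompt. The policy itself is also an assumption: an undelivered shipment already past its 48-hour deadline is labeled delayed, and an undelivered shipment still inside the window is treated as censored and left unlabeled.

```python
import pandas as pd

# Threshold taken from the prompt: delayed if delivered more than 48 hours after promise.
DELAY_THRESHOLD = pd.Timedelta(hours=48)


def label_delays(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Add a nullable `delayed` column (1, 0, or <NA> for censored rows).

    `as_of` is the (timezone-aware) time the data snapshot was taken.
    """
    out = df.copy()
    # Normalize every timestamp to UTC; missing or unparseable values become NaT.
    for col in ("promised_date", "delivered_date"):
        out[col] = pd.to_datetime(out[col], utc=True, errors="coerce")

    deadline = out["promised_date"] + DELAY_THRESHOLD
    has_promise = out["promised_date"].notna()
    delivered = out["delivered_date"].notna()

    # Delivered rows: compare the actual delivery time with the deadline.
    label = (out["delivered_date"] > deadline).astype("Int8")
    label = label.where(delivered & has_promise)  # everything else starts as <NA>

    # Undelivered rows already past the deadline can only end up delayed.
    overdue = ~delivered & has_promise & (deadline < as_of)
    label = label.mask(overdue, 1)

    # Undelivered rows still inside the window stay <NA> (censored).
    out["delayed"] = label
    return out


# Tiny illustrative frame (made-up values, only to exercise each branch).
events = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "promised_date": [
        "2024-01-01T00:00:00+00:00",
        "2024-01-01T00:00:00-08:00",
        "2024-01-01T00:00:00+00:00",
        "2024-01-09T00:00:00+00:00",
    ],
    "delivered_date": [
        "2024-01-02T00:00:00+00:00",  # on time
        "2024-01-04T00:00:00+00:00",  # more than 48 h late
        None,                         # undelivered, deadline passed
        None,                         # undelivered, still in window
    ],
})
labeled = label_delays(events, as_of=pd.Timestamp("2024-01-10", tz="UTC"))
train_rows = labeled.dropna(subset=["delayed"])  # censored rows are excluded from training
```

Dropping censored rows keeps the labels clean, but it can bias the training set toward older shipments. It is worth stating that trade-off in the interview and mentioning alternatives such as restricting training to shipments whose deadline predates the snapshot.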
Quick Answer: This question evaluates how you handle ML product requirements, data and labeling, modeling, serving architecture, evaluation, monitoring, and trade-offs in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how to validate the result.
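To show how the result could be validated, here is a minimal sketch of steps 4 to 6 with scikit-learn: a class-weighted logistic regression scored by stratified cross-validation on ROC-AUC and average precision (a common summary of the PR curve), followed by sigmoid calibration and a top-K ranking. The synthetic data from make_classification is only a stand-in for the engineered shipment features, and SEED, the sample size, and K = 10 are illustrative assumptions rather than values from the prompt.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # fixed seed for reproducibility (assumed value)

# Stand-in for the real feature matrix: roughly 10% positives to mimic class imbalance.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=SEED
)

# Baseline: scaling + logistic regression; class_weight="balanced" reweights the rare class.
base = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the positive rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_validate(base, X, y, cv=cv, scoring=["roc_auc", "average_precision"])
roc_auc = scores["test_roc_auc"].mean()
pr_auc = scores["test_average_precision"].mean()

# Class weighting distorts raw probabilities, so calibrate with Platt (sigmoid) scaling.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]

# Indices of the K shipments with the highest calibrated delay probability.
K = 10
top_k = np.argsort(-proba)[:K]
```

With real shipment data, a time-based split (train on earlier ship dates, validate on later ones) is usually a safer choice than shuffled folds, because shuffling can leak future carrier or route behavior into training.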