Build an end-to-end ML pipeline

This is an ML System Design interview question from Amazon for Machine Learning Engineer roles.

Question

ML System Design: Shipment Delay Risk Scoring From a Single CSV

You are given a CSV of shipment events with the following columns:

order_id (string)
origin (string)
destination (string)
ship_date (string/datetime)
promised_date (string/datetime)
carrier (string)
weight (float)
item_count (int)
scan_events (JSON array encoded as string; each element typically has a timestamp and status)
delivered_date (string/datetime; may be null if undelivered)
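One way to handle the scan_events column and mixed time zones is sketched below. It assumes each event is a JSON object with an ISO-8601 value under a "timestamp" key (the question only says elements "typically" have a timestamp and status), and that timestamps without an offset are UTC. The helper names parse_ts and dwell_hours are hypothetical, chosen for illustration.

```python
import json
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO-8601 string into a UTC-aware datetime, or None if invalid.

    Assumption: timestamps with no offset are treated as UTC.
    """
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so normalize it.
        dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def dwell_hours(scan_events_raw):
    """Return hours between consecutive scans, ordered by time.

    Returns [] when the field is missing, malformed, or has fewer than two
    usable timestamps. The "timestamp" key is an assumed field name.
    """
    try:
        events = json.loads(scan_events_raw)
    except (TypeError, ValueError):
        return []
    if not isinstance(events, list):
        return []
    times = sorted(
        t
        for t in (parse_ts(e.get("timestamp")) for e in events if isinstance(e, dict))
        if t is not None
    )
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
```

Summary features such as the maximum or mean dwell per shipment can then be derived from the returned list. Returning an empty list for bad input keeps malformed rows from crashing the load step; whether to drop or flag those rows is a separate validation decision.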

Build a Python pipeline from scratch that:

Loads and validates data, handling missing values, outliers, and time zones.
Creates features (e.g., day-of-week, route, carrier stats via target encoding, and dwell times from scan_events).
Labels examples as delayed if delivered_date − promised_date > 48 hours. Justify and implement how you handle undelivered items and censoring.
Trains a baseline model (logistic regression or gradient-boosted trees) with cross-validation; reports ROC-AUC and PR-AUC; addresses class imbalance.
Calibrates probabilities and explains top features.
Outputs a CSV of top-K at-risk shipments with calibrated probabilities and reason codes.
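The labeling step, including undelivered shipments, could be sketched as follows. This is one possible treatment of censoring, not a prescribed answer: an undelivered shipment that is already more than 48 hours past its promised date as of the data snapshot is certain to be delayed, so it is labeled positive, while an undelivered shipment still inside the window has an unknown outcome and is excluded from training. The function name label_delay and the as_of parameter (the snapshot time) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DELAY_THRESHOLD = timedelta(hours=48)


def label_delay(promised, delivered, as_of):
    """Return 1 (delayed), 0 (on time), or None (censored / unusable).

    All arguments are timezone-aware datetimes; delivered may be None.
    as_of is the time the CSV snapshot was taken (an assumed input).
    """
    if promised is None:
        return None  # cannot define the label without a promised date
    if delivered is not None:
        # Rule from the question: delayed if delivery exceeds promise by > 48 h.
        return int(delivered - promised > DELAY_THRESHOLD)
    if as_of - promised > DELAY_THRESHOLD:
        # Still undelivered and already past the threshold: delay is certain.
        return 1
    # Undelivered but still within the window: outcome unknown, so censored.
    return None
```

Rows labeled None can still be scored at inference time, which is typically where the top-K at-risk output comes from. Dropping them from training biases the sample toward older shipments, so the choice is worth stating explicitly in the justification.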

Constraints:

Optimize for runtime < 5 minutes on 1M rows and memory < 4 GB on CPU.
Discuss strategies to speed up training/inference and ensure reproducibility.
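For reproducibility, one option is to make cross-validation fold assignment a pure function of order_id rather than of row order or a random state. The sketch below is illustrative only; fold_for, the salt value, and the default of five folds are assumptions, not part of the question.

```python
import hashlib


def fold_for(order_id, n_folds=5, salt="v1"):
    """Map an order_id to a stable fold index in [0, n_folds).

    Uses SHA-256 instead of Python's built-in hash(), which is randomized
    per process for strings. Changing the salt produces a different split.
    """
    digest = hashlib.sha256(f"{salt}:{order_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds
```

Because the fold depends only on the identifier, the same shipment lands in the same fold across runs, machines, and chunked reads of the CSV, and out-of-fold statistics such as target encodings can be recomputed deterministically.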
