ML System Design: Shipment Delay Risk Scoring From a Single CSV
You are given a CSV of shipment events with the following columns:
- order_id (string)
- origin (string)
- destination (string)
- ship_date (string/datetime)
- promised_date (string/datetime)
- carrier (string)
- weight (float)
- item_count (int)
- scan_events (JSON array encoded as string; each element typically has a timestamp and status)
- delivered_date (string/datetime; may be null if undelivered)
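As a hint for the scan_events column, dwell times can be derived by parsing the JSON string, normalizing timestamps to UTC, and taking gaps between consecutive scans. The sketch below is one possible approach, not a reference solution. It assumes each element uses the keys "timestamp" (ISO 8601) and "status", and that timestamps without an offset are UTC; neither is guaranteed by the schema above. It uses only the standard library to stay self-contained.

```python
import json
from datetime import datetime, timezone


def parse_scans(raw):
    """Return a time-sorted list of (utc_datetime, status) tuples.

    Malformed JSON, non-list payloads, and unparseable timestamps are skipped
    rather than raising, so one bad row cannot abort the whole load.
    """
    try:
        events = json.loads(raw) if raw else []
    except (TypeError, ValueError):
        return []
    if not isinstance(events, list):
        return []
    parsed = []
    for ev in events:
        if not isinstance(ev, dict):
            continue
        ts = ev.get("timestamp")  # assumed key name
        if not isinstance(ts, str):
            continue
        try:
            # Older Python versions do not accept a trailing "Z".
            dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        parsed.append((dt.astimezone(timezone.utc), ev.get("status")))
    parsed.sort(key=lambda pair: pair[0])
    return parsed


def dwell_features(raw):
    """Summarize gaps (in hours) between consecutive scans."""
    scans = parse_scans(raw)
    gaps = [
        (later[0] - earlier[0]).total_seconds() / 3600.0
        for earlier, later in zip(scans, scans[1:])
    ]
    return {
        "n_scans": len(scans),
        "max_dwell_h": max(gaps) if gaps else None,
        "total_dwell_h": sum(gaps) if gaps else None,
    }
```

When these features feed a risk model, only scans recorded before the scoring time should be included; later scans would leak the outcome.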
Build a Python pipeline from scratch that:
- Loads and validates data, handling missing values, outliers, and time zones.
- Creates features (e.g., day-of-week, route, carrier stats via target encoding, and dwell times from scan_events).
- Labels examples as delayed if delivered_date − promised_date > 48 hours. Justify and implement how you handle undelivered items and censoring.
- Trains a baseline model (logistic regression or gradient-boosted trees) with cross-validation; reports ROC-AUC and PR-AUC; addresses class imbalance.
- Calibrates probabilities and explains top features.
- Outputs a CSV of top-K at-risk shipments with calibrated probabilities and reason codes.
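The labeling step is where censoring matters most. One defensible treatment, sketched below under stated assumptions, is: delivered shipments get a label from the 48-hour rule; undelivered shipments that are already more than 48 hours past promised_date are labeled delayed, since any eventual delivery must exceed the threshold; all other undelivered shipments are censored and get no label. The `as_of` argument (the data snapshot time) and the function name are hypothetical, and all datetimes are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

DELAY_THRESHOLD = timedelta(hours=48)


def label_shipment(promised, delivered, as_of):
    """Return 1 (delayed), 0 (not delayed), or None (censored/unknown).

    promised, delivered, as_of: timezone-aware datetimes; delivered may be None.
    """
    if promised is None:
        return None  # no promise date means the label is undefined
    if delivered is not None:
        return int(delivered - promised > DELAY_THRESHOLD)
    if as_of - promised > DELAY_THRESHOLD:
        return 1  # still undelivered and already past the threshold
    return None  # outcome not yet observable: exclude from training


if __name__ == "__main__":
    utc = timezone.utc
    promised = datetime(2024, 1, 10, tzinfo=utc)
    snapshot = datetime(2024, 1, 11, tzinfo=utc)
    print(label_shipment(promised, None, snapshot))
```

Censored rows can still be scored at inference time; they are only excluded from training and evaluation. Because recent shipments can only be observed as on time or censored, dropping the censored ones skews the recent window toward the negative class. Restricting training to shipments whose promised_date is more than 48 hours before `as_of` avoids this, since every such shipment has a determinate label under the rule above.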
Constraints:
- Optimize for runtime < 5 minutes on 1M rows and memory < 4 GB on CPU.
- Discuss strategies to speed up training/inference and ensure reproducibility.
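On reproducibility, one small technique worth discussing is deterministic cross-validation fold assignment. Hashing order_id with a fixed salt gives folds that do not depend on row order, shuffling, or the process, whereas Python's built-in `hash()` for strings is randomized per process by default. The sketch below is illustrative only; `stable_fold` and the salt value are hypothetical names, not part of the problem statement.

```python
import hashlib


def stable_fold(order_id, n_folds=5, salt="cv-v1"):
    """Map an order_id to a fold index in [0, n_folds), identically on every run."""
    digest = hashlib.sha256(f"{salt}:{order_id}".encode("utf-8")).digest()
    # The first 8 bytes give a 64-bit integer, ample for a roughly uniform split.
    return int.from_bytes(digest[:8], "big") % n_folds


if __name__ == "__main__":
    for oid in ["ORD-0001", "ORD-0002", "ORD-0003"]:
        print(oid, stable_fold(oid))
```

Changing the salt produces a fresh, equally reproducible split. The same fold index can also drive out-of-fold target encoding, so encodings and model validation share one partition.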