You are given several CSVs for the classic airline delay challenge with columns like flight_date, carrier, flight_num, origin, dest, sched_dep, sched_arr, dep_delay_min, arr_delay_min, distance, aircraft_type, weather_features_*, and holiday_flag.
- a) Define a binary target and justify it: e.g., late_arrival = arr_delay_min > 15.
- b) Detail a leakage-aware feature set: include weather forecasts at origin/dest, route history aggregates up to t−7 days, time-of-day, day-of-week, month, distance, carrier- and airport-level rolling stats; exclude or properly lag any features that encode future information (e.g., actual arrival times).
- c) Specify a time-based split (e.g., train up to 2024-06, validate 2024-07–2024-09, test 2024-10–2025-03), class imbalance handling, and primary metrics (PR-AUC, calibrated Brier).
- d) Compare a strong baseline (regularized logistic regression with target encoding) versus gradient boosting (e.g., XGBoost/LightGBM): hyperparameters to search, early stopping, monotonic constraints if used.
- e) Explain how you would do rolling-origin cross-validation and backtesting of threshold policies (e.g., proactive swaps or buffers) with cost-sensitive evaluation that prices false negatives at 5× false positives.
- f) Productionization: 20 ms/flight latency budget, 50 MB model size, feature store vs on-the-fly aggregation, drift detection, and periodic retraining cadence.
- g) Deliverables: reproducible notebook, clean data pipeline, model cards with fairness slices across carriers/airports, and an exec summary with recommended operational policy and estimated ROI.
Quick Answer: This question evaluates a data scientist's machine learning competencies including target definition, leakage-aware feature engineering, temporal splitting and backtesting, model comparison and hyperparameter tuning, cost-sensitive evaluation, and production constraints such as latency, model size, monitoring, and retraining.
Solution
### Framing the problem
The business goal drives everything: predict, *before departure*, whether a flight will arrive meaningfully late, so operations can act (rebook crews, pre-position aircraft, warn passengers, add buffer). That framing forces two disciplines throughout: the model may only use information knowable at *prediction time* (a fixed horizon before scheduled departure), and the evaluation must mimic how the model would have been used historically. Both points are where most airline-delay solutions quietly break via leakage, so I treat leakage as the central risk, not an afterthought.
I'll assume the prediction is made at a fixed cutoff — e.g., **2 hours before scheduled departure** — and freeze "what is knowable" to that instant.
---
### a) Target definition and justification
**Primary target:** `late_arrival = (arr_delay_min > 15)`, a binary label.
Justification:
- **Operationally meaningful, not arbitrary.** The 15-minute threshold is the long-standing industry convention for an "on-time" arrival (the DOT/BTS on-time definition), so the label maps to a metric the business already reports and is held to. Predicting it produces something stakeholders can act on and benchmark.
- **Binary beats raw regression for the use case.** We could regress `arr_delay_min` directly, but the decision (intervene or not) is a threshold decision, and the delay distribution is heavy-tailed and zero-inflated (most flights are on time, a long right tail of severe delays). A classifier calibrated to "P(late)" is easier to threshold against a cost policy than a noisy minute-level regression.
- **Watch the label's edge cases.** `arr_delay_min` must be the *actual* realized arrival delay used only for labeling, never as a feature. Cancelled/diverted flights have no `arr_delay_min`; decide explicitly — typically treat a cancellation as a positive ("did not arrive on time") if the downstream cost is similar, or model cancellation separately. Document the choice; silently dropping cancellations biases the label toward optimism.
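The labeling rule and the cancellation decision can be made explicit in a few lines. This is a minimal sketch, assuming a missing `arr_delay_min` (represented here as `None`) means the flight was cancelled or diverted; the function name `late_arrival` used as a callable and the `missing_policy` argument are illustrative, not part of the dataset.

```python
from typing import Optional

LATE_THRESHOLD_MIN = 15  # the brief's rule: late means arr_delay_min > 15


def late_arrival(arr_delay_min: Optional[float],
                 missing_policy: str = "positive") -> Optional[int]:
    """Return 1 (late), 0 (on time), or None (row excluded from this target).

    Assumption: a missing arr_delay_min means cancelled or diverted.
    """
    if arr_delay_min is None:
        if missing_policy == "positive":
            return 1      # "did not arrive on time" counts as late
        if missing_policy == "exclude":
            return None   # cancellations are modeled separately
        raise ValueError(f"unknown missing_policy: {missing_policy!r}")
    return int(arr_delay_min > LATE_THRESHOLD_MIN)


delays = [-4.0, 15.0, 16.0, None]
labels = [late_arrival(d) for d in delays]
print(labels)  # [0, 0, 1, 1]
```

Note that exactly 15 minutes is on time under the strict inequality, and that the policy for missing delays is a visible argument instead of a silent row drop.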
**Secondary targets worth defining for richer policy work (optional, mention but don't over-build):**
- A multi-class / ordinal version: on-time / minor (15–60 min) / major (>60 min), because the *cost* of a delay is non-linear.
- A regression head on `arr_delay_min` (e.g., quantile regression at the 0.5/0.9 quantiles) if ops wants an expected-buffer estimate, not just a flag.
I'd lead with the binary classifier and keep the ordinal/quantile variant as a stretch deliverable.
---
### b) Leakage-aware feature set
The governing rule: **every feature must be reconstructable from data available at the prediction cutoff (2h pre-departure).** I group features by source and explicitly state the lag.
**Schedule / static (known at booking time — safe):**
- `distance`, `aircraft_type`, `carrier`, `origin`, `dest`, scheduled block time = `sched_arr − sched_dep`.
- Time encodings: hour-of-day of `sched_dep` (cyclical sin/cos), `day_of_week`, `month`, `holiday_flag`, and a "day before/after holiday" flag. Cyclical encoding avoids the artificial 23→0 discontinuity.
- Route = `(origin, dest)` and directional flag.
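As a small illustration of the cyclical encoding mentioned above, the sketch below maps the scheduled departure hour onto the unit circle. The helper names `encode_hour` and `gap` are hypothetical.

```python
import math
from typing import Tuple


def encode_hour(hour: float) -> Tuple[float, float]:
    """Map an hour in [0, 24) to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)


def gap(h1: float, h2: float) -> float:
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(h1), encode_hour(h2))


# 23:00 -> 00:00 is now the same size step as 00:00 -> 01:00.
print(round(gap(23, 0), 6) == round(gap(0, 1), 6))  # True
```

With the raw integer hour, 23 and 0 would sit 23 units apart; on the circle they are neighbors, which matters most for linear models such as the logistic regression baseline.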
**Weather — forecasts only, never actuals.** Use the *forecast* for the departure/arrival window issued *before* the cutoff (this is what `weather_features_*` should represent at serving time). Using realized weather at the actual arrival time is leakage. Concretely: forecasted precipitation, wind, ceiling/visibility, convective probability at origin and dest for the scheduled departure and arrival hours.
**Historical aggregates — strictly lagged to `t − 7 days` or earlier:**
- Route-level: mean/median `arr_delay_min`, P(late), and variance over the route's flights in a trailing window (e.g., 7 or 28 days) whose most recent day is `t−7`.
- Carrier-level and airport-level rolling delay rates (origin departure-delay rate, dest arrival-delay rate) over trailing windows.
- Carrier×airport and aircraft-type rolling stats for fleet/station effects.
- **Critical leakage trap:** these aggregates must be computed with an *expanding/rolling window that ends strictly before the row's own date*, not over the whole training set. Computing a route's mean delay over all dates (including future) and joining it back is the most common leakage bug here and inflates offline metrics by a lot.
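The causal window can be sketched as follows. This is an illustrative pure-Python version, assuming history rows of the form `(flight_date, route, late_flag)`; the function name, the tuple layout, and the sample airports are made up for the example, and a real pipeline would more likely do this with a dataframe library.

```python
from datetime import date, timedelta


def lagged_late_rate(history, route, as_of, lag_days=7, window_days=28):
    """Late rate for `route` over a window whose newest day is as_of - lag_days.

    Rows dated after as_of - lag_days are never read, so the feature for a
    flight on `as_of` cannot see its own day or anything later.
    """
    window_end = as_of - timedelta(days=lag_days)            # newest day allowed
    window_start = window_end - timedelta(days=window_days)  # exclusive bound
    flags = [late for d, r, late in history
             if r == route and window_start < d <= window_end]
    if not flags:
        return None  # no history in the window: leave the feature missing
    return sum(flags) / len(flags)


history = [
    (date(2024, 5, 1), ("JFK", "LAX"), 1),
    (date(2024, 5, 10), ("JFK", "LAX"), 0),
    (date(2024, 5, 20), ("JFK", "LAX"), 1),  # inside the 7-day lag: ignored
    (date(2024, 5, 10), ("BOS", "ORD"), 1),  # different route: ignored
]
rate = lagged_late_rate(history, ("JFK", "LAX"), as_of=date(2024, 5, 22))
print(rate)  # 0.5
```

The leaky alternative would be a single per-route mean over all of `history`, which for the same flight would also include the 2024-05-20 row and anything after it.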
**Same-day upstream propagation (the highest-signal feature, but the trickiest):**
- The single biggest driver of arrival delay is whether the *inbound* aircraft and crew are already late. **Schema caveat:** the listed columns have no tail number / registration, and `aircraft_type` is an aircraft *class* (e.g., B738), not an aircraft identifier — so you cannot link the exact physical inbound airframe from these columns alone. The cleanest proxy the schema *does* support is the prior leg flown under the same `carrier` + `flight_num` earlier the same `flight_date`, or a same-day leg arriving into this flight's `origin` chained by `(carrier, dest=origin, sched_arr ≈ this flight's sched_dep − turnaround)`. Use that proxy, and flag that a true tail-rotation linkage would require adding a registration column to the data.
- Given a linkable prior leg, its *current departure delay* — observed as of the cutoff — is enormously predictive and **legitimate**, because at 2h before departure we genuinely know that earlier leg's status. Include `inbound_dep_delay_so_far`, `inbound_in_air_flag`, `turnaround_buffer = sched_dep − inbound_sched_arr` (all derived from the proxy linkage above).
- If the cutoff is *before* the inbound leg has departed, then you only have its forecast, not its realized delay — encode that honestly (use the inbound's own predicted P(late), or mark as unknown).
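A rough sketch of the proxy linkage and the cutoff logic described in this list. It assumes times are minutes after midnight on the same `flight_date` (legs crossing midnight are ignored), a hypothetical minimum turnaround of 30 minutes, and made-up sample flights. It emits a simpler `inbound_departed` flag in place of `inbound_in_air_flag`, since knowing the inbound has landed would need its arrival status as well.

```python
def find_inbound_proxy(flight, same_day_flights, min_turnaround_min=30):
    """Latest same-carrier leg scheduled into this flight's origin that lands
    at least `min_turnaround_min` before our scheduled departure."""
    candidates = [
        f for f in same_day_flights
        if f["carrier"] == flight["carrier"]
        and f["dest"] == flight["origin"]
        and f["sched_arr"] <= flight["sched_dep"] - min_turnaround_min
    ]
    return max(candidates, key=lambda f: f["sched_arr"], default=None)


def inbound_features(flight, same_day_flights, cutoff_lead_min=120):
    """Inbound-leg features observable at sched_dep - cutoff_lead_min."""
    cutoff = flight["sched_dep"] - cutoff_lead_min
    inbound = find_inbound_proxy(flight, same_day_flights)
    if inbound is None:
        return {"inbound_departed": None, "inbound_dep_delay_so_far": None,
                "turnaround_buffer": None}
    actual_dep = inbound["sched_dep"] + inbound["dep_delay_min"]
    departed = actual_dep <= cutoff
    if departed:
        delay_so_far = inbound["dep_delay_min"]  # realized and already known
    else:
        # Still on the ground at the cutoff: only its overdue time is known.
        delay_so_far = max(0, cutoff - inbound["sched_dep"])
    return {
        "inbound_departed": int(departed),
        "inbound_dep_delay_so_far": delay_so_far,
        "turnaround_buffer": flight["sched_dep"] - inbound["sched_arr"],
    }


flight = {"carrier": "AA", "origin": "ORD", "dest": "LAX",
          "sched_dep": 840, "sched_arr": 1100}
others = [
    {"carrier": "AA", "origin": "BOS", "dest": "ORD",
     "sched_dep": 600, "sched_arr": 780, "dep_delay_min": 45},
    {"carrier": "AA", "origin": "MIA", "dest": "ORD",
     "sched_dep": 500, "sched_arr": 700, "dep_delay_min": 0},
    {"carrier": "UA", "origin": "DEN", "dest": "ORD",
     "sched_dep": 650, "sched_arr": 800, "dep_delay_min": 90},
]
feats = inbound_features(flight, others)
print(feats)
```

The key design point is the `departed` branch: the realized `dep_delay_min` of the inbound leg is read only when that leg had actually left by the cutoff, so the historical feature matches what would have been visible live.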
**Explicitly excluded (future-encoding) features:**
- `arr_delay_min` (the label), actual arrival time, actual taxi/airborne times, realized en-route weather, `dep_delay_min` of *this same flight* if the cutoff precedes departure (we don't yet know it). If the cutoff is *after* pushback you could use realized `dep_delay_min`, but then state that explicitly — it changes the product.
**Encoding plan:** high-cardinality categoricals (`origin`, `dest`, `carrier`, route, aircraft_type) via **target/mean encoding computed inside the CV fold with smoothing and out-of-fold predictions** to avoid target leakage; or native categorical handling for the GBM (LightGBM/XGBoost). Never fit the target encoder on the full data before splitting.
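To make the out-of-fold idea concrete, here is a minimal sketch of smoothed target encoding in plain Python. The smoothing weight `m`, the function names, and the toy data are assumptions for illustration. It uses ordinary out-of-fold fitting; a stricter temporal variant would fit each fold's table on earlier folds only.

```python
from collections import defaultdict


def fit_encoding(cats, y, m=20.0):
    """Smoothed per-category mean: (sum_y + m * prior) / (count + m)."""
    prior = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, label in zip(cats, y):
        sums[c] += label
        counts[c] += 1
    table = {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}
    return table, prior


def oof_target_encode(cats, y, fold_ids, m=20.0):
    """Encode each row with a table fit on the other folds only, so a row's
    own label never contributes to its encoded value."""
    encoded = [None] * len(cats)
    for fold in sorted(set(fold_ids)):
        fit_idx = [i for i, f in enumerate(fold_ids) if f != fold]
        table, prior = fit_encoding([cats[i] for i in fit_idx],
                                    [y[i] for i in fit_idx], m)
        for i, f in enumerate(fold_ids):
            if f == fold:
                # Unseen categories fall back to the prior of the fit rows.
                encoded[i] = table.get(cats[i], prior)
    return encoded


cats = ["ORD", "ORD", "JFK", "JFK"]
y = [1, 0, 1, 1]
fold_ids = [0, 1, 0, 1]
encoded = oof_target_encode(cats, y, fold_ids, m=1.0)
print(encoded)  # [0.25, 1.0, 0.75, 1.0]
```

The first `ORD` row has label 1 but is encoded as 0.25, because only the other fold's `ORD` row (label 0) and the prior were used. Fitting on all four rows at once would have pulled that value toward the row's own label.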
---
### c) Time-based split, imbalance handling, metrics
**Split — strictly temporal, no shuffling.** Random k-fold is invalid here: it leaks the future into the past and lets near-duplicate same-day flights straddle folds. Use the proposed scheme:
- Train: through 2024-06
- Validation (model selection, early stopping, calibration, threshold): 2024-07 → 2024-09
- Test (reported once, untouched): 2024-10 → 2025-03
Add a small **embargo/gap** (e.g., drop the few days straddling each boundary) so trailing-window features computed near the boundary don't peek across it. Recompute all rolling aggregates *within* each split's own causal window.
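One way to encode the split and the embargo is a single assignment function. This sketch assumes month boundaries at the first of the month and a symmetric embargo of a few days around each boundary; `embargo_days=3` and the function name are illustrative choices.

```python
from datetime import date

VAL_START = date(2024, 7, 1)    # train is everything before this date
TEST_START = date(2024, 10, 1)  # validation runs up to the day before
TEST_END = date(2025, 3, 31)


def assign_split(flight_date, embargo_days=3):
    """Return 'train', 'validation', 'test', 'embargo', or 'out_of_range'."""
    if flight_date > TEST_END:
        return "out_of_range"
    for boundary in (VAL_START, TEST_START):
        # Drop rows closer than embargo_days to a boundary, on either side.
        if abs((flight_date - boundary).days) < embargo_days:
            return "embargo"
    if flight_date < VAL_START:
        return "train"
    if flight_date < TEST_START:
        return "validation"
    return "test"


for d in (date(2024, 6, 15), date(2024, 6, 30), date(2024, 7, 4),
          date(2024, 12, 25)):
    print(d.isoformat(), assign_split(d))
```

Keeping the boundaries as named constants (or config values) makes the split reproducible and keeps it out of ad hoc notebook filters.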
**Class imbalance.** "Late" is the minority (roughly a quarter or less of flights, depending on threshold/season — state qualitatively, don't quote a fixed number). Approach in priority order:
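1. **Do nothing to the data first.** GBMs handle moderate imbalance fine. Set `scale_pos_weight ≈ (#neg / #pos)` (XGBoost) or `is_unbalance`/`class_weight` (LightGBM), or class weights in logistic regression. This reweights the loss without resampling rows, but it still shifts predicted probabilities toward the positive class, so recalibrate on the validation set before thresholding.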
2. **Avoid naive oversampling/SMOTE** for this problem: it distorts calibration (which we care about — see Brier) and SMOTE-ing time-series rows breaks temporal structure. If used, only on the training fold, and recalibrate afterward.
3. **Optimize a threshold on the cost curve, not 0.5** (part e), and **calibrate probabilities** (isotonic or Platt on the validation set) so the cost-sensitive threshold is meaningful.
**Primary metrics:**
- **PR-AUC (average precision)** — the right ranking metric under imbalance; ROC-AUC is overly flattering when negatives dominate. PR-AUC focuses on the positive (late) class we care about.
- **Calibrated Brier score** + a reliability diagram — because the downstream policy uses *probabilities*, not just rankings; a well-ranked but mis-calibrated model makes the cost-optimal threshold wrong.
- **Secondary:** recall at a fixed operational precision (e.g., recall @ precision=0.5), ROC-AUC for continuity with prior reporting, and the **expected cost** per the 5:1 cost ratio (part e) as the ultimate business metric.
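For reference, the two primary metrics are short enough to write out by hand. The sketch below is a simplified version that assumes no tied scores; in practice a library implementation (for example scikit-learn's `average_precision_score` and `brier_score_loss`) would be the usual choice. The toy labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Mean of precision@k taken at each rank k where a positive appears.

    Simplification: assumes no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def brier_score(y_true, probs):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


y_true = [1, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, probs)
bs = brier_score(y_true, probs)
print(round(ap, 4), round(bs, 4))  # 0.8333 0.1875
```

The example shows why both are needed: the ranking is fairly good (one negative is out of order), but the confident 0.8 on a negative row dominates the Brier score.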
---
### d) Baseline (regularized logistic regression) vs gradient boosting
**Baseline — regularized logistic regression with target encoding.**
- Pipeline: out-of-fold smoothed target encoding for high-cardinality categoricals, one-hot for low-cardinality, standardize numerics, `LogisticRegression` with L2 (or elastic-net via `saga`).
- Hyperparameters to search: inverse-reg strength `C` (log-grid, e.g., $10^{-3}$ to $10^{2}$), penalty (L2 vs elastic-net `l1_ratio`), class weight. Small search; LR is cheap.
- Value: fast, fully interpretable coefficients, a calibration-friendly probabilistic baseline, and a sanity floor. If the GBM can't clearly beat a well-tuned LR, something's wrong (often leakage in the GBM, or no real signal).
**Gradient boosting — XGBoost / LightGBM.**
- Why it should win: non-linear interactions (route × weather × time-of-day × inbound delay) are exactly what trees capture, and native categorical/missing handling fits this messy data.
- Hyperparameters to search (Bayesian/Optuna over a temporal validation split, not random CV):
  - `num_leaves` / `max_depth` (capacity), `learning_rate` (small, e.g., 0.03–0.1) paired with **early stopping** on validation PR-AUC/logloss to choose `n_estimators`; never fix the tree count by hand.
  - `min_child_weight` / `min_data_in_leaf` (regularize against tiny noisy leaves), `subsample` (`bagging_fraction`), `colsample_bytree` (`feature_fraction`), `reg_alpha`/`reg_lambda`.
  - `scale_pos_weight` for imbalance.
- **Monotonic constraints — use them deliberately.** Domain priors that hold monotonically: higher forecasted precipitation/wind → higher P(late); larger inbound delay → higher P(late); shorter turnaround buffer → higher P(late). Encoding these as monotone constraints buys robustness, easier stakeholder trust, and guards against weird non-monotone fits in sparse regions. *Don't* constrain features where the relationship is genuinely non-monotone (e.g., hour-of-day).
- **Calibration:** GBMs are often mildly mis-calibrated; fit isotonic regression on the validation set after training, then report Brier on test.
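The monotonic priors above can be turned into a constraint vector aligned with the feature order. The sketch below only builds a parameter dictionary using LightGBM-style parameter names; it does not import or train anything. The feature names and the specific values are hypothetical starting points, not tuned settings.

```python
# Hypothetical feature order; a real model would use its own column list.
FEATURES = [
    "forecast_precip_origin", "forecast_wind_origin",
    "inbound_dep_delay_so_far", "turnaround_buffer",
    "dep_hour_sin", "dep_hour_cos", "distance",
]

# +1: P(late) may only rise with the feature; -1: may only fall.
MONOTONE_PRIORS = {
    "forecast_precip_origin": 1,
    "forecast_wind_origin": 1,
    "inbound_dep_delay_so_far": 1,
    "turnaround_buffer": -1,
}


def monotone_vector(features, priors):
    """One entry per feature, in order; 0 means unconstrained."""
    unknown = set(priors) - set(features)
    if unknown:
        raise ValueError(f"constraints on unknown features: {sorted(unknown)}")
    return [priors.get(name, 0) for name in features]


lgbm_params = {
    "objective": "binary",
    "learning_rate": 0.05,     # small rate, tree count left to early stopping
    "num_leaves": 63,
    "min_data_in_leaf": 200,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,         # bagging_fraction only applies when this is > 0
    "feature_fraction": 0.8,
    "lambda_l1": 0.0,
    "lambda_l2": 1.0,
    "monotone_constraints": monotone_vector(FEATURES, MONOTONE_PRIORS),
}
print(lgbm_params["monotone_constraints"])  # [1, 1, 1, -1, 0, 0, 0]
```

Early stopping is not part of this dictionary; it would be supplied at training time against the temporal validation set. The hour encodings and `distance` are deliberately left at 0, matching the advice not to constrain non-monotone relationships.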
**Decision rule:** pick by validation expected cost (5:1) and Brier, with PR-AUC as a tiebreaker — favor the GBM only if its *cost* advantage survives calibration and holds on the untouched test window. Keep the LR as the interpretable fallback and as a monitored shadow.
---
### e) Rolling-origin CV, backtesting threshold policies, cost-sensitive eval
**Rolling-origin (walk-forward) cross-validation.** Instead of a single train/val cut, slide the origin forward to test stability across regimes (summer thunderstorms vs winter ops). Concretely, several expanding-window folds:
| Fold | Train through | Validate |
|------|---------------|----------|
| 1 | 2024-03 | 2024-04 |
| 2 | 2024-04 | 2024-05 |
| 3 | 2024-05 | 2024-06 |
| … | … | … |
Always train-before-validate, recompute features causally per fold, and report mean ± variance of PR-AUC / cost across folds. High variance flags a model that's fragile to seasonality — important to surface before it hits production.
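The fold schedule in the table can be generated programmatically so the same definition drives both tuning and reporting. A minimal sketch, assuming expanding windows that each validate on the single following month; the function names are illustrative.

```python
def add_months(year, month, k):
    """Shift a (year, month) pair forward by k months."""
    idx = year * 12 + (month - 1) + k
    return idx // 12, idx % 12 + 1


def fmt(year_month):
    year, month = year_month
    return f"{year:04d}-{month:02d}"


def rolling_origin_folds(first_train_end, n_folds):
    """Return (train_through, validate_month) pairs with an expanding window.

    The training start is fixed; only the end of the training window (the
    origin) moves forward by one month per fold.
    """
    folds = []
    for k in range(n_folds):
        train_end = add_months(*first_train_end, k)
        validate = add_months(*train_end, 1)
        folds.append((fmt(train_end), fmt(validate)))
    return folds


for train_through, validate in rolling_origin_folds((2024, 3), 3):
    print(f"train through {train_through}, validate {validate}")
```

Each fold would then filter rows by these month labels, recompute the lagged aggregates and target encodings inside the fold, and record PR-AUC and cost for the summary across folds.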
**Cost-sensitive evaluation.** Define the confusion costs from the brief: a false negative (we said on-time, flight was late → no proactive action, expensive downstream recovery) costs **5×** a false positive (we said late, it wasn't → wasted buffer/swap). Expected cost per flight:
$$\text{Cost} = \frac{c_{FN}\cdot FN + c_{FP}\cdot FP}{N}, \quad c_{FN} = 5\,c_{FP},$$
where $FN$ and $FP$ are error counts and $N$ is the number of flights scored.
The cost-optimal threshold on calibrated probability $p$ follows directly. Acting on a flight costs $(1-p)\,c_{FP}$ in expectation (the action is wasted if the flight turns out on time), while not acting costs $p\,c_{FN}$. Acting is cheaper when $(1-p)\,c_{FP} < p\,c_{FN}$, i.e., when $p > t^{*}$ with
$$t^{*} = \frac{c_{FP}}{c_{FP}+c_{FN}}.$$
Normalizing to $c_{FP}=1,\ c_{FN}=5$ (the brief's 5:1 ratio):
$$t^{*} = \frac{1}{1+5} = \frac{1}{6} \approx 0.167,$$
i.e., intervene whenever $P(\text{late})$ exceeds roughly $0.17$, *not* 0.5. This is exactly why calibration matters: a wrong probability scale moves $t^*$ to the wrong place. (If true costs vary by route/aircraft, make $c_{FN}, c_{FP}$ features of the decision rather than global constants.)
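A short numerical check of the threshold rule, as a sketch with made-up labels and probabilities. It assumes correct decisions cost nothing and normalizes $c_{FP}=1$, $c_{FN}=5$; the function names are illustrative.

```python
def optimal_threshold(c_fp, c_fn):
    """Act when calibrated P(late) exceeds c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)


def cost_per_flight(y_true, probs, threshold, c_fp=1.0, c_fn=5.0):
    """Average misclassification cost when acting on rows with p > threshold."""
    fp = sum(1 for y, p in zip(y_true, probs) if p > threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p <= threshold and y == 1)
    return (c_fp * fp + c_fn * fn) / len(y_true)


y_true = [1, 0, 0, 1, 0]
probs = [0.30, 0.10, 0.20, 0.60, 0.05]

t_star = optimal_threshold(1.0, 5.0)
print(round(t_star, 3))                        # 0.167
print(cost_per_flight(y_true, probs, t_star))  # 0.2
print(cost_per_flight(y_true, probs, 0.5))     # 1.0
```

At the default 0.5 the late flight scored 0.30 is missed and costs 5 units; at $t^{*}$ it is caught at the price of one cheap false alarm.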
**Backtesting threshold policies.** Beyond a single threshold, backtest *operational policies* on held-out months, replaying them as if live:
- **Proactive-swap policy:** above $t^*$, trigger an aircraft/crew swap or rebook hold; price the swap cost vs avoided downstream delay.
- **Buffer policy:** add schedule/turnaround buffer to high-risk flights; price idle-aircraft cost vs delay savings.
- For each policy, on each backtest month compute the realized cost, the number of interventions, and the "regret" vs an oracle. Sweep the threshold to produce a cost-vs-threshold curve and pick the operating point that minimizes expected cost subject to an intervention-budget cap (ops can only swap so many flights/day). Report the policy's cost *delta vs the current/no-model baseline*; that delta is the ROI input for part g.
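The budget-constrained sweep can be sketched as below. This is a simplified illustration: the budget is applied to the whole backtest sample instead of per day, the no-model baseline is assumed to be "never intervene", and the data and names are made up.

```python
def sweep_policy(y_true, probs, thresholds, max_interventions,
                 c_fp=1.0, c_fn=5.0):
    """Lowest-cost threshold among those that respect the intervention cap."""
    best = None
    for t in thresholds:
        acted = [p > t for p in probs]
        n_act = sum(acted)
        if n_act > max_interventions:
            continue  # infeasible: ops cannot handle this many interventions
        fp = sum(1 for a, y in zip(acted, y_true) if a and y == 0)
        fn = sum(1 for a, y in zip(acted, y_true) if not a and y == 1)
        cost = c_fp * fp + c_fn * fn
        if best is None or cost < best["cost"]:
            best = {"threshold": t, "cost": cost, "interventions": n_act}
    return best


def no_model_cost(y_true, c_fn=5.0):
    """Baseline assumed here: never intervene, so every late flight is an FN."""
    return c_fn * sum(y_true)


y_true = [1, 0, 0, 1, 0, 1]
probs = [0.9, 0.2, 0.4, 0.7, 0.1, 0.3]
grid = [0.15, 0.25, 0.5, 0.8]

capped = sweep_policy(y_true, probs, grid, max_interventions=3)
uncapped = sweep_policy(y_true, probs, grid, max_interventions=len(y_true))
print(capped)
print(uncapped)
print(no_model_cost(y_true) - capped["cost"])  # cost delta vs baseline
```

In this toy sample the cap forces a higher threshold than the unconstrained optimum, which is the trade-off the cost-vs-threshold curve is meant to expose.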
---
### f) Productionization
**Latency budget (20 ms/flight).** Easily met by a GBM at inference if feature assembly is fast — the dominant cost is the *feature joins*, not the tree traversal. Keep the model to a few hundred shallow trees; serve with the native predictor (LightGBM/XGBoost C++ or an ONNX/Treelite-compiled tree) which scores a single row in well under a millisecond. Budget the rest for feature lookup.
**Model size (50 MB).** A GBM with a few hundred shallow trees typically serializes to somewhere between a few MB and a few tens of MB, which fits the 50 MB budget. If a tuned model exceeds it, cap `num_leaves`/`n_estimators` and retune. Record the serialized size in the model card.
**Feature store vs on-the-fly.**
- **Precompute the slow, lagged aggregates** (route/carrier/airport rolling stats as of `t−7`) in a **feature store** with a daily batch job, served from a low-latency key-value store keyed by (route, date), (carrier, date), (airport, date). These can't be computed in 20 ms and don't change intra-day.
- **Compute on the fly** the cheap, request-time signals: time encodings, distance, weather *forecast* pulled at request time, and the **inbound-leg status** (which is live and the whole point of predicting at the cutoff).
- **Guarantee train/serve parity:** the same code path computes features offline and online (or generates from one definition), and the offline aggregates respect the exact `t−7` causal cut used in training. Parity bugs here are the production analog of the training-time leakage in (b).
**Drift detection.**
- *Input drift:* PSI / KS tests on key feature distributions (weather forecast distributions, route mix, carrier mix) vs the training reference; alert on shift.
- *Prediction drift:* monitor the distribution of predicted P(late) and the intervention rate.
- *Performance & calibration drift:* once actual `arr_delay_min` lands (label arrives hours later), compute rolling PR-AUC, Brier, and reliability; alert when calibration degrades. Schedule shocks (new routes, schedule changes, irregular ops/IRROPS, weather seasons) are the realistic drift drivers.
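For the input-drift check, PSI over fixed bins is simple to compute. A minimal sketch, assuming bin edges chosen from the training reference and a small floor `eps` to avoid taking the log of zero; the sample values are made up, and alert thresholds are left as a judgment call.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    `edges` are ascending interior bin edges, so len(edges) + 1 bins result.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(1 for e in edges if v >= e)] += 1  # bin index of v
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(a_shares, e_shares))


reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
current = [0.5, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]
edges = [0.25, 0.5, 0.75]

print(psi(reference, reference, edges))  # 0.0
print(psi(reference, current, edges) > 0)  # True
```

The same function can be run per feature on a schedule, comparing the latest serving window against the training reference, with the result logged next to the prediction-drift and calibration monitors.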
**Retraining cadence.** Given strong seasonality and schedule churn, **retrain on a regular cadence (e.g., monthly) with automatic re-calibration**, plus **event-triggered retraining** when drift/calibration alerts fire or after major schedule changes. Always validate a candidate on the most recent untouched window with the rolling-origin protocol and the cost metric before promotion; ship behind a champion/challenger shadow so a bad model never auto-promotes.
---
### g) Deliverables
- **Reproducible notebook + clean pipeline.** One pinned-environment notebook that runs end-to-end from raw CSVs → cleaned, joined dataset → features → model → evaluation, with the data pipeline factored into importable modules (not just notebook cells) so offline and serving share feature code. Deterministic seeds; cached intermediate artifacts; the temporal split and embargo encoded in config, not hard-coded.
- **Model card.** Intended use and prediction cutoff; training window; features and their lags; metrics (PR-AUC, calibrated Brier, expected cost at 5:1) with confidence from the rolling-origin folds; **fairness/robustness slices across carriers and airports** — report per-carrier and per-airport PR-AUC, calibration, and false-negative rate, because a model that's accurate overall but systematically under-predicts delays at small airports or for one carrier creates uneven operational harm. Flag and, if needed, mitigate slices where the FN rate (the costly error) is materially worse.
- **Exec summary.** One page: recommended operating threshold ($\approx 0.17$ on calibrated probability under the stated cost ratio) and recommended policy (e.g., proactive-swap/buffer above threshold subject to a daily intervention budget); the **backtested cost delta vs no-model baseline** translated into estimated annual savings (state the ROI as a transparent function of the cost assumptions $c_{FN}, c_{FP}$ and intervention volume, not a single fabricated number); limitations (weather-forecast quality, IRROPS regimes the model wasn't trained on, fairness slices to watch); and a monitoring/retraining plan.
---
### What a strong answer demonstrates (and pitfalls)
- **Leakage discipline is the whole game** here: anchor every feature to a fixed prediction cutoff, lag aggregates causally, compute target encodings out-of-fold, split temporally with an embargo. Most weak answers leak via whole-dataset aggregates or random CV and report unrealistically good metrics.
- **Calibration + cost-sensitivity, not accuracy.** The decision is a 5:1 cost threshold on a *probability*; accuracy/ROC-AUC alone are the wrong scoreboard.
- **The inbound-aircraft/propagation feature** is the highest-leverage signal and the cleanest test of whether the candidate understands the domain and the cutoff semantics.
- **Train/serve parity and honest ROI** (a function of assumptions, not an invented figure) separate a production-credible answer from a notebook-only one.