
Build and evaluate airline delay prediction model

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's machine learning competencies including target definition, leakage-aware feature engineering, temporal splitting and backtesting, model comparison and hyperparameter tuning, cost-sensitive evaluation, and production constraints such as latency, model size, monitoring, and retraining.

  • Medium
  • Capital One
  • Machine Learning
  • Data Scientist

Company: Capital One

Role: Data Scientist

Category: Machine Learning

Difficulty: Medium

Interview Round: Technical Screen

You are given several CSVs for the classic airline delay challenge with columns like flight_date, carrier, flight_num, origin, dest, sched_dep, sched_arr, dep_delay_min, arr_delay_min, distance, aircraft_type, weather_features_*, and holiday_flag.

a) Define a binary target and justify it: e.g., late_arrival = arr_delay_min > 15.

b) Detail a leakage-aware feature set: include weather forecasts at origin/dest, route history aggregates up to t−7 days, time-of-day, day-of-week, month, distance, carrier- and airport-level rolling stats; exclude or properly lag any features that encode future information (e.g., actual arrival times).

c) Specify a time-based split (e.g., train up to 2024-06, validate 2024-07–2024-09, test 2024-10–2025-03), class imbalance handling, and primary metrics (PR-AUC, calibrated Brier).

d) Compare a strong baseline (regularized logistic regression with target encoding) versus gradient boosting (e.g., XGBoost/LightGBM): hyperparameters to search, early stopping, monotonic constraints if used.

e) Explain how you would do rolling-origin cross-validation and backtesting of threshold policies (e.g., proactive swaps or buffers) with cost-sensitive evaluation that prices false negatives at 5× false positives.

f) Productionization: 20 ms/flight latency budget, 50 MB model size, feature store vs on-the-fly aggregation, drift detection, and periodic retraining cadence.

g) Deliverables: reproducible notebook, clean data pipeline, model cards with fairness slices across carriers/airports, and an exec summary with recommended operational policy and estimated ROI.

Solution

### Framing the problem

The business goal drives everything: predict, *before departure*, whether a flight will arrive meaningfully late, so operations can act (rebook crews, pre-position aircraft, warn passengers, add buffer). That framing forces two disciplines throughout: the model may only use information knowable at *prediction time* (a fixed horizon before scheduled departure), and the evaluation must mimic how the model would have been used historically. Both points are where most airline-delay solutions quietly break via leakage, so I treat leakage as the central risk, not an afterthought.

I'll assume the prediction is made at a fixed cutoff (e.g., **2 hours before scheduled departure**) and freeze "what is knowable" to that instant.

---

### a) Target definition and justification

**Primary target:** `late_arrival = (arr_delay_min > 15)`, a binary label.

Justification:

- **Operationally meaningful, not arbitrary.** The 15-minute threshold is the long-standing industry convention for an "on-time" arrival (the DOT/BTS on-time definition), so the label maps to a metric the business already reports and is held to. BTS counts a flight as delayed at 15 minutes or more, so use `>= 15` if the label must match the official statistic exactly. Predicting it produces something stakeholders can act on and benchmark.
- **Binary beats raw regression for the use case.** We could regress `arr_delay_min` directly, but the decision (intervene or not) is a threshold decision, and the delay distribution is heavy-tailed and zero-inflated (most flights are on time, with a long right tail of severe delays). A classifier calibrated to "P(late)" is easier to threshold against a cost policy than a noisy minute-level regression.
- **Watch the label's edge cases.** `arr_delay_min` must be the *actual* realized arrival delay used only for labeling, never as a feature. Cancelled/diverted flights have no `arr_delay_min`; decide explicitly how to label them: typically treat a cancellation as a positive ("did not arrive on time") if the downstream cost is similar, or model cancellation separately.
Document the choice; silently dropping cancellations biases the label toward optimism.

**Secondary targets worth defining for richer policy work (optional, mention but don't over-build):**

- A multi-class / ordinal version: on-time / minor (15–60 min) / major (>60 min), because the *cost* of a delay is non-linear.
- A regression head on `arr_delay_min` (e.g., quantile regression at the 0.5/0.9 quantiles) if ops wants an expected-buffer estimate, not just a flag.

I'd lead with the binary classifier and keep the ordinal/quantile variant as a stretch deliverable.

---

### b) Leakage-aware feature set

The governing rule: **every feature must be reconstructable from data available at the prediction cutoff (2h pre-departure).** I group features by source and explicitly state the lag.

**Schedule / static (known at booking time, safe):**

- `distance`, `aircraft_type`, `carrier`, `origin`, `dest`, scheduled block time = `sched_arr − sched_dep`.
- Time encodings: hour-of-day of `sched_dep` (cyclical sin/cos), `day_of_week`, `month`, `holiday_flag`, and a "day before/after holiday" flag. Cyclical encoding avoids the artificial 23→0 discontinuity.
- Route = `(origin, dest)` and directional flag.

**Weather: forecasts only, never actuals.** Use the *forecast* for the departure/arrival window issued *before* the cutoff (this is what `weather_features_*` should represent at serving time). Using realized weather at the actual arrival time is leakage. Concretely: forecasted precipitation, wind, ceiling/visibility, convective probability at origin and dest for the scheduled departure and arrival hours.

**Historical aggregates, strictly lagged to `t − 7 days` or earlier:**

- Route-level: mean/median `arr_delay_min`, P(late), variance over the route's flights in a trailing window (e.g., last 7/28 days), as of `t−7`.
- Carrier-level and airport-level rolling delay rates (origin departure-delay rate, dest arrival-delay rate) over trailing windows.
- Carrier×airport and aircraft-type rolling stats for fleet/station effects.
- **Critical leakage trap:** these aggregates must be computed with an *expanding/rolling window that ends strictly before the row's own date*, not over the whole training set. Computing a route's mean delay over all dates (including future) and joining it back is the most common leakage bug here and inflates offline metrics by a lot.

**Same-day upstream propagation (the highest-signal feature, but the trickiest):**

- The single biggest driver of arrival delay is whether the *inbound* aircraft and crew are already late. **Schema caveat:** the listed columns have no tail number / registration, and `aircraft_type` is an aircraft *class* (e.g., B738), not an aircraft identifier, so you cannot link the exact physical inbound airframe from these columns alone. The cleanest proxy the schema *does* support is the prior leg flown under the same `carrier` + `flight_num` earlier the same `flight_date`, or a same-day leg arriving into this flight's `origin` chained by `(carrier, dest=origin, sched_arr ≈ this flight's sched_dep − turnaround)`. Use that proxy, and flag that a true tail-rotation linkage would require adding a registration column to the data.
- Given a linkable prior leg, its *current departure delay* (observed as of the cutoff) is enormously predictive and **legitimate**, because at 2h before departure we genuinely know that earlier leg's status. Include `inbound_dep_delay_so_far`, `inbound_in_air_flag`, `turnaround_buffer = sched_dep − inbound_sched_arr` (all derived from the proxy linkage above).
- If the cutoff is *before* the inbound leg has departed, then you only have its forecast, not its realized delay; encode that honestly (use the inbound's own predicted P(late), or mark as unknown).
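
The causal-window rule for historical aggregates can be sketched in pandas. This is a minimal illustration, not a definitive implementation: it assumes `flight_date` is a datetime column and that a `late_arrival` label column has already been derived. The function name `lagged_route_late_rate` and the output column `route_late_rate` are hypothetical and not part of the given schema.

```python
import pandas as pd


def lagged_route_late_rate(df, window_days=28, lag_days=7):
    """Attach a trailing late rate per route that only sees flights dated
    from (flight_date - lag_days - window_days + 1) to (flight_date - lag_days)."""
    # Aggregate to daily counts per route first, so same-day flights never see each other.
    daily = (
        df.groupby(["origin", "dest", "flight_date"])["late_arrival"]
        .agg(n_late="sum", n_flights="count")
        .reset_index()
    )
    parts = []
    for (origin, dest), g in daily.groupby(["origin", "dest"]):
        # Fill calendar gaps with zeros so the window is measured in days, not rows.
        s = g.set_index("flight_date")[["n_late", "n_flights"]].asfreq("D", fill_value=0)
        roll = s.rolling(window_days, min_periods=1).sum()
        # Stats through day d only become usable on day d + lag_days.
        roll.index = roll.index + pd.Timedelta(days=lag_days)
        roll["route_late_rate"] = roll["n_late"] / roll["n_flights"]
        roll = roll.rename_axis("flight_date").reset_index()
        roll["origin"], roll["dest"] = origin, dest
        parts.append(roll[["origin", "dest", "flight_date", "route_late_rate"]])
    feats = pd.concat(parts, ignore_index=True)
    # Left join: flights with no usable history get NaN rather than a leaked value.
    return df.merge(feats, on=["origin", "dest", "flight_date"], how="left")
```

The same pattern would apply to carrier-level and airport-level rates by changing the grouping keys. Because the feature for a flight on day `d` is built only from days up to `d − lag_days`, it can be recomputed identically offline and at serving time.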

**Explicitly excluded (future-encoding) features:**

- `arr_delay_min` (the label), actual arrival time, actual taxi/airborne times, realized en-route weather, `dep_delay_min` of *this same flight* if the cutoff precedes departure (we don't yet know it). If the cutoff is *after* pushback you could use realized `dep_delay_min`, but then state that explicitly, because it changes the product.

**Encoding plan:** high-cardinality categoricals (`origin`, `dest`, `carrier`, route, aircraft_type) via **target/mean encoding computed inside the CV fold with smoothing and out-of-fold predictions** to avoid target leakage; or native categorical handling for the GBM (LightGBM/XGBoost). Never fit the target encoder on the full data before splitting.

---

### c) Time-based split, imbalance handling, metrics

**Split: strictly temporal, no shuffling.** Random k-fold is invalid here: it leaks the future into the past and lets near-duplicate same-day flights straddle folds. Use the proposed scheme:

- Train: through 2024-06
- Validation (model selection, early stopping, calibration, threshold): 2024-07 → 2024-09
- Test (reported once, untouched): 2024-10 → 2025-03

Add a small **embargo/gap** (e.g., drop the few days straddling each boundary) so trailing-window features computed near the boundary don't peek across it. Recompute all rolling aggregates *within* each split's own causal window.

**Class imbalance.** "Late" is the minority (roughly a quarter or less of flights, depending on threshold/season; state this qualitatively, don't quote a fixed number). Approach in priority order:

1. **Do nothing to the data first:** GBMs handle moderate imbalance fine. Set `scale_pos_weight ≈ (#neg / #pos)` (XGBoost) or `is_unbalance`/`class_weight` (LightGBM), or class weights in logistic regression. This reweights the loss rather than resampling the data; it still shifts the predicted probabilities, so calibrate afterward (point 3).
2. **Avoid naive oversampling/SMOTE** for this problem: it distorts calibration (which we care about; see Brier) and SMOTE-ing time-series rows breaks temporal structure. If used, only on the training fold, and recalibrate afterward.
3. **Optimize a threshold on the cost curve, not 0.5** (part e), and **calibrate probabilities** (isotonic or Platt on the validation set) so the cost-sensitive threshold is meaningful.

**Primary metrics:**

- **PR-AUC (average precision):** the right ranking metric under imbalance; ROC-AUC is overly flattering when negatives dominate. PR-AUC focuses on the positive (late) class we care about.
- **Calibrated Brier score** plus a reliability diagram, because the downstream policy uses *probabilities*, not just rankings; a well-ranked but mis-calibrated model makes the cost-optimal threshold wrong.
- **Secondary:** recall at a fixed operational precision (e.g., recall @ precision=0.5), ROC-AUC for continuity with prior reporting, and the **expected cost** per the 5:1 cost ratio (part e) as the ultimate business metric.

---

### d) Baseline (regularized logistic regression) vs gradient boosting

**Baseline: regularized logistic regression with target encoding.**

- Pipeline: out-of-fold smoothed target encoding for high-cardinality categoricals, one-hot for low-cardinality, standardize numerics, `LogisticRegression` with L2 (or elastic-net via `saga`).
- Hyperparameters to search: inverse-reg strength `C` (log-grid, e.g., $10^{-3}$ to $10^{2}$), penalty (L2 vs elastic-net `l1_ratio`), class weight. Small search; LR is cheap.
- Value: fast, fully interpretable coefficients, a calibration-friendly probabilistic baseline, and a sanity floor. If the GBM can't clearly beat a well-tuned LR, something's wrong (often leakage in the GBM, or no real signal).
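
The out-of-fold smoothed target encoding used in the baseline pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `fit_smoothed_encoding` and `oof_target_encode`, the smoothing weight `m`, and the contiguous-block fold scheme (which assumes rows are already sorted by `flight_date`) are all hypothetical choices, not something the question prescribes.

```python
import numpy as np
import pandas as pd


def fit_smoothed_encoding(keys, y, m=20.0):
    """Smoothed per-category mean of y: (sum_y + m * prior) / (n + m)."""
    prior = float(y.mean())
    stats = y.groupby(keys).agg(["sum", "count"])
    mapping = (stats["sum"] + m * prior) / (stats["count"] + m)
    return mapping, prior


def oof_target_encode(keys, y, n_folds=5, m=20.0):
    """Encode each training row with a mapping fit on the other folds only."""
    n = len(keys)
    # Assumption: rows are sorted by flight_date, so folds are contiguous time blocks.
    fold_id = np.arange(n) * n_folds // n
    encoded = np.full(n, np.nan)
    for k in range(n_folds):
        held = fold_id == k
        mapping, prior = fit_smoothed_encoding(keys[~held], y[~held], m)
        # Categories unseen in the fitting folds fall back to the global prior.
        encoded[held] = keys[held].map(mapping).fillna(prior).to_numpy()
    return pd.Series(encoded, index=keys.index)
```

For validation and test rows, the mapping would instead be fit once on the full training period with `fit_smoothed_encoding` and applied with `.map(mapping).fillna(prior)`, so no label from a later period ever reaches the encoder.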

**Gradient boosting: XGBoost / LightGBM.**

- Why it should win: non-linear interactions (route × weather × time-of-day × inbound delay) are exactly what trees capture, and native categorical/missing handling fits this messy data.
- Hyperparameters to search (Bayesian/Optuna over a temporal validation split, not random CV):
  - `num_leaves` / `max_depth` (capacity), `learning_rate` (small, e.g., 0.03–0.1) paired with **early stopping** on validation PR-AUC/logloss to choose `n_estimators`; never fix the tree count by hand.
  - `min_child_weight` / `min_data_in_leaf` (regularize against tiny noisy leaves), `subsample` (`bagging_fraction`), `colsample_bytree` (`feature_fraction`), `reg_alpha`/`reg_lambda`.
  - `scale_pos_weight` for imbalance.
- **Monotonic constraints: use them deliberately.** Domain priors that hold monotonically: higher forecasted precipitation/wind → higher P(late); larger inbound delay → higher P(late); shorter turnaround buffer → higher P(late). Encoding these as monotone constraints buys robustness, easier stakeholder trust, and guards against weird non-monotone fits in sparse regions. *Don't* constrain features where the relationship is genuinely non-monotone (e.g., hour-of-day).
- **Calibration:** GBMs are often mildly mis-calibrated; fit isotonic regression on the validation set after training, then report Brier on test.

**Decision rule:** pick by validation expected cost (5:1) and Brier, with PR-AUC as a tiebreaker; favor the GBM only if its *cost* advantage survives calibration and holds on the untouched test window. Keep the LR as the interpretable fallback and as a monitored shadow.

---

### e) Rolling-origin CV, backtesting threshold policies, cost-sensitive eval

**Rolling-origin (walk-forward) cross-validation.** Instead of a single train/val cut, slide the origin forward to test stability across regimes (summer thunderstorms vs winter ops).
Concretely, several expanding-window folds:

| Fold | Train through | Validate |
|------|---------------|----------|
| 1 | 2024-03 | 2024-04 |
| 2 | 2024-04 | 2024-05 |
| 3 | 2024-05 | 2024-06 |
| … | … | … |

Always train-before-validate, recompute features causally per fold, and report mean ± standard deviation of PR-AUC / cost across folds. High variance flags a model that's fragile to seasonality, which is important to surface before it hits production.

**Cost-sensitive evaluation.** Define the confusion costs from the brief: a false negative (we said on-time, flight was late → no proactive action, expensive downstream recovery) costs **5×** a false positive (we said late, it wasn't → wasted buffer/swap). Total cost over an evaluation window (divide by the number of flights for a per-flight figure):

$$\text{Cost} = c_{FN}\cdot FN + c_{FP}\cdot FP, \quad c_{FN} = 5\,c_{FP}.$$

The cost-optimal threshold on calibrated probability $p$ follows directly. Intervening on a flight costs $(1-p)\,c_{FP}$ in expectation, while not intervening costs $p\,c_{FN}$, so acting is cheaper when $(1-p)\,c_{FP} < p\,c_{FN}$, i.e., when $p > t^{*}$ with

$$t^{*} = \frac{c_{FP}}{c_{FP}+c_{FN}}.$$

Normalizing to $c_{FP}=1,\ c_{FN}=5$ (the brief's 5:1 ratio):

$$t^{*} = \frac{1}{1+5} = \frac{1}{6} \approx 0.167,$$

i.e., intervene whenever $P(\text{late})$ exceeds roughly 0.17, *not* 0.5. This is exactly why calibration matters: a wrong probability scale moves $t^*$ to the wrong place. (If true costs vary by route/aircraft, make $c_{FN}, c_{FP}$ features of the decision rather than global constants.)

**Backtesting threshold policies.** Beyond a single threshold, backtest *operational policies* on held-out months, replaying them as if live:

- **Proactive-swap policy:** above $t^*$, trigger an aircraft/crew swap or rebook hold; price the swap cost vs avoided downstream delay.
- **Buffer policy:** add schedule/turnaround buffer to high-risk flights; price idle-aircraft cost vs delay savings.
- For each policy, on each backtest month compute realized expected cost, number of interventions, and "regret" vs an oracle.
Sweep the threshold to produce a cost-vs-threshold curve and pick the operating point that minimizes expected cost subject to an intervention-budget cap (ops can only swap so many flights/day). Report the policy's cost *delta vs the current/no-model baseline*; that delta is the ROI input for part g.

---

### f) Productionization

**Latency budget (20 ms/flight).** Easily met by a GBM at inference if feature assembly is fast: the dominant cost is the *feature joins*, not the tree traversal. Keep the model to a few hundred shallow trees; serve with the native predictor (LightGBM/XGBoost C++ or an ONNX/Treelite-compiled tree), which typically scores a single row in well under a millisecond. Budget the rest for feature lookup.

**Model size (50 MB).** A tuned GBM typically serializes to somewhere between a few MB and a few tens of MB; if it exceeds the budget, cap `num_leaves`/`n_estimators` or compile with Treelite. That is comfortably under 50 MB; note the size in the model card.

**Feature store vs on-the-fly.**

- **Precompute the slow, lagged aggregates** (route/carrier/airport rolling stats as of `t−7`) in a **feature store** with a daily batch job, served from a low-latency key-value store keyed by (route, date), (carrier, date), (airport, date). These can't be computed in 20 ms and don't change intra-day.
- **Compute on the fly** the cheap, request-time signals: time encodings, distance, weather *forecast* pulled at request time, and the **inbound-leg status** (which is live and the whole point of predicting at the cutoff).
- **Guarantee train/serve parity:** the same code path computes features offline and online (or generates from one definition), and the offline aggregates respect the exact `t−7` causal cut used in training. Parity bugs here are the production analog of the training-time leakage in (b).

**Drift detection.**

- *Input drift:* PSI / KS tests on key feature distributions (weather forecast distributions, route mix, carrier mix) vs the training reference; alert on shift.
- *Prediction drift:* monitor the distribution of predicted P(late) and the intervention rate.
- *Performance & calibration drift:* once actual `arr_delay_min` lands (label arrives hours later), compute rolling PR-AUC, Brier, and reliability; alert when calibration degrades. Schedule shocks (new routes, schedule changes, irregular ops/IRROPS, weather seasons) are the realistic drift drivers.

**Retraining cadence.** Given strong seasonality and schedule churn, **retrain on a regular cadence (e.g., monthly) with automatic re-calibration**, plus **event-triggered retraining** when drift/calibration alerts fire or after major schedule changes. Always validate a candidate on the most recent untouched window with the rolling-origin protocol and the cost metric before promotion; ship behind a champion/challenger shadow so a bad model never auto-promotes.

---

### g) Deliverables

- **Reproducible notebook + clean pipeline.** One pinned-environment notebook that runs end-to-end from raw CSVs → cleaned, joined dataset → features → model → evaluation, with the data pipeline factored into importable modules (not just notebook cells) so offline and serving share feature code. Deterministic seeds; cached intermediate artifacts; the temporal split and embargo encoded in config, not hard-coded.
- **Model card.** Intended use and prediction cutoff; training window; features and their lags; metrics (PR-AUC, calibrated Brier, expected cost at 5:1) with confidence from the rolling-origin folds; **fairness/robustness slices across carriers and airports**: report per-carrier and per-airport PR-AUC, calibration, and false-negative rate, because a model that's accurate overall but systematically under-predicts delays at small airports or for one carrier creates uneven operational harm. Flag and, if needed, mitigate slices where the FN rate (the costly error) is materially worse.
- **Exec summary.** One page: recommended operating threshold ($\approx 0.17$ on calibrated probability under the stated cost ratio) and recommended policy (e.g., proactive-swap/buffer above threshold subject to a daily intervention budget); the **backtested cost delta vs no-model baseline** translated into estimated annual savings (state the ROI as a transparent function of the cost assumptions $c_{FN}, c_{FP}$ and intervention volume, not a single fabricated number); limitations (weather-forecast quality, IRROPS regimes the model wasn't trained on, fairness slices to watch); and a monitoring/retraining plan.

---

### What a strong answer demonstrates (and pitfalls)

- **Leakage discipline is the whole game** here: anchor every feature to a fixed prediction cutoff, lag aggregates causally, compute target encodings out-of-fold, split temporally with an embargo. Most weak answers leak via whole-dataset aggregates or random CV and report unrealistically good metrics.
- **Calibration + cost-sensitivity, not accuracy.** The decision is a 5:1 cost threshold on a *probability*; accuracy/ROC-AUC alone are the wrong scoreboard.
- **The inbound-aircraft/propagation feature** is the highest-leverage signal and the cleanest test of whether the candidate understands the domain and the cutoff semantics.
- **Train/serve parity and honest ROI** (a function of assumptions, not an invented figure) separate a production-credible answer from a notebook-only one.
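
As a worked illustration of the 5:1 cost threshold from part (e), the sketch below computes the closed-form operating point and sweeps candidate thresholds on a backtest window. It is a minimal sketch, not a definitive implementation: the function names (`cost_optimal_threshold`, `cost_per_flight`, `sweep_thresholds`) are hypothetical, the costs are normalized to $c_{FP}=1,\ c_{FN}=5$, and the closed form is only meaningful if `p_late` is calibrated.

```python
import numpy as np


def cost_optimal_threshold(c_fp=1.0, c_fn=5.0):
    """Closed-form threshold: act when p > c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)


def cost_per_flight(y_true, p_late, threshold, c_fp=1.0, c_fn=5.0):
    """Average misclassification cost when intervening on flights with p_late > threshold."""
    y = np.asarray(y_true, dtype=bool)
    act = np.asarray(p_late, dtype=float) > threshold
    fp = np.sum(act & ~y)   # intervened, flight was on time
    fn = np.sum(~act & y)   # did not intervene, flight was late
    return float(c_fp * fp + c_fn * fn) / len(y)


def sweep_thresholds(y_true, p_late, thresholds, c_fp=1.0, c_fn=5.0):
    """Return the empirically cheapest threshold and the full cost curve."""
    costs = np.array(
        [cost_per_flight(y_true, p_late, t, c_fp, c_fn) for t in thresholds]
    )
    return thresholds[int(np.argmin(costs))], costs
```

On a held-out month, comparing the swept minimum against `cost_optimal_threshold()` is a quick calibration check: if the empirical optimum sits far from 1/6, the probability scale is probably off. An intervention-budget cap could be added by discarding thresholds whose intervention count exceeds the daily limit before taking the minimum.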

Oct 13, 2025, 9:49 PM