This question evaluates a data scientist's competency in time-aware predictive modeling within the Machine Learning / Data Science domain: exploratory data analysis for leakage and target drift detection, temporal cross-validation design, feature engineering with rare categories and class imbalance, model selection (linear and tree-based), explainability and robustness testing, and deployment/experiment specification. It is commonly asked because it probes production-ready ML workflows on temporally ordered data. It tests conceptual understanding of data leakage, drift, and validation alongside practical skills in metric selection, thresholding, inference contracts, and operational reliability.

You are building a binary classifier that predicts whether a domestic flight will arrive 15+ minutes late (late15 = 1 if arr_delay_min ≥ 15, else 0), using only information available by scheduled departure time.
You receive a 50M-row table with these columns (one row per scheduled flight):
Assume we must restrict features to those known by scheduled departure and align any aggregates/forecasts accordingly.
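
As a rough illustration of what "known by scheduled departure" can mean in practice, the sketch below derives `late15` from `arr_delay_min` and builds one historical aggregate that is shifted so a flight only sees outcomes from the previous calendar day. The column names `origin` and `sched_dep_ts` are assumptions made for this example (the column list is not shown here), and the one-day shift is a deliberately conservative simplification, not a prescribed solution.

```python
import pandas as pd


def add_label(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the target: late15 = 1 if arr_delay_min >= 15, else 0."""
    out = df.copy()
    out["late15"] = (out["arr_delay_min"] >= 15).astype("int8")
    return out


def add_prev_day_late_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Attach each origin's late rate from the PREVIOUS calendar day.

    `origin` and `sched_dep_ts` are hypothetical column names.
    Same-day outcomes are excluded on purpose: they may not be known
    yet at a given flight's scheduled departure time.
    """
    out = df.copy()
    out["dep_date"] = out["sched_dep_ts"].dt.normalize()

    # Daily late rate per origin airport.
    daily = (
        out.groupby(["origin", "dep_date"], as_index=False)["late15"]
        .mean()
        .rename(columns={"late15": "origin_prev_day_late_rate"})
    )
    # Shift the aggregate forward one day so day D's rate is only
    # visible to flights scheduled on day D + 1.
    daily["dep_date"] = daily["dep_date"] + pd.Timedelta(days=1)

    out = out.merge(daily, on=["origin", "dep_date"], how="left")
    return out.drop(columns="dep_date")


# Tiny made-up frame, for illustration only.
flights = pd.DataFrame(
    {
        "origin": ["AAA", "AAA", "AAA", "BBB"],
        "sched_dep_ts": pd.to_datetime(
            ["2024-01-01 08:00", "2024-01-01 12:00",
             "2024-01-02 09:00", "2024-01-02 10:00"]
        ),
        "arr_delay_min": [20, 5, 15, -3],
    }
)
feats = add_prev_day_late_rate(add_label(flights))
print(feats[["origin", "sched_dep_ts", "late15", "origin_prev_day_late_rate"]])
```

Flights with no prior-day history for their origin get a missing value rather than a same-day estimate; how to impute that (and whether a finer-grained cutoff than one day is safe) is a design decision left to the modeler.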