Flight Delay Prediction — Data Quality, Modeling Choice, and Multicollinearity
Scenario
You have historical flight operations and weather data and need to build a model that predicts whether a flight will be delayed (e.g., more than 15 minutes late) at departure or arrival.
Assume you have tables such as: Flights (schedule, actuals, carrier, route), Weather (station, time, conditions), Airports (metadata), and possibly Air Traffic Control (ATC) constraints.
Tasks
- Inspect the raw dataset and list likely data-quality issues you would check for and expect to find.
- Choose a modeling framework and justify classification versus regression for the stated outcome.
- Variance Inflation Factors (VIF) indicate high multicollinearity. Describe how you would diagnose and mitigate multicollinearity when presenting to another data scientist.
- In an ideal setting, you can run an experiment: outline an experimental design to help confirm or resolve the multicollinearity issue.
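As a starting point for the data-quality task, the checks can be written as small rules applied to each flight record. The sketch below is only illustrative: the field names (`carrier`, `flight_number`, `origin`, `dest`, `sched_dep`, `actual_dep`) and the plausibility window for delays are assumptions, not part of the scenario's schema.

```python
from datetime import datetime

# Hypothetical required fields; the real Flights table may differ.
REQUIRED = ("carrier", "flight_number", "origin", "dest", "sched_dep", "actual_dep")


def validate_flight(record, seen_keys):
    """Return a list of issue labels for one flight record (a dict)."""
    issues = []

    # Missing values. Note that a cancelled flight legitimately has no
    # actual departure, so this flag is a prompt to investigate, not an error.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            issues.append(f"missing:{field}")

    # Duplicate rows, keyed on carrier, flight number, and scheduled departure.
    key = (record.get("carrier"), record.get("flight_number"), record.get("sched_dep"))
    if key in seen_keys:
        issues.append("duplicate")
    seen_keys.add(key)

    # Implausible delays often point to time-zone or date-rollover bugs.
    # The window of -60 minutes to +24 hours is an assumed threshold.
    sched, actual = record.get("sched_dep"), record.get("actual_dep")
    if isinstance(sched, datetime) and isinstance(actual, datetime):
        delay_min = (actual - sched).total_seconds() / 60
        if not -60 <= delay_min <= 24 * 60:
            issues.append("implausible_delay")

    # Route sanity check.
    if record.get("origin") and record.get("origin") == record.get("dest"):
        issues.append("same_origin_dest")

    return issues
```

Counting the returned labels across the whole table gives a rough picture of which issues need imputation and which need upstream fixes.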
Hints: Mention imputation, data validation, one-hot encoding, feature selection, regularization, variance inflation factors, and A/B or switchback tests.
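For the multicollinearity task, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features; equivalently, it is the j-th diagonal entry of the inverse of the feature correlation matrix. The sketch below uses that identity with the standard library only. The feature names and values (`wind`, `gust`, `visibility`) are invented for illustration, and in practice a library routine such as statsmodels' `variance_inflation_factor` would usually be preferred.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns.

    Assumes no column is constant (a constant column has zero norm).
    """
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Made-up weather features: gust tracks wind closely, visibility does not.
wind = [5, 12, 8, 20, 15, 3, 10, 18]
gust = [7, 16, 10, 27, 19, 5, 13, 23]
visibility = [10, 6, 9, 9, 4, 8, 10, 5]

for name, value in zip(("wind", "gust", "visibility"), vif([wind, gust, visibility])):
    print(f"{name}: VIF = {value:.1f}")
```

With these toy values, `wind` and `gust` land well above the common rule-of-thumb threshold of 10, which is the kind of pair one would then handle by dropping or combining a feature, or by fitting with regularization.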