Flight Delay Modeling: Binary Target, Features, and Diagnostics
You are modeling the probability that a flight arrives with a delay greater than 15 minutes (binary target: 1 if delay > 15 min, else 0). The current feature set includes:
- day_of_week (encoded as integers 1–7),
- flight_seats (contains negative values due to ETL errors),
- several correlated operational variables (e.g., turnaround_time, taxi_out, gate_occupancy).
The team mistakenly fit an OLS regression to this binary target.
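To make the problem with that fit concrete, here is a minimal sketch on hypothetical toy data (the values of `x` and `y` are invented for illustration and are not from the flight dataset). It fits a one-feature OLS line and a one-feature logistic model, then compares the fitted values. The helper names `fit_ols` and `fit_logistic`, and the plain gradient-descent fit, are assumptions chosen to keep the example dependency-free. A real analysis would more likely use a statistics library.

```python
import math

# Hypothetical toy data: x is some operational feature, y is the delay indicator.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Logistic regression by gradient descent on the mean log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(b0 + b1 * a) - b for a, b in zip(xs, ys)]
        grad0 = sum(residuals) / n
        grad1 = sum(r * a for r, a in zip(residuals, xs)) / n
        b0 -= lr * grad0
        b1 -= lr * grad1
    return b0, b1


intercept, slope = fit_ols(x, y)
ols_preds = [intercept + slope * a for a in x]

b0, b1 = fit_logistic(x, y)
logit_preds = [sigmoid(b0 + b1 * a) for a in x]

# OLS fitted values are unbounded, so they can leave the [0, 1] range.
print("OLS range:     ", min(ols_preds), max(ols_preds))
# Logistic fitted values are probabilities by construction.
print("Logistic range:", min(logit_preds), max(logit_preds))
```

On this toy data the OLS line dips below 0 at the smallest `x` and exceeds 1 at the largest, so its fitted values cannot be read as probabilities, while the logistic fit stays strictly inside (0, 1).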
Tasks
- Target/model choice
  - Explain rigorously why OLS is inappropriate for a binary target and select a correct alternative.
  - For your choice, specify the link function, assumptions, and how you’d check them.
- Encoding for day_of_week
  - Show why treating day_of_week as numeric can bias estimates.
  - Propose an appropriate encoding and a quick statistical test to assess day effects.
- Data quality
  - Propose a principled treatment for negative seat counts and missing values.
  - Quantify the impact of different strategies on variance and bias.
- Multicollinearity
  - Define the Variance Inflation Factor (VIF) and derive VIF = 1/(1 − R_j^2).
  - If VIF for turnaround_time is 12, interpret this value and list three remedies (and their trade-offs), including regularization.
  - Explain how standardization affects coefficient interpretation and multicollinearity diagnostics.
- Evaluation
  - Choose metrics aligned with the binary target under class imbalance.
  - Describe a time-based cross-validation scheme to avoid temporal leakage.
  - Explain how you would calibrate predicted probabilities.
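
As one possible starting point for the time-based cross-validation item, the sketch below generates expanding-window splits in which every training index precedes every test index. It assumes the rows are already sorted by time (for example by scheduled departure). The function name `expanding_window_splits` and the `gap` parameter are hypothetical and introduced only for illustration. The gap stands in for a buffer between training and test periods, which may help when features are built from recent history.

```python
def expanding_window_splits(n_samples, n_folds, gap=0):
    """Yield (train_indices, test_indices) for time-ordered rows.

    Assumes rows 0..n_samples-1 are sorted from oldest to newest.
    The training window grows with each fold; the test block always
    lies strictly after it, optionally separated by `gap` rows.
    """
    fold_size = n_samples // (n_folds + 1)
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    if gap < 0 or gap >= fold_size:
        raise ValueError("gap must be non-negative and smaller than the fold size")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        test_start = train_end + gap
        # The last fold absorbs any leftover rows at the end.
        test_end = (k + 1) * fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(test_start, test_end))


# Usage: 10 time-ordered rows, 3 folds, no gap.
for train_idx, test_idx in expanding_window_splits(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Because no test row ever precedes a training row, a model evaluated this way never sees information from the future of its test period, which is the leakage a shuffled k-fold split would introduce.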