Determine Drivers of Airline Flight Delays
Context
You are analyzing a flight-level dataset to identify which factors most impact delays. Assume you have one row per flight with columns such as:
- delay_min (arrival delay in minutes; can be 0+ and skewed)
- delayed_15 (binary: delay_min ≥ 15)
- carrier, origin, destination, aircraft_type
- scheduled_departure_hour, day_of_week, month (seasonality), holiday
- route_distance, precipitation, wind, visibility (origin/destination weather)
- flight_date (for time-aware validation)
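As a rough illustration of how two of these columns relate to the analysis, the sketch below derives delayed_15 from delay_min and splits flights chronologically on flight_date. The toy rows, the cutoff date, and the helper names add_delayed_15 and time_split are assumptions made up for this example, not part of the dataset description.

```python
from datetime import date

# Hypothetical toy rows; the real data would have one row per flight
# with all of the columns listed above.
flights = [
    {"flight_date": date(2023, 1, 5), "delay_min": 0},
    {"flight_date": date(2023, 2, 10), "delay_min": 42},
    {"flight_date": date(2023, 3, 15), "delay_min": 14},
    {"flight_date": date(2023, 4, 20), "delay_min": 15},
]


def add_delayed_15(rows):
    """Return copies of rows with delayed_15 = 1 when delay_min >= 15, else 0."""
    return [{**r, "delayed_15": int(r["delay_min"] >= 15)} for r in rows]


def time_split(rows, cutoff):
    """Train on flights before the cutoff date, test on flights on or after it."""
    train = [r for r in rows if r["flight_date"] < cutoff]
    test = [r for r in rows if r["flight_date"] >= cutoff]
    return train, test


labeled = add_delayed_15(flights)
# The cutoff is an arbitrary assumption for illustration.
train, test = time_split(labeled, cutoff=date(2023, 3, 1))
print(len(train), "training flights;", len(test), "test flights")
```

Splitting by date rather than at random keeps later flights out of the training set, which is one simple way to make validation time-aware.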
Task
Propose a statistical approach to determine which factors most impact delays. Specifically:
- Choose appropriate outcome(s) and justify them.
- Specify models and hypothesis tests you would use to quantify factor impacts.
- Detail how you would validate model assumptions and guard against common pitfalls (e.g., seasonality, heteroskedasticity).
- Explain how you would interpret results and report uncertainty.
Hints: Consider regression, ANOVA/Type II/III tests, confidence intervals, seasonality modeling, heteroskedasticity checks, and time-aware validation.
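To make the regression, confidence-interval, and heteroskedasticity hints concrete, here is a minimal sketch under strong simplifying assumptions: a single numeric predictor (precipitation), made-up toy values, an HC0 (White) robust standard error, and a normal-approximation 95% interval. The function name ols_slope_with_se is hypothetical, and a real analysis would more likely fit a multi-factor model in a statistics library.

```python
import math


def ols_slope_with_se(x, y):
    """Simple linear regression of y on x.

    Returns (slope, classical_se, hc0_se), where hc0_se is the
    heteroskedasticity-robust (HC0 / White) standard error of the slope.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical SE assumes constant error variance across flights.
    classical_se = math.sqrt(sum(e ** 2 for e in resid) / (n - 2) / sxx)
    # HC0 weights each squared residual by its squared leverage in x.
    hc0_se = math.sqrt(
        sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid))
    ) / sxx
    return slope, classical_se, hc0_se


# Hypothetical toy data: delay spread grows with precipitation.
precipitation = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
delay_min = [2.0, 5.0, 3.0, 20.0, 6.0, 40.0]

slope, classical_se, hc0_se = ols_slope_with_se(precipitation, delay_min)
# Normal-approximation 95% interval using the robust standard error.
ci = (slope - 1.96 * hc0_se, slope + 1.96 * hc0_se)
print(f"slope={slope:.2f}, classical SE={classical_se:.2f}, HC0 SE={hc0_se:.2f}")
print(f"approx. 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Comparing the classical and robust standard errors gives a quick, informal signal of heteroskedasticity; a large gap suggests that intervals based on the constant-variance assumption may be misleading.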