
You’re given an airline on‑time dataset and a one‑page “Metadata” slide that claims: flight_date (string, local time), dep_time/arr_time (HHMM local), dep_delay/arr_delay (minutes, negative allowed), cancelled (0/1), tail_num (aircraft id), origin, dest, aircraft_type (optional), weather_code (optional). The modeling target is arrival delay in minutes. Audit data quality before modeling:

1) List at least 8 concrete, high‑risk issues you would proactively test for (e.g., timezone mix‑ups, impossible HHMM values like 2460, clock changes around DST transitions, negative delays beyond physical plausibility, cancelled flights with non‑null delays, duplicated or ambiguous tail_num semantics, leakage via an “arr_delay” filled in post‑facto, inconsistent units).

2) For each issue, specify an exact detection check (e.g., dep_time not matching ^\d{3,4}$ OR cast minutes >= 24*60; arr_delay < -60 OR > 600; origin/dest not in the allowed IATA set) and state how you’d quantify impact: prevalence, row counts, and feature coverage. A detection sketch follows this list.

3) Propose mitigations that avoid target leakage (e.g., remove post‑arrival features, impute within training folds, align timestamps to UTC with an airport‑to‑timezone map, drop or flag DST transition minutes) and describe how you’d verify the fixes with before/after distributions and holdout data. A timezone‑alignment sketch follows this list.

4) Define a lightweight data contract for upstream producers (types, ranges, allowed nulls, semantic rules) and explain how you’d enforce it in CI with schema tests and sampling; a minimal contract‑check sketch follows this list.
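
One way to make item 2 concrete is a small pandas audit pass. This is a minimal sketch, assuming the data is already loaded into a DataFrame `df` with the column names from the metadata slide; the −60/600‑minute delay bounds, the duplicate key, and the `ALLOWED_IATA` set are illustrative assumptions, not part of the original spec.

```python
import re
import pandas as pd

# Illustrative allow-list only; in practice load the full IATA airport set.
ALLOWED_IATA = {"ATL", "ORD", "DFW", "DEN", "LAX"}

HHMM_RE = re.compile(r"^\d{3,4}$")

def hhmm_to_minutes(s):
    """Convert an HHMM string to minutes since midnight, or None if malformed/impossible."""
    if not isinstance(s, str) or not HHMM_RE.match(s):
        return None
    hh, mm = int(s[:-2]), int(s[-2:])
    return hh * 60 + mm if hh < 24 and mm < 60 else None  # rejects values like 2460

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Return row counts and prevalence for each high-risk check."""
    dep_minutes = df["dep_time"].map(hhmm_to_minutes)
    checks = {
        # present but not a valid HHMM value
        "dep_time_malformed_or_impossible": dep_minutes.isna() & df["dep_time"].notna(),
        # outside an assumed physically plausible delay range
        "arr_delay_out_of_range": (df["arr_delay"] < -60) | (df["arr_delay"] > 600),
        # cancelled flights should not carry arrival delays
        "cancelled_with_delay": (df["cancelled"] == 1) & df["arr_delay"].notna(),
        "unknown_airport_code": ~df["origin"].isin(ALLOWED_IATA) | ~df["dest"].isin(ALLOWED_IATA),
        # assumed duplicate key; adjust to the producer's true grain
        "duplicate_flight_rows": df.duplicated(
            subset=["flight_date", "tail_num", "origin", "dep_time"], keep=False
        ),
        # coverage of the optional features
        "missing_optional_features": df["aircraft_type"].isna() | df["weather_code"].isna(),
    }
    return pd.DataFrame({
        "rows_flagged": {k: int(v.sum()) for k, v in checks.items()},
        "prevalence": {k: float(v.mean()) for k, v in checks.items()},
    })

# Usage: report = audit(pd.read_csv("flights.csv", dtype={"dep_time": "string"}))
```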
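For the UTC‑alignment mitigation in item 3, a sketch along these lines could work. The `AIRPORT_TZ` map is hypothetical, flight_date is assumed to be an ISO YYYY‑MM‑DD string and dep_time a zero‑padded HHMM string, and nulling out ambiguous/nonexistent DST minutes (rather than shifting them) is a deliberate, auditable choice to validate against the real data.

```python
import pandas as pd

# Hypothetical airport -> IANA timezone map; in practice load a full reference table.
AIRPORT_TZ = {"ATL": "America/New_York", "ORD": "America/Chicago", "DEN": "America/Denver"}

def departures_to_utc(df: pd.DataFrame) -> pd.Series:
    """Combine flight_date + dep_time into a UTC timestamp using the origin's timezone.

    Ambiguous local times (fall-back hour) and nonexistent times (spring-forward gap)
    become NaT so they can be reviewed or dropped instead of silently shifted.
    """
    local_naive = pd.to_datetime(
        df["flight_date"].astype(str) + " " + df["dep_time"].astype(str).str.zfill(4),
        format="%Y-%m-%d %H%M",
        errors="coerce",  # malformed dates/times become NaT and surface in the audit
    )
    parts = []
    for airport, tz in AIRPORT_TZ.items():
        mask = df["origin"] == airport
        parts.append(
            local_naive[mask]
            .dt.tz_localize(tz, ambiguous="NaT", nonexistent="NaT")
            .dt.tz_convert("UTC")
        )
    # Origins missing from the map stay NaT, which is itself a finding to report.
    return pd.concat(parts).reindex(df.index)

# Usage: df["dep_ts_utc"] = departures_to_utc(df)
# then compare delay distributions before/after the fix and on a holdout split.
```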
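For item 4, one lightweight way to express the contract without extra dependencies is plain pandas plus a pytest check run in CI against a producer sample. The column rules, value ranges, and the `flights.csv` path below are illustrative assumptions, not a definitive contract.

```python
import pandas as pd

# Illustrative contract: column -> nullability and a per-value rule.
CONTRACT = {
    "flight_date": dict(nullable=False, check=lambda s: pd.to_datetime(s, errors="coerce").notna()),
    "dep_time":    dict(nullable=True,  check=lambda s: s.isna() | s.astype(str).str.fullmatch(r"\d{3,4}")),
    "arr_delay":   dict(nullable=True,  check=lambda s: s.isna() | s.between(-60, 600)),
    "cancelled":   dict(nullable=False, check=lambda s: s.isin([0, 1])),
    "origin":      dict(nullable=False, check=lambda s: s.astype(str).str.fullmatch(r"[A-Z]{3}")),
    "dest":        dict(nullable=False, check=lambda s: s.astype(str).str.fullmatch(r"[A-Z]{3}")),
}

def contract_violations(df: pd.DataFrame) -> dict:
    """Return violation counts per rule; an empty dict means the sample passes."""
    violations = {}
    for col, rule in CONTRACT.items():
        if col not in df.columns:
            violations[f"{col}: missing column"] = len(df)
            continue
        if not rule["nullable"]:
            n_null = int(df[col].isna().sum())
            if n_null:
                violations[f"{col}: unexpected nulls"] = n_null
        ok = rule["check"](df[col]).fillna(False).astype(bool)
        if (~ok).any():
            violations[f"{col}: value rule failed"] = int((~ok).sum())
    return violations

# CI sketch (pytest): validate a recent sample from the producer and fail loudly.
def test_flights_contract():
    df = pd.read_csv("flights.csv", dtype={"dep_time": "string"})
    sample = df.sample(n=min(5000, len(df)), random_state=0)
    assert contract_violations(sample) == {}
```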