Audit flight data quality from metadata
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
You’re given an airline on‑time dataset and a one‑page “Metadata” slide that claims: flight_date (string, local time), dep_time/arr_time (HHMM local), dep_delay/arr_delay (minutes, negative allowed), cancelled (0/1), tail_num (aircraft id), origin, dest, aircraft_type (optional), weather_code (optional). The modeling target will be arrival delay in minutes. Audit data quality before modeling: 1) List at least 8 concrete, high‑risk issues you would proactively test for (e.g., timezone mixups, impossible HHMM values like 2460, clock changes around DST, negative delays beyond physical plausibility, cancelled flights with non‑null delays, duplicated tail_num semantics, leakage via “arr_delay” filled post‑facto, inconsistent units). 2) For each, specify an exact detection check (e.g., dep_time not matching ^\d{3,4}$ OR casted minutes >= 24*60; arr_delay < -60 OR > 600; origin/dest not in allowed IATA set) and how you’d quantify impact (prevalence, row counts, feature coverage). 3) Propose mitigation that avoids target leakage (e.g., remove post‑arrival features, impute within training folds, align to UTC with airport TZ map, drop DST transition minutes) and describe how you’d verify the fixes with before/after distributions and holdout data. 4) Define a lightweight data contract for upstream producers (types, ranges, allowed nulls, semantic rules) and how you’d enforce it in CI with schema tests and sampling.
Quick Answer: This question evaluates data quality auditing competencies, including validation of timestamps and timezones, detection of label leakage and inconsistent units, provenance and schema enforcement, and hands‑on use of SQL/Python for dataset inspection.