
You’re given an airline on‑time dataset and a one‑page “Metadata” slide that claims: flight_date (string, local time), dep_time/arr_time (HHMM local), dep_delay/arr_delay (minutes, negative allowed), cancelled (0/1), tail_num (aircraft id), origin, dest, aircraft_type (optional), weather_code (optional). The modeling target is arrival delay in minutes. Audit data quality before modeling:

1) List at least 8 concrete, high‑risk issues you would proactively test for (e.g., timezone mix‑ups, impossible HHMM values like 2460, clock changes around DST transitions, negative delays beyond physical plausibility, cancelled flights with non‑null delays, duplicated or ambiguous tail_num semantics, leakage via an “arr_delay” filled in post‑facto, inconsistent units).

2) For each issue, specify an exact detection check (e.g., dep_time not matching ^\d{3,4}$ OR cast minutes >= 24*60; arr_delay < -60 OR > 600; origin/dest not in the allowed IATA set) and state how you’d quantify impact: prevalence, row counts, and feature coverage. A detection sketch follows this list.

3) Propose mitigations that avoid target leakage (e.g., remove post‑arrival features, impute within training folds, align timestamps to UTC with an airport‑to‑timezone map, drop or flag DST transition minutes) and describe how you’d verify the fixes with before/after distributions and holdout data. A timezone‑alignment sketch follows this list.

4) Define a lightweight data contract for upstream producers (types, ranges, allowed nulls, semantic rules) and explain how you’d enforce it in CI with schema tests and sampling; a minimal contract‑check sketch follows this list.
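
One way to make item 2 concrete is a small pandas audit pass. This is a minimal sketch, assuming the data is already loaded into a DataFrame `df` with the column names from the metadata slide; the −60/600‑minute delay bounds, the duplicate key, and the `ALLOWED_IATA` set are illustrative assumptions, not part of the original spec.

```python
import re
import pandas as pd

# Illustrative allow-list only; in practice load the full IATA airport set.
ALLOWED_IATA = {"ATL", "ORD", "DFW", "DEN", "LAX"}

HHMM_RE = re.compile(r"^\d{3,4}$")

def hhmm_to_minutes(s):
    """Convert an HHMM string to minutes since midnight, or None if malformed/impossible."""
    if not isinstance(s, str) or not HHMM_RE.match(s):
        return None
    hh, mm = int(s[:-2]), int(s[-2:])
    return hh * 60 + mm if hh < 24 and mm < 60 else None  # rejects values like 2460

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Return row counts and prevalence for each high-risk check."""
    dep_minutes = df["dep_time"].map(hhmm_to_minutes)
    checks = {
        # present but not a valid HHMM value
        "dep_time_malformed_or_impossible": dep_minutes.isna() & df["dep_time"].notna(),
        # outside an assumed physically plausible delay range
        "arr_delay_out_of_range": (df["arr_delay"] < -60) | (df["arr_delay"] > 600),
        # cancelled flights should not carry arrival delays
        "cancelled_with_delay": (df["cancelled"] == 1) & df["arr_delay"].notna(),
        "unknown_airport_code": ~df["origin"].isin(ALLOWED_IATA) | ~df["dest"].isin(ALLOWED_IATA),
        # assumed duplicate key; adjust to the producer's true grain
        "duplicate_flight_rows": df.duplicated(
            subset=["flight_date", "tail_num", "origin", "dep_time"], keep=False
        ),
        # coverage of the optional features
        "missing_optional_features": df["aircraft_type"].isna() | df["weather_code"].isna(),
    }
    return pd.DataFrame({
        "rows_flagged": {k: int(v.sum()) for k, v in checks.items()},
        "prevalence": {k: float(v.mean()) for k, v in checks.items()},
    })

# Usage: report = audit(pd.read_csv("flights.csv", dtype={"dep_time": "string"}))
```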
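For the UTC‑alignment mitigation in item 3, a sketch along these lines could work. The `AIRPORT_TZ` map is hypothetical, flight_date is assumed to be an ISO YYYY‑MM‑DD string and dep_time a zero‑padded HHMM string, and nulling out ambiguous/nonexistent DST minutes (rather than shifting them) is a deliberate, auditable choice to validate against the real data.

```python
import pandas as pd

# Hypothetical airport -> IANA timezone map; in practice load a full reference table.
AIRPORT_TZ = {"ATL": "America/New_York", "ORD": "America/Chicago", "DEN": "America/Denver"}

def departures_to_utc(df: pd.DataFrame) -> pd.Series:
    """Combine flight_date + dep_time into a UTC timestamp using the origin's timezone.

    Ambiguous local times (fall-back hour) and nonexistent times (spring-forward gap)
    become NaT so they can be reviewed or dropped instead of silently shifted.
    """
    local_naive = pd.to_datetime(
        df["flight_date"].astype(str) + " " + df["dep_time"].astype(str).str.zfill(4),
        format="%Y-%m-%d %H%M",
        errors="coerce",  # malformed dates/times become NaT and surface in the audit
    )
    parts = []
    for airport, tz in AIRPORT_TZ.items():
        mask = df["origin"] == airport
        parts.append(
            local_naive[mask]
            .dt.tz_localize(tz, ambiguous="NaT", nonexistent="NaT")
            .dt.tz_convert("UTC")
        )
    # Origins missing from the map stay NaT, which is itself a finding to report.
    return pd.concat(parts).reindex(df.index)

# Usage: df["dep_ts_utc"] = departures_to_utc(df)
# then compare delay distributions before/after the fix and on a holdout split.
```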
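For item 4, one lightweight way to express the contract without extra dependencies is plain pandas plus a pytest check run in CI against a producer sample. The column rules, value ranges, and the `flights.csv` path below are illustrative assumptions, not a definitive contract.

```python
import pandas as pd

# Illustrative contract: column -> nullability and a per-value rule.
CONTRACT = {
    "flight_date": dict(nullable=False, check=lambda s: pd.to_datetime(s, errors="coerce").notna()),
    "dep_time":    dict(nullable=True,  check=lambda s: s.isna() | s.astype(str).str.fullmatch(r"\d{3,4}")),
    "arr_delay":   dict(nullable=True,  check=lambda s: s.isna() | s.between(-60, 600)),
    "cancelled":   dict(nullable=False, check=lambda s: s.isin([0, 1])),
    "origin":      dict(nullable=False, check=lambda s: s.astype(str).str.fullmatch(r"[A-Z]{3}")),
    "dest":        dict(nullable=False, check=lambda s: s.astype(str).str.fullmatch(r"[A-Z]{3}")),
}

def contract_violations(df: pd.DataFrame) -> dict:
    """Return violation counts per rule; an empty dict means the sample passes."""
    violations = {}
    for col, rule in CONTRACT.items():
        if col not in df.columns:
            violations[f"{col}: missing column"] = len(df)
            continue
        if not rule["nullable"]:
            n_null = int(df[col].isna().sum())
            if n_null:
                violations[f"{col}: unexpected nulls"] = n_null
        ok = rule["check"](df[col]).fillna(False).astype(bool)
        if (~ok).any():
            violations[f"{col}: value rule failed"] = int((~ok).sum())
    return violations

# CI sketch (pytest): validate a recent sample from the producer and fail loudly.
def test_flights_contract():
    df = pd.read_csv("flights.csv", dtype={"dep_time": "string"})
    sample = df.sample(n=min(5000, len(df)), random_state=0)
    assert contract_violations(sample) == {}
```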