Audit flight data quality from metadata

Q: What difficulty level is this coding question?

This is a medium-difficulty Data Manipulation (SQL/Python) question, commonly asked during onsite rounds at Capital One.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Capital One during technical interviews.

Question

You’re given an airline on-time dataset and a one-page “Metadata” slide that claims: flight_date (string, local time), dep_time/arr_time (HHMM local), dep_delay/arr_delay (minutes, negative allowed), cancelled (0/1), tail_num (aircraft id), origin, dest, aircraft_type (optional), weather_code (optional). The modeling target will be arrival delay in minutes. Audit data quality before modeling:

1) List at least 8 concrete, high-risk issues you would proactively test for (e.g., timezone mixups, impossible HHMM values like 2460, clock changes around DST, negative delays beyond physical plausibility, cancelled flights with non-null delays, duplicated tail_num semantics, leakage via “arr_delay” filled post-facto, inconsistent units).

2) For each, specify an exact detection check (e.g., dep_time not matching ^\d{3,4}$ OR casted minutes >= 24*60; arr_delay < -60 OR > 600; origin/dest not in allowed IATA set) and how you’d quantify impact (prevalence, row counts, feature coverage).

3) Propose mitigation that avoids target leakage (e.g., remove post-arrival features, impute within training folds, align to UTC with airport TZ map, drop DST transition minutes) and describe how you’d verify the fixes with before/after distributions and holdout data.

4) Define a lightweight data contract for upstream producers (types, ranges, allowed nulls, semantic rules) and how you’d enforce it in CI with schema tests and sampling.
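
As an illustration of part 2, here is a minimal Python sketch of a few of the detection checks named in the question, each reported as a row count and a prevalence. It is one possible shape for such an audit, not the accepted answer. The names `hhmm_is_valid`, `CHECKS`, `audit`, `ALLOWED_IATA`, and the sample rows are all hypothetical. The sketch also assumes that 2400 is invalid and that HHMM checks apply only to non-cancelled flights; either assumption may need adjusting for a real feed.

```python
import re

# Pattern quoted in the question; fullmatch rejects a trailing newline.
HHMM_RE = re.compile(r"\d{3,4}")

# Hypothetical allow-list; a real audit would load the full IATA code set.
ALLOWED_IATA = {"JFK", "LAX", "ORD"}


def hhmm_is_valid(value):
    """True if value is 3-4 digits with hour 0-23 and minute 0-59."""
    if value is None or not HHMM_RE.fullmatch(str(value)):
        return False
    n = int(value)
    return n // 100 < 24 and n % 100 < 60


# Each check returns True when the row is suspicious.
CHECKS = {
    # Assumption: cancelled flights may legitimately lack times.
    "bad_hhmm": lambda r: r["cancelled"] == 0
    and not (hhmm_is_valid(r["dep_time"]) and hhmm_is_valid(r["arr_time"])),
    # Thresholds taken from the question's example (-60 and 600 minutes).
    "arr_delay_out_of_range": lambda r: r["arr_delay"] is not None
    and (r["arr_delay"] < -60 or r["arr_delay"] > 600),
    "cancelled_with_delay": lambda r: r["cancelled"] == 1
    and r["arr_delay"] is not None,
    "unknown_airport": lambda r: r["origin"] not in ALLOWED_IATA
    or r["dest"] not in ALLOWED_IATA,
}


def audit(rows):
    """Return {check_name: (row_count, prevalence)} over the given rows."""
    total = len(rows)
    report = {}
    for name, is_bad in CHECKS.items():
        count = sum(1 for r in rows if is_bad(r))
        report[name] = (count, count / total if total else 0.0)
    return report


# Made-up rows purely for demonstration.
rows = [
    {"dep_time": "0930", "arr_time": "1145", "arr_delay": 5,
     "cancelled": 0, "origin": "JFK", "dest": "LAX"},
    {"dep_time": "2460", "arr_time": "0110", "arr_delay": -3,
     "cancelled": 0, "origin": "JFK", "dest": "ORD"},
    {"dep_time": "1200", "arr_time": "1400", "arr_delay": 900,
     "cancelled": 0, "origin": "ORD", "dest": "XXX"},
    {"dep_time": None, "arr_time": None, "arr_delay": 12,
     "cancelled": 1, "origin": "LAX", "dest": "JFK"},
]

if __name__ == "__main__":
    for name, (count, share) in audit(rows).items():
        print(f"{name}: {count} rows ({share:.1%})")
```

Keeping each check as a named predicate makes it easy to run the same report before and after a mitigation and compare the prevalence figures.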

PracHub · Accepted Answer

This question evaluates data quality auditing competencies, including validation of timestamps and timezones, detection of label leakage and inconsistent units, provenance and schema enforcement, and hands‑on use of SQL/Python for dataset inspection.
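
To make the schema-enforcement competency concrete, here is a hedged sketch of a lightweight data contract covering types, allowed nulls, range or format rules, and one semantic rule. `FieldSpec`, `CONTRACT`, and `validate` are hypothetical names, not part of any library. The specific rules are assumptions for illustration, including the -60..600 delay range borrowed from the question's example and the three-uppercase-letter airport codes. Only a subset of the columns is shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    type_: type                                     # expected Python type
    nullable: bool = False                          # is None allowed?
    check: Optional[Callable[[Any], bool]] = None   # range/format rule


def _is_iata_like(v):
    # Assumption: airport codes are exactly three uppercase letters.
    return len(v) == 3 and v.isalpha() and v.isupper()


# Hypothetical contract for a subset of the columns on the metadata slide.
CONTRACT = {
    "cancelled": FieldSpec(int, check=lambda v: v in (0, 1)),
    "arr_delay": FieldSpec(int, nullable=True,
                           check=lambda v: -60 <= v <= 600),
    "origin": FieldSpec(str, check=_is_iata_like),
    "dest": FieldSpec(str, check=_is_iata_like),
    "tail_num": FieldSpec(str, check=lambda v: v.strip() != ""),
    "aircraft_type": FieldSpec(str, nullable=True),
}


def validate(row):
    """Return a list of contract violations for one row (empty if clean)."""
    errors = []
    for name, spec in CONTRACT.items():
        if name not in row:
            errors.append(f"{name}: missing column")
            continue
        value = row[name]
        if value is None:
            if not spec.nullable:
                errors.append(f"{name}: null not allowed")
            continue
        if not isinstance(value, spec.type_):
            errors.append(f"{name}: expected {spec.type_.__name__}")
            continue
        if spec.check is not None and not spec.check(value):
            errors.append(f"{name}: failed range/format rule")
    # Semantic rule: a cancelled flight should not carry an arrival delay.
    if row.get("cancelled") == 1 and row.get("arr_delay") is not None:
        errors.append("semantic: cancelled flight has arr_delay")
    return errors
```

In CI, one option is to run `validate` over a random sample of each new batch and fail the build when the share of rows with violations exceeds an agreed threshold.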

Quick Overview
