Design metrics resilient to data quality

Q: Design metrics resilient to data quality

This is a Analytics & Experimentation interview question from Other for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Analytics & Experimentation interview questions?

Analytics & Experimentation questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master analytics & experimentation interviews.

Question

Design a robust metric and compute it using only window functions (no JOINs) to show how data-quality issues change conclusions. Schema: payments_raw(txn_id STRING, loan_id INT, event_time TIMESTAMP, amount DECIMAL(10,2), ingestion_time TIMESTAMP, source STRING). Sample: +--------+---------+---------------------+--------+---------------------+--------+ | txn_id | loan_id | event_time | amount | ingestion_time | source | +--------+---------+---------------------+--------+---------------------+--------+ | a1 | 1 | 2025-08-25 10:00:00 | 200.00 | 2025-08-25 10:00:05 | app | | a1_dup | 1 | 2025-08-25 10:00:00 | 200.00 | 2025-08-25 10:05:12 | web | | b1 | 1 | 2025-08-29 09:00:00 | 300.00 | 2025-09-03 12:00:00 | batch | | c1 | 2 | 2025-08-31 14:00:00 | 100.00 | 2025-08-31 14:00:03 | app | | d1 | 2 | 2025-09-01 08:00:00 | -50.00 | 2025-09-01 08:00:05 | app | +--------+---------+---------------------+--------+---------------------+--------+ Tasks: (a) Define precisely a '7-day arrival completeness' metric for the window [2025-08-25, 2025-09-01], where today = 2025-09-01: the share of unique payments with event_time on day D that have at least one ingested record within 7 days of event_time. State how you will de-duplicate payments using only window functions (e.g., partition by loan_id, amount, event_time within 2 minutes; keep the earliest ingestion_time). (b) Compute two daily time series for D in [2025-08-25, 2025-09-01]: naive_completeness (no dedup, only same-day arrivals) and robust_completeness (your dedup rule and 7-day wait; censor results for D where the 7-day window is incomplete as of 2025-09-01). (c) Output a summary table with D, naive_completeness, robust_completeness, bias = naive - robust, and a flag if negative amounts exist on D. Explain one real-world decision that would differ if you used the naive metric.

Design metrics resilient to data quality

Comments (0)