Impute, join, and upsert using SQL and Python
Company: Capital One
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Write both SQL and Python (pandas) to complete the following data-manipulation tasks. Assume today is 2025-09-01 for any time filters.
Schema:
customers(customer_id INT, signup_date DATE, age INT, tier TEXT)
events(event_id INT, customer_id INT, event_date DATE, event_type TEXT, amount DECIMAL(10,2))
staging_events(event_id INT, customer_id INT, event_date DATE, event_type TEXT, amount DECIMAL(10,2))
payments(payment_id INT, customer_id INT, payment_date DATE, amount DECIMAL(10,2))
Sample data (minimal):
customers
+-------------+-------------+-----+--------+
| customer_id | signup_date | age | tier |
+-------------+-------------+-----+--------+
| 1 | 2025-08-20 | 34 | gold |
| 2 | 2025-08-28 | NULL| silver |
| 3 | 2025-08-29 | 27 | silver |
| 4 | 2025-08-30 | NULL| bronze |
+-------------+-------------+-----+--------+
events
+----------+-------------+-------------+------------+--------+
| event_id | customer_id | event_date | event_type | amount |
+----------+-------------+-------------+------------+--------+
| 10 | 1 | 2025-08-28 | purchase | 120.00 |
| 11 | 2 | 2025-08-30 | purchase | 80.00 |
| 12 | 2 | 2025-09-01 | refund | -20.00 |
| 13 | 3 | 2025-08-26 | purchase | 60.00 |
| 14 | 4 | 2025-08-27 | page_view | NULL |
+----------+-------------+-------------+------------+--------+
staging_events
+----------+-------------+-------------+------------+--------+
| event_id | customer_id | event_date | event_type | amount |
+----------+-------------+-------------+------------+--------+
| 12 | 2 | 2025-09-01 | refund | -20.00 |
| 15 | 1 | 2025-08-31 | purchase | 120.00 |
+----------+-------------+-------------+------------+--------+
payments
+------------+-------------+--------------+--------+
| payment_id | customer_id | payment_date | amount |
+------------+-------------+--------------+--------+
| 100 | 1 | 2025-08-31 | 120.00 |
| 101 | 3 | 2025-08-31 | 60.00 |
+------------+-------------+--------------+--------+
Tasks:
A) Impute missing ages in customers using the median age within tier, falling back to the global median if a tier’s median is null; return customer_id and imputed_age.
B) Upsert from staging_events into events: insert rows whose event_id does not exist in events; if an event_id exists in both with different values, keep a single row with the latest event_date and its values; return the deduplicated events table.
C) For the last 7 days inclusive (2025-08-26 to 2025-09-01), compute per-tier net revenue where purchase amounts are positive and refund amounts are negative; exclude non-monetary events (like page_view); use imputed_age and restrict to customers aged 18–65; return tier, total_revenue_7d, and customer_count_7d.
D) Compute 7-day retention: among customers with a monetary event in the last 7 days, what fraction also had any event in the prior 7-day window (2025-08-19 to 2025-08-25)? Return one row per tier with retention_rate. Provide both SQL and pandas solutions; state any indexing choices and how you would test correctness.
Quick Answer: This question evaluates SQL and pandas proficiency in data manipulation tasks including missing-value imputation, tiered aggregation, joins and upsert/deduplication, time-windowed revenue computation, cohort retention analysis, and handling of monetary versus non-monetary events within a relational schema.