Impute, join, and upsert using SQL and Python

Q: Impute, join, and upsert using SQL and Python

This is a Data Manipulation (SQL/Python) interview question from Capital One for Data Scientist roles. View the full question and solution on PracHub.

Q: How do I approach Data Manipulation (SQL/Python) interview questions?

Data Manipulation (SQL/Python) questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master data manipulation (sql/python) interviews.

Question

Write both SQL and Python (pandas) to complete the following data-manipulation tasks. Assume today is 2025-09-01 for any time filters. Schema: customers(customer_id INT, signup_date DATE, age INT, tier TEXT) events(event_id INT, customer_id INT, event_date DATE, event_type TEXT, amount DECIMAL(10,2)) staging_events(event_id INT, customer_id INT, event_date DATE, event_type TEXT, amount DECIMAL(10,2)) payments(payment_id INT, customer_id INT, payment_date DATE, amount DECIMAL(10,2)) Sample data (minimal): customers +-------------+-------------+-----+--------+ | customer_id | signup_date | age | tier | +-------------+-------------+-----+--------+ | 1 | 2025-08-20 | 34 | gold | | 2 | 2025-08-28 | NULL| silver | | 3 | 2025-08-29 | 27 | silver | | 4 | 2025-08-30 | NULL| bronze | +-------------+-------------+-----+--------+

events +----------+-------------+-------------+------------+--------+ | event_id | customer_id | event_date | event_type | amount | +----------+-------------+-------------+------------+--------+ | 10 | 1 | 2025-08-28 | purchase | 120.00 | | 11 | 2 | 2025-08-30 | purchase | 80.00 | | 12 | 2 | 2025-09-01 | refund | -20.00 | | 13 | 3 | 2025-08-26 | purchase | 60.00 | | 14 | 4 | 2025-08-27 | page_view | NULL | +----------+-------------+-------------+------------+--------+

staging_events +----------+-------------+-------------+------------+--------+ | event_id | customer_id | event_date | event_type | amount | +----------+-------------+-------------+------------+--------+ | 12 | 2 | 2025-09-01 | refund | -20.00 | | 15 | 1 | 2025-08-31 | purchase | 120.00 | +----------+-------------+-------------+------------+--------+

payments +------------+-------------+--------------+--------+ | payment_id | customer_id | payment_date | amount | +------------+-------------+--------------+--------+ | 100 | 1 | 2025-08-31 | 120.00 | | 101 | 3 | 2025-08-31 | 60.00 | +------------+-------------+--------------+--------+ Tasks: A) Impute missing ages in customers using the median age within tier, falling back to the global median if a tier’s median is null; return customer_id and imputed_age. B) Upsert from staging_events into events: insert rows whose event_id does not exist in events; if an event_id exists in both with different values, keep a single row with the latest event_date and its values; return the deduplicated events table. C) For the last 7 days inclusive (2025-08-26 to 2025-09-01), compute per-tier net revenue where purchase amounts are positive and refund amounts are negative; exclude non-monetary events (like page_view); use imputed_age and restrict to customers aged 18–65; return tier, total_revenue_7d, and customer_count_7d. D) Compute 7-day retention: among customers with a monetary event in the last 7 days, what fraction also had any event in the prior 7-day window (2025-08-19 to 2025-08-25)? Return one row per tier with retention_rate. Provide both SQL and pandas solutions; state any indexing choices and how you would test correctness.

Impute, join, and upsert using SQL and Python

Comments (0)