Write SQL and Python for funnels/retention
Company: Coinbase
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: HR Screen
Given the following schema and small samples, answer Parts A–C. Assume timestamps are UTC and "today" is 2025-09-01. Schema: users(id INT, country STRING, hire_date DATE, visa_status STRING, is_remote BOOLEAN); sessions(session_id INT, user_id INT, session_start TIMESTAMP, session_end TIMESTAMP, device STRING); events(event_id INT, session_id INT, user_id INT, event_time TIMESTAMP, event_type STRING, metadata JSON).
Sample tables:
users
+----+---------+------------+------------+-----------+
| id | country | hire_date | visa_status| is_remote |
+----+---------+------------+------------+-----------+
| 1 | US | 2025-08-15 | H1B | 1 |
| 2 | CN | 2024-12-01 | GC | 0 |
| 3 | IN | 2025-06-01 | H1B | 1 |
+----+---------+------------+------------+-----------+
sessions
+------------+---------+---------------------+---------------------+--------+
| session_id | user_id | session_start | session_end | device |
+------------+---------+---------------------+---------------------+--------+
| 10 | 1 | 2025-08-31 09:00:00 | 2025-08-31 09:30:00 | web |
| 11 | 1 | 2025-09-01 10:00:00 | 2025-09-01 10:07:00 | mobile |
| 12 | 2 | 2025-09-01 11:00:00 | 2025-09-01 11:20:00 | web |
| 13 | 3 | 2025-08-25 12:00:00 | 2025-08-25 12:10:00 | web |
+------------+---------+---------------------+---------------------+--------+
events
+----------+------------+---------+---------------------+------------+------------------------+
| event_id | session_id | user_id | event_time | event_type | metadata |
+----------+------------+---------+---------------------+------------+------------------------+
| 100 | 10 | 1 | 2025-08-31 09:01:00 | login | {} |
| 101 | 10 | 1 | 2025-08-31 09:05:00 | view | {"page":"pricing"} |
| 102 | 10 | 1 | 2025-08-31 09:10:00 | purchase | {"amount":100} |
| 103 | 11 | 1 | 2025-09-01 10:01:00 | login | {} |
| 104 | 11 | 1 | 2025-09-01 10:02:00 | view | {"page":"home"} |
| 105 | 12 | 2 | 2025-09-01 11:05:00 | login | {} |
| 106 | 12 | 2 | 2025-09-01 11:10:00 | purchase | {"amount":50} |
| 107 | 13 | 3 | 2025-08-25 12:05:00 | login | {} |
+----------+------------+---------+---------------------+------------+------------------------+
Part A (SQL): For each date d in [2025-08-26, 2025-09-01], produce one row with: date d; distinct users who started a session that day; distinct users with a login event that day; distinct users with a purchase event that day where the purchase event’s (user_id, session_id, event_time) falls within that user’s session window on day d; conversion rates login→purchase and session→purchase (both as decimals). Treat multiple events by the same user on the same day as one. Ignore events that do not match both user_id and session_id or are outside the session time window. Part B (SQL): For each user, compute first_login_date and first_purchase_date (based only on events that match a valid session window as above) and output a within_7_days flag indicating whether first_purchase_date occurs within 7 days inclusive of first_login_date. Part C (Python): Using pandas DataFrames with the same schemas, implement a function retention_1d(as_of="2025-09-01") that returns the percentage of users whose first_login_date is in [2025-08-25, 2025-09-01] and who have any valid event (matching a session window) on the day exactly one calendar day after their first_login_date. Clearly document how you handle duplicate or orphaned events and time boundaries.
Quick Answer: This question evaluates SQL and Python data-manipulation skills, including event-to-session joins, temporal window filtering, deduplication, JSON metadata handling, and computation of funnel and retention metrics like daily unique users and conversion rates.