Write SQL for streaming cohort metrics
Company: HBO
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Take-home Project
Given the schema and sample data below, write SQL to answer the three tasks. Use standard SQL (window functions allowed). Assume timestamps are UTC.
Tables:
users(user_id INT, signup_date DATE)
content(content_id INT, title TEXT, director_id INT)
directors(director_id INT, name TEXT)
views(user_id INT, content_id INT, start_time TIMESTAMP, end_time TIMESTAMP, seconds_watched INT)
experiments(user_id INT, exp_id TEXT, variant CHAR(1), exposure_time TIMESTAMP)
cancellations(user_id INT, cancel_time TIMESTAMP)
Sample rows (minimal):
users
+---------+-------------+
| user_id | signup_date |
+---------+-------------+
| 1 | 2022-07-20 |
| 2 | 2022-07-25 |
| 3 | 2022-08-02 |
| 4 | 2022-08-10 |
+---------+-------------+
content
+------------+-------------+-------------+
| content_id | title | director_id |
+------------+-------------+-------------+
| 10 | Movie1 | 100 |
| 11 | Movie2 | 100 |
| 12 | DocuA | 101 |
+------------+-------------+-------------+
directors
+-------------+-----------+
| director_id | name |
+-------------+-----------+
| 100 | Dir Alpha |
| 101 | Dir Beta |
+-------------+-----------+
views
+---------+------------+---------------------+---------------------+----------------+
| user_id | content_id | start_time | end_time | seconds_watched|
+---------+------------+---------------------+---------------------+----------------+
| 1 | 10 | 2022-08-05 12:05:00 | 2022-08-05 13:35:00 | 5400 |
| 1 | 11 | 2022-08-06 09:00:00 | 2022-08-06 09:45:00 | 2700 |
| 2 | 12 | 2022-08-06 20:00:00 | 2022-08-06 20:20:00 | 1200 |
| 3 | 10 | 2022-08-10 08:00:00 | 2022-08-10 08:30:00 | 1800 |
| 4 | 11 | 2022-08-15 21:00:00 | 2022-08-15 21:05:00 | 300 |
+---------+------------+---------------------+---------------------+----------------+
experiments
+---------+--------+---------+---------------------+
| user_id | exp_id | variant | exposure_time |
+---------+--------+---------+---------------------+
| 1 | HP123 | A | 2022-08-05 11:00:00 |
| 2 | HP123 | B | 2022-08-06 19:50:00 |
| 3 | HP123 | A | 2022-08-10 07:55:00 |
| 4 | HP123 | B | 2022-08-15 20:30:00 |
+---------+--------+---------+---------------------+
cancellations
+---------+---------------------+
| user_id | cancel_time |
+---------+---------------------+
| 2 | 2022-08-20 10:00:00 |
+---------+---------------------+
Tasks:
A) For August 2022, by director, compute: unique viewers, total watch-hours, and median watch-time per viewer. Treat multiple views by the same user for the same director as one viewer; use the sum of seconds_watched per user per director to compute the per-viewer median.
B) For experiment HP123, compute 7-day post-exposure cancellation rates by variant among users exposed before any cancellation (i.e., ignore exposures after cancel_time). Use exposure_time as day 0 and include cancellations with 0 < t <= 7 days.
C) For August 2022, return the top 3 titles by unique viewers per variant (A/B) among users who were exposed to HP123 before their first August view. Break ties by higher total watch-hours, then lexicographically by title.
Describe any assumptions you need (e.g., one active subscription per user), and write efficient SQL for each task.
Quick Answer: This question evaluates proficiency in SQL-based data manipulation and analytics, focusing on cohort analysis, temporal filtering, aggregation, deduplication and experiment metric computation within the Data Manipulation (SQL/Python) domain and the ability to translate business metric definitions into queryable datasets.