Write SQL for streaming cohort metrics

Q: Write SQL for streaming cohort metrics

This question evaluates proficiency in SQL-based data manipulation and analytics, focusing on cohort analysis, temporal filtering, aggregation, deduplication and experiment metric computation within the Data Manipulation (SQL/Python) domain and the ability to translate business metric definitions into queryable datasets.

Q: How do I practice SQL interview questions?

PracHub provides an interactive SQL console where you can write and test queries against real database schemas. Get instant feedback and compare your solution with the expected output.

Q: What difficulty level is this coding question?

This is a Medium difficulty Data Manipulation (SQL/Python) question, commonly asked during Take-home Project rounds at HBO.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at HBO during technical interviews.

Question

Given the schema and sample data below, write SQL to answer the three tasks. Use standard SQL (window functions allowed). Assume timestamps are UTC.

Tables:
users(user_id INT, signup_date DATE)
content(content_id INT, title TEXT, director_id INT)
directors(director_id INT, name TEXT)
views(user_id INT, content_id INT, start_time TIMESTAMP, end_time TIMESTAMP, seconds_watched INT)
experiments(user_id INT, exp_id TEXT, variant CHAR(1), exposure_time TIMESTAMP)
cancellations(user_id INT, cancel_time TIMESTAMP)

Sample rows (minimal):
users
+---------+-------------+
| user_id | signup_date |
+---------+-------------+
| 1 | 2022-07-20 |
| 2 | 2022-07-25 |
| 3 | 2022-08-02 |
| 4 | 2022-08-10 |
+---------+-------------+

content
+------------+-------------+-------------+
| content_id | title | director_id |
+------------+-------------+-------------+
| 10 | Movie1 | 100 |
| 11 | Movie2 | 100 |
| 12 | DocuA | 101 |
+------------+-------------+-------------+

directors
+-------------+-----------+
| director_id | name |
+-------------+-----------+
| 100 | Dir Alpha |
| 101 | Dir Beta |
+-------------+-----------+

views
+---------+------------+---------------------+---------------------+----------------+
| user_id | content_id | start_time | end_time | seconds_watched|
+---------+------------+---------------------+---------------------+----------------+
| 1 | 10 | 2022-08-05 12:05:00 | 2022-08-05 13:35:00 | 5400 |
| 1 | 11 | 2022-08-06 09:00:00 | 2022-08-06 09:45:00 | 2700 |
| 2 | 12 | 2022-08-06 20:00:00 | 2022-08-06 20:20:00 | 1200 |
| 3 | 10 | 2022-08-10 08:00:00 | 2022-08-10 08:30:00 | 1800 |
| 4 | 11 | 2022-08-15 21:00:00 | 2022-08-15 21:05:00 | 300 |
+---------+------------+---------------------+---------------------+----------------+

experiments
+---------+--------+---------+---------------------+
| user_id | exp_id | variant | exposure_time |
+---------+--------+---------+---------------------+
| 1 | HP123 | A | 2022-08-05 11:00:00 |
| 2 | HP123 | B | 2022-08-06 19:50:00 |
| 3 | HP123 | A | 2022-08-10 07:55:00 |
| 4 | HP123 | B | 2022-08-15 20:30:00 |
+---------+--------+---------+---------------------+

cancellations
+---------+---------------------+
| user_id | cancel_time |
+---------+---------------------+
| 2 | 2022-08-20 10:00:00 |
+---------+---------------------+

Tasks:
A) For August 2022, by director, compute: unique viewers, total watch-hours, and median watch-time per viewer. Treat multiple views by the same user for the same director as one viewer; use the sum of seconds_watched per user per director to compute the per-viewer median.
B) For experiment HP123, compute 7-day post-exposure cancellation rates by variant among users exposed before any cancellation (i.e., ignore exposures after cancel_time). Use exposure_time as day 0 and include cancellations with 0 < t <= 7 days.
C) For August 2022, return the top 3 titles by unique viewers per variant (A/B) among users who were exposed to HP123 before their first August view. Break ties by higher total watch-hours, then lexicographically by title.
Describe any assumptions you need (e.g., one active subscription per user), and write efficient SQL for each task.

PracHub · Accepted Answer

This question evaluates proficiency in SQL-based data manipulation and analytics, focusing on cohort analysis, temporal filtering, aggregation, deduplication and experiment metric computation within the Data Manipulation (SQL/Python) domain and the ability to translate business metric definitions into queryable datasets.

Quick Overview

Quick Overview