Write robust SQL for streaming analytics
Company: Twitch
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Write efficient PostgreSQL SQL for the following streaming analytics tasks. Use the invented schema and sample data below. Handle duplicates, out-of-order events, and missing stops as specified.
Schema:
streams(stream_id INT PRIMARY KEY, creator_id INT, game_id INT, started_at TIMESTAMP, ended_at TIMESTAMP, is_partner BOOLEAN)
view_events(user_id INT, stream_id INT, ts TIMESTAMP, action TEXT CHECK (action IN ('start','stop')), device TEXT, country TEXT)
Sample data (streams):
+-----------+------------+---------+---------------------+---------------------+------------+
| stream_id | creator_id | game_id | started_at | ended_at | is_partner |
+-----------+------------+---------+---------------------+---------------------+------------+
| 101 | 1 | 10 | 2025-08-31 23:50:00 | 2025-09-01 02:10:00 | true |
| 102 | 2 | 11 | 2025-09-01 00:05:00 | 2025-09-01 01:00:00 | false |
| 103 | 1 | 12 | 2025-09-01 03:00:00 | 2025-09-01 04:00:00 | true |
+-----------+------------+---------+---------------------+---------------------+------------+
Sample data (view_events):
+---------+-----------+---------------------+--------+--------+---------+
| user_id | stream_id | ts | action | device | country |
+---------+-----------+---------------------+--------+--------+---------+
| 1001 | 101 | 2025-08-31 23:55:00 | start | mobile | US |
| 1001 | 101 | 2025-09-01 00:40:00 | stop | mobile | US |
| 1002 | 101 | 2025-09-01 00:00:00 | start | web | US |
| 1002 | 101 | 2025-09-01 00:20:00 | stop | web | US |
| 1001 | 102 | 2025-09-01 00:10:00 | start | mobile | US |
| 1001 | 102 | 2025-09-01 00:45:00 | stop | mobile | US |
| 1003 | 102 | 2025-09-01 00:20:00 | start | tv | CA |
| 1003 | 102 | 2025-09-01 00:50:00 | stop | tv | CA |
| 1002 | 103 | 2025-09-01 03:05:00 | start | web | US |
| 1002 | 103 | 2025-09-01 03:45:00 | stop | web | US |
+---------+-----------+---------------------+--------+--------+---------+
Assumptions to enforce in your SQL:
- Events can arrive out of order; de-duplicate exact duplicate rows.
- If a 'start' has no subsequent 'stop', treat its stop as LEAST(streams.ended_at, ts + interval '4 hours').
- Ignore any 'stop' that occurs before its last unmatched 'start' for the same user_id, stream_id.
Tasks:
A) Peak concurrency per stream: return stream_id, peak_concurrent_viewers, and the ts window where the peak occurs.
B) Multi-streaming overlaps on 2025-09-01: return user_id and the total minutes they watched two different streams concurrently for >= 5 minutes overlap windows.
C) 7-day new-viewer retention: considering users whose first-ever view occurred between 2025-08-24 and 2025-08-31, compute D+7 retention where a user is retained if they watch any stream for at least 2 minutes on their 7th day after first view. Assume today is 2025-09-01.
D) Top creators by average per-viewer watch time in the last 7 days (2025-08-26 to 2025-09-01): return creator_id and avg_watch_minutes among US viewers, restricted to creators with >= 100 unique US viewers; break ties by higher peak concurrency from Task A.
Quick Answer: This question evaluates a candidate's ability to author robust SQL for streaming analytics, covering de-duplication, event-time handling, sessionization, concurrency, overlap detection, and retention calculations on time-series view event data.