Solve advanced SQL for streaming analytics
Company: Twitch
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
You are given minute-level streaming telemetry. Each row in minute_streamed represents one minute during which a streamer was live; each row in minute_viewed represents one viewer watching a streamer for one minute.
Schema
- minute_streamed(time_minute TIMESTAMP, streamer_username VARCHAR, category VARCHAR, concurrent_viewers INT)
- minute_viewed(time_minute TIMESTAMP, viewer_username VARCHAR, viewer_country VARCHAR, streamer_username VARCHAR)
Sample data (small and illustrative)
minute_streamed
+---------------------+------------------+----------+--------------------+
| time_minute | streamer_username| category | concurrent_viewers |
+---------------------+------------------+----------+--------------------+
| 2018-12-31 23:59:00 | aaa | TTBHGD | 100 |
| 2019-01-01 00:00:00 | aaa | TTBHGD | 120 |
| 2019-01-01 00:01:00 | aaa | TTBHGD | 130 |
| 2019-03-15 09:00:00 | bbb | VGDH | 50 |
| 2019-03-15 09:01:00 | bbb | VGDH | 60 |
| 2019-03-31 22:15:00 | bbb | CCVF | 70 |
| 2020-03-19 13:00:00 | aaa | TTBHGD | 133 |
| 2020-03-19 13:01:00 | aaa | TTBHGD | 45 |
+---------------------+------------------+----------+--------------------+
minute_viewed
+---------------------+----------------+----------------+------------------+
| time_minute | viewer_username| viewer_country | streamer_username|
+---------------------+----------------+----------------+------------------+
| 2019-01-01 00:00:00 | ccc | US | aaa |
| 2019-01-01 00:01:00 | ccc | US | aaa |
| 2019-03-15 09:00:00 | ddd | JP | bbb |
| 2019-03-15 09:00:00 | eee | US | bbb |
| 2019-03-15 09:01:00 | eee | US | bbb |
| 2019-03-31 22:15:00 | fff | US | bbb |
| 2020-03-19 13:00:00 | ccc | US | aaa |
+---------------------+----------------+----------------+------------------+
Assumptions
- A streamer may switch categories at any minute.
- Months are calendar months based on time_minute (UTC). Label months as YYYY-MM.
- 1 row in minute_streamed = 1 minute streamed; hours = minutes/60.
- You may use ANSI SQL with CTEs and window functions.
Tasks
1) Compute total monthly hours streamed for each month across all streamers. Output (month_yyyy_mm, hours_streamed). Order chronologically. Ensure it works when data spans multiple years (e.g., 2018-12 and 2019-01).
2) For each streamer, return their total streamed minutes and the share accounted for categories whose name contains a given keyword, case-insensitive. Input parameter: cat_keyword VARCHAR. Output (streamer_username, total_minutes, keyword_minutes, keyword_share = keyword_minutes/total_minutes). Treat category matching as case-insensitive substring search.
3) Find, for each streamer and month, whether their streamed hours increased versus the immediately previous calendar month. Treat a missing previous month as 0 minutes (so if a streamer did not stream in February but streamed in March, March counts as an increase over February=0). Handle year boundaries correctly (e.g., compare 2019-01 to 2018-12). Output rows only for months where current_month_minutes > prev_month_minutes: (streamer_username, month_yyyy_mm, current_minutes, prev_minutes).
4) For the year 2019 only, return for each streamer:
- avg_concurrent_viewers_2019: the average of concurrent_viewers over that streamer's streamed minutes in 2019 (use minute_streamed only).
- us_viewer_minutes_2019: the total number of viewer-minutes from US viewers in 2019. Join minute_viewed to minute_streamed on both (streamer_username, time_minute) so you don’t multiply counts. Output (streamer_username, avg_concurrent_viewers_2019, us_viewer_minutes_2019). As a follow-up, also compute unique_streaming_minutes_with_us_2019: the count of distinct streamed minutes in 2019 where the streamer had at least one US viewer.
5) Conceptual: Briefly explain SQL’s logical query processing order (FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY) and answer: logically, does AVG() happen before or after JOIN?
Deliverables
- Provide SQL for tasks 1–4 and a 1–2 sentence explanation for task 5. Ensure your queries are robust to mixed-case categories and months with no activity.
Quick Answer: This question evaluates advanced SQL competency for streaming analytics — covering time-series aggregation, monthly bucketing across year boundaries, joins between event tables, window functions, and case-insensitive substring filtering; the domain is Data Manipulation (SQL/Python) and the level is practical application of SQL techniques.