Compute Effective Reads with SQL and Python Streaming
Company: Meta
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
post_read_events
+----------+---------+-----------+---------+--------------------+---------------------+---------------------+
| event_id | user_id | session_id| post_id | watch_duration_sec | screen_coverage_pct | event_timestamp |
+----------+---------+-----------+---------+--------------------+---------------------+---------------------+
| 1 | 101 | s1 | 501 | 2.5 | 75 | 2023-06-01 10:00:01 |
| 2 | 101 | s1 | 501 | 3.0 | 80 | 2023-06-01 10:00:05 |
| 3 | 102 | s2 | 502 | 6.0 | 65 | 2023-06-01 10:01:09 |
| 4 | 101 | s3 | 503 | 1.5 | 90 | 2023-06-01 11:15:21 |
| 5 | 102 | s2 | 502 | 4.0 | 70 | 2023-06-01 10:02:15 |
+----------+---------+-----------+---------+--------------------+---------------------+---------------------+
##### Scenario
Event-level log of post impressions for a newsfeed; need to compute effective reads in batch (SQL) and in near-real-time (Python streaming).
##### Question
Write a SQL query that returns every post_id whose total watch_duration_sec across all views exceeds X seconds AND whose maximum screen_coverage_pct across those views exceeds Y.
Using the same data consumed as a stream, write Python code that groups events into sessions (end a session if no event arrives for ≥30 s), then emits, for each session, the posts that satisfy the same effective-read criteria along with their total watch time and max screen coverage.
##### Hints
Window functions or GROUP BY for SQL; for Python, think generators, stateful dictionaries, and a session timeout watermark.
Quick Answer: This question evaluates proficiency in event-level data aggregation, sessionization, windowing, and stateful stream processing across SQL and Python, including computing aggregated metrics such as total watch duration and peak screen coverage.