You receive a real-time stream of events with schema: user_id (str), channel (str), event_type ("enter"|"exit"), ts (UTC ISO 8601 timestamp). A user can ‘enter’ and ‘exit’ a channel multiple times; events may arrive out of order or up to 5 minutes late.
Tasks:
- Batch (pandas): Given one day of data, compute per-channel active_user_count for every 1-minute tumbling window. Assume that a missing exit implies an implicit exit at the user's next enter for the same channel, or at day-end if there is no later enter (state your assumption). Handle overlapping sessions and duplicate events robustly. Output columns: window_start, channel, active_user_count.
- Top channels: For each minute, return the top 3 channels by active_user_count (ties broken lexicographically), and include dense rank per minute.
- Streaming design: Outline a solution that produces the same outputs with event-time windows, 5-minute allowed lateness, and idempotent processing (exactly-once semantics if possible). Discuss state keys, watermarks, late-event handling, and how you would compact long-lived state.
- Correctness and performance: Explain how you’d detect and repair clock skew, dedupe near-duplicates, and bound memory when the active set spikes. Provide big-O for steady state and worst case.
- Edge cases: How do you reconcile an ‘exit’ with no prior ‘enter’, or overlapping sessions by the same user in the same channel?
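
As a starting point for the batch and top-channels tasks, here is a minimal pandas sketch, not a reference solution. It rests on several assumptions that the prompt leaves open: only exact duplicates are dropped, an exit with no open session is ignored, a second enter closes the user's open session in that channel, sessions are half-open intervals [start, end), and a user counts as active in every minute a session touches. The function names (sessionize, active_user_counts, top_channels) and the sample data are hypothetical, and the input is assumed to be non-empty with ts already parsed to timezone-aware timestamps.

```python
import pandas as pd


def sessionize(events: pd.DataFrame, day_end: pd.Timestamp) -> pd.DataFrame:
    """Build [start, end) sessions per (user_id, channel) from enter/exit events."""
    # Assumption: only exact duplicates are removed; near-duplicates are kept.
    ev = events.drop_duplicates(subset=["user_id", "channel", "event_type", "ts"])
    # Sorting repairs out-of-order arrival; at equal ts, "enter" sorts before "exit".
    ev = ev.sort_values(["user_id", "channel", "ts", "event_type"])
    rows = []
    for (user, channel), grp in ev.groupby(["user_id", "channel"], sort=False):
        open_ts = None
        for etype, ts in zip(grp["event_type"], grp["ts"]):
            if etype == "enter":
                if open_ts is not None:
                    # Missing exit: implicit exit at the next enter.
                    rows.append((user, channel, open_ts, ts))
                open_ts = ts
            elif open_ts is not None:
                rows.append((user, channel, open_ts, ts))
                open_ts = None
            # Assumption: an exit with no open session is silently ignored.
        if open_ts is not None:
            # Missing exit with no later enter: implicit exit at day-end.
            rows.append((user, channel, open_ts, day_end))
    return pd.DataFrame(rows, columns=["user_id", "channel", "start", "end"])


def active_user_counts(sessions: pd.DataFrame) -> pd.DataFrame:
    """Count distinct users per (1-minute window, channel)."""
    first = sessions["start"].dt.floor("min")
    last = sessions["end"].dt.floor("min")
    # A session ending exactly on a minute boundary does not touch that minute.
    on_boundary = (sessions["end"] == last) & (last > first)
    last = last.where(~on_boundary, last - pd.Timedelta(minutes=1))
    minutes = sessions[["user_id", "channel"]].copy()
    minutes["window_start"] = pd.Series(
        [list(pd.date_range(a, b, freq="min")) for a, b in zip(first, last)],
        index=sessions.index,
    )
    minutes = minutes.explode("window_start", ignore_index=True)
    minutes["window_start"] = pd.to_datetime(minutes["window_start"])
    # nunique collapses overlapping sessions of the same user in a channel.
    return (
        minutes.groupby(["window_start", "channel"])["user_id"]
        .nunique()
        .rename("active_user_count")
        .reset_index()
    )


def top_channels(counts: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Keep the top-k channels per minute and attach a dense rank."""
    ranked = counts.copy()
    ranked["dense_rank"] = (
        ranked.groupby("window_start")["active_user_count"]
        .rank(method="dense", ascending=False)
        .astype(int)
    )
    # Lexicographic channel order decides which tied rows survive the cut.
    ranked = ranked.sort_values(
        ["window_start", "active_user_count", "channel"],
        ascending=[True, False, True],
    )
    return ranked.groupby("window_start").head(k).reset_index(drop=True)


# Hypothetical sample: one duplicate, one orphan exit, two missing exits.
events = pd.DataFrame(
    [
        ("u1", "a", "enter", "2024-01-01T00:00:10Z"),
        ("u1", "a", "enter", "2024-01-01T00:00:10Z"),
        ("u1", "a", "exit", "2024-01-01T00:01:30Z"),
        ("u2", "a", "enter", "2024-01-01T00:01:00Z"),
        ("u2", "a", "enter", "2024-01-01T00:02:30Z"),
        ("u2", "a", "exit", "2024-01-01T00:03:00Z"),
        ("u3", "b", "exit", "2024-01-01T00:00:20Z"),
        ("u3", "b", "enter", "2024-01-01T00:01:10Z"),
    ],
    columns=["user_id", "channel", "event_type", "ts"],
)
events["ts"] = pd.to_datetime(events["ts"], utc=True)
sessions = sessionize(events, day_end=pd.Timestamp("2024-01-01T00:04:00Z"))
counts = active_user_counts(sessions)
top = top_channels(counts)
```

Exploding every session into per-minute rows costs time and memory proportional to the total session-minutes, which is acceptable for a single day but is one of the things the streaming design would need to bound. Because the rank is dense, tied channels share a rank, so in this sketch the lexicographic rule only affects ordering and which rows fall inside the top 3; that reading of the tie-break is itself an assumption.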