Process real-time enter/exit events and actives
Company: Amazon
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Onsite
You receive a real-time stream of events with schema: user_id (str), channel (str), event_type ("enter"|"exit"), ts (ISO 8601 timestamp, UTC). A user can enter and exit a channel multiple times; events may arrive out of order and up to 5 minutes late.
Tasks:
1) Batch (pandas): Given a day of data, compute per-channel active_user_count for every 1-minute tumbling window. Assume a missing exit implies an implicit exit at the same user's next enter in the same channel, or at day-end if there is none (state your assumption). Handle overlapping sessions and duplicate events robustly. Output columns: window_start, channel, active_user_count.
2) Top channels: For each minute, return the top 3 channels by active_user_count (ties broken lexicographically), and include dense rank per minute.
3) Streaming design: Outline a solution that produces the same outputs with event-time windows, 5-minute allowed lateness, and idempotent processing (exactly-once semantics if possible). Discuss state keys, watermarks, late-event handling, and how you would compact long-lived state.
4) Correctness and performance: Explain how you’d detect and repair clock skew, dedupe near-duplicates, and bound memory when the active set spikes. Provide big-O for steady state and worst case.
5) Edge cases: How do you reconcile an ‘exit’ with no prior ‘enter’, or overlapping sessions by the same user in the same channel?
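One possible starting point for tasks 1, 2, and 5 is sketched below in pandas. It is a minimal sketch under stated assumptions, not a reference solution: the function names (sessions_from_events, active_counts, top_channels) and the day_end parameter are hypothetical, exact duplicate rows are dropped, an exit with no open enter is discarded, a missing exit is closed at the same user's next enter in the same channel or at day_end, and "top 3" is read as the first three rows per minute after sorting by count (descending) and channel name (ascending).

```python
import pandas as pd


def sessions_from_events(events: pd.DataFrame, day_end: pd.Timestamp) -> pd.DataFrame:
    """Pair enter/exit events into [start, end) sessions per (user_id, channel)."""
    ev = events.copy()
    ev["ts"] = pd.to_datetime(ev["ts"], utc=True)
    # Assumption: only exact duplicates are removed here.
    ev = ev.drop_duplicates(["user_id", "channel", "event_type", "ts"])
    ev = ev.sort_values(["user_id", "channel", "ts"], kind="stable")
    rows = []
    for (user, channel), grp in ev.groupby(["user_id", "channel"], sort=False):
        open_ts = None
        for etype, ts in zip(grp["event_type"], grp["ts"]):
            if etype == "enter":
                if open_ts is not None:
                    # Missing exit: close the open session at this enter.
                    rows.append((user, channel, open_ts, ts))
                open_ts = ts
            elif open_ts is not None:
                rows.append((user, channel, open_ts, ts))
                open_ts = None
            # Assumption: an exit with no open enter is dropped.
        if open_ts is not None:
            # Missing exit at the end of the data: close at day_end.
            rows.append((user, channel, open_ts, day_end))
    return pd.DataFrame(rows, columns=["user_id", "channel", "start", "end"])


def active_counts(sessions: pd.DataFrame) -> pd.DataFrame:
    """Count distinct users per (1-minute window, channel)."""
    s = sessions.copy()
    s["start"] = pd.to_datetime(s["start"], utc=True)
    s["end"] = pd.to_datetime(s["end"], utc=True)
    s = s[s["end"] > s["start"]]  # ignore zero-length sessions
    first = s["start"].dt.floor("min")
    # Last window touched by the half-open interval [start, end).
    last = s["end"].dt.ceil("min") - pd.Timedelta(minutes=1)
    s["window_start"] = pd.Series(
        [list(pd.date_range(a, b, freq="min")) for a, b in zip(first, last)],
        index=s.index,
        dtype=object,
    )
    s = s.explode("window_start")
    s["window_start"] = pd.to_datetime(s["window_start"], utc=True)
    # nunique keeps a user who re-enters within the same minute from counting twice.
    return (
        s.groupby(["window_start", "channel"])["user_id"]
        .nunique()
        .rename("active_user_count")
        .reset_index()
    )


def top_channels(counts: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Top-k channels per minute, with a dense rank on active_user_count."""
    c = counts.sort_values(
        ["window_start", "active_user_count", "channel"],
        ascending=[True, False, True],
    )
    c["dense_rank"] = (
        c.groupby("window_start")["active_user_count"]
        .rank(method="dense", ascending=False)
        .astype(int)
    )
    return c.groupby("window_start").head(k).reset_index(drop=True)
```

Calling top_channels(active_counts(sessions_from_events(events, day_end))) chains the three steps. The explode step materializes one row per session-minute, which is simple but can be large for long sessions; a sweep over +1/-1 boundary events would be a more memory-friendly alternative if per-minute distinct-user semantics are handled separately.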
Quick Answer: This question evaluates a candidate's competence in event-time processing, stateful windowed aggregations, deduplication, late and out-of-order event handling, and streaming-system design including state management and scalability.
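To make the late and out-of-order handling concrete, the sketch below simulates a watermark with 5-minute allowed lateness in plain Python. It is an illustration under assumptions, not a production design: the ReorderBuffer class and its method names are hypothetical, the watermark is simply the maximum event time seen minus the allowed lateness, events are deduplicated on the full (user_id, channel, event_type, ts) tuple, and events at or behind the watermark go to a side list instead of being merged. Stream processors such as Apache Flink or Apache Beam provide watermarks and allowed lateness natively; this only mimics the idea in memory.

```python
import heapq
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(minutes=5)


def _key(event):
    # Assumption: the full event tuple serves as the idempotency key.
    return (event["user_id"], event["channel"], event["event_type"], event["ts"])


class ReorderBuffer:
    """Buffers events and releases them in event-time order behind a watermark."""

    def __init__(self, lateness=ALLOWED_LATENESS):
        self.lateness = lateness
        self.max_ts = None      # highest event time seen so far
        self.heap = []          # (ts, seq, event), ordered by event time
        self.seen = set()       # keys of events still buffered
        self.seq = 0            # tie-breaker so dicts are never compared
        self.too_late = []      # side output for events behind the watermark

    @property
    def watermark(self):
        return None if self.max_ts is None else self.max_ts - self.lateness

    def add(self, event):
        """Ingest one event; return the events that are now safe to process."""
        key = _key(event)
        if key in self.seen:
            return []  # redelivered duplicate of a buffered event
        wm = self.watermark
        if wm is not None and event["ts"] <= wm:
            self.too_late.append(event)  # beyond allowed lateness
            return []
        self.seen.add(key)
        heapq.heappush(self.heap, (event["ts"], self.seq, event))
        self.seq += 1
        if self.max_ts is None or event["ts"] > self.max_ts:
            self.max_ts = event["ts"]
        ready = []
        while self.heap and self.heap[0][0] <= self.watermark:
            _, _, released = heapq.heappop(self.heap)
            # Compaction: once released, the key no longer needs to be kept,
            # because any later copy falls behind the watermark.
            self.seen.discard(_key(released))
            ready.append(released)
        return ready
```

Downstream session logic keyed by (user_id, channel) can then consume the released events in order. Memory is bounded by the number of events inside the lateness horizon, and each add costs O(log n) for the heap operations, where n is the number of buffered events.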