This question evaluates a candidate's ability to design scalable, stateful streaming analytics for monthly NEW vs RETURNING request metrics. It focuses on event-time processing with late and out-of-order arrivals, deduplication, compact state, and the trade-offs of probabilistic data structures with quantifiable error bounds.
You receive a high-volume event stream of requests. Each event has at least a user_id, a request_id (unique if available), and an event_time (convertible to a specified time zone). Events are mostly time-ordered but can arrive up to 7 days late, and duplicates may appear. You may keep limited state per user, with 8 GB of RAM available per processing task. Assume up to 1B distinct users and 50K requests/sec overall.
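To make the constraints concrete, here is a rough back-of-envelope sketch of the state budget. The 16 bytes per user entry and the 50% usable-RAM fraction are illustrative assumptions, not part of the problem statement, and the function name is hypothetical.

```python
import math

GIB = 1024 ** 3

def min_tasks_for_user_state(distinct_users, bytes_per_user,
                             ram_bytes_per_task, usable_fraction=0.5):
    """Lower bound on tasks needed to hold one state entry per user.

    usable_fraction is an assumed share of RAM left for state after
    runtime overhead, buffers, and dedup structures.
    """
    budget = ram_bytes_per_task * usable_fraction
    users_per_task = int(budget // bytes_per_user)
    return math.ceil(distinct_users / users_per_task)

# Assumption: ~16 bytes per user (hashed id + first-month index + overhead).
tasks = min_tasks_for_user_state(1_000_000_000, 16, 8 * GIB)

# Worst-case number of request_ids inside the 7-day lateness window,
# if the stated 50K requests/sec were sustained the whole time.
dedup_window_events = 50_000 * 7 * 86_400

print(tasks, dedup_window_events)
```

Under these assumptions the per-user state has to be partitioned by user_id across several tasks, and remembering every request_id for the full lateness window is costly enough that a probabilistic filter becomes worth considering.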
Goal: For each calendar month in the specified time zone, emit at month close the counts and percentage shares of requests from NEW vs RETURNING users.
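As a minimal sketch of the bucketing step, the helpers below map an event to its calendar month in the reporting time zone and compute the earliest instant at which that month can be finalized. Treating "month close" as month end plus the 7-day lateness bound is an assumption, as are the function names. A fixed UTC-8 offset stands in for the time zone so the example has no dependencies; a real zone with DST rules could be passed as a `zoneinfo.ZoneInfo` instead.

```python
from datetime import datetime, timedelta, timezone, tzinfo

ALLOWED_LATENESS = timedelta(days=7)  # lateness bound from the problem statement

def month_key(event_time: datetime, tz: tzinfo) -> tuple[int, int]:
    """(year, month) of an aware event_time, evaluated in the reporting zone."""
    local = event_time.astimezone(tz)
    return (local.year, local.month)

def month_close_time(key: tuple[int, int], tz: tzinfo) -> datetime:
    """UTC instant after which no in-bound late event for this month can arrive.

    Assumption: close = start of the next month (local) + allowed lateness.
    """
    year, month = key
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    month_end_local = datetime(next_year, next_month, 1, tzinfo=tz)
    # Add the lateness in UTC so DST shifts do not stretch or shrink it.
    return month_end_local.astimezone(timezone.utc) + ALLOWED_LATENESS

tz = timezone(timedelta(hours=-8))  # illustrative stand-in for the reporting zone
event = datetime(2024, 2, 1, 3, 0, tzinfo=timezone.utc)
key = month_key(event, tz)
close_at = month_close_time(key, tz)
```

Here an event stamped 03:00 UTC on 1 February still belongs to January in the UTC-8 zone, which is why the month must be derived after the time zone conversion rather than from the raw timestamp.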
Definition: A request is NEW if its month equals that user's first-ever request month; otherwise RETURNING.
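The definition can be sketched with exact in-memory structures as follows. This is an illustration of the classification logic only, not a solution that fits the stated memory budget: the class name, the plain `dict`/`set` state, and the use of `(year, month)` tuples as month keys are all assumptions. The point it shows is that a late event from an earlier month moves the user's first month back, so requests already counted as NEW in the later month must be reclassified as RETURNING.

```python
from collections import defaultdict

class NewVsReturningCounter:
    """Exact, single-process sketch; months are comparable (year, month) tuples."""

    def __init__(self):
        self.first = {}              # user_id -> [first_month, NEW requests counted in it]
        self.seen_requests = set()   # exact dedup; a Bloom filter could replace it at scale
        self.new = defaultdict(int)
        self.returning = defaultdict(int)

    def add(self, user_id, request_id, month):
        if request_id in self.seen_requests:
            return  # duplicate delivery, ignore
        self.seen_requests.add(request_id)
        state = self.first.get(user_id)
        if state is None:
            self.first[user_id] = [month, 1]
            self.new[month] += 1
        elif month == state[0]:
            state[1] += 1
            self.new[month] += 1
        elif month > state[0]:
            self.returning[month] += 1
        else:
            # Late event from an earlier month: the old "first month"
            # requests were not NEW after all, so move them over.
            old_month, old_count = state
            self.new[old_month] -= old_count
            self.returning[old_month] += old_count
            self.first[user_id] = [month, 1]
            self.new[month] += 1

    def close(self, month):
        n, r = self.new[month], self.returning[month]
        total = n + r
        new_pct = 100.0 * n / total if total else 0.0
        ret_pct = 100.0 * r / total if total else 0.0
        return {"new": n, "returning": r, "new_pct": new_pct, "returning_pct": ret_pct}

counter = NewVsReturningCounter()
feb, jan = (2024, 2), (2024, 1)
counter.add("u1", "r1", feb)
counter.add("u1", "r2", feb)
counter.add("u1", "r2", feb)   # duplicate
counter.add("u2", "r3", feb)
counter.add("u1", "r0", jan)   # late arrival moves u1's first month to January
result = counter.close(feb)
```

The reclassification is only safe while the later month is still open, so this sketch relies on a month not being emitted until the lateness bound has passed. It does not itself reject events that arrive after their month has closed.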