Implement deduped CTR/RPM aggregator over event stream
Company: Roblox
Role: Data Scientist
Category: Data Manipulation (SQL/Python)
Difficulty: Medium
Interview Round: Technical Screen
Implement a Python function to compute per-day, per-campaign CTR and RPM from an event stream with possible out-of-order and duplicate click events.
Input: An iterator of dicts, each with keys: {"dt": "YYYY-MM-DD", "event_time": ISO8601 string, "ad_request_id": str, "campaign_id": int, "event_type": "impression"|"click", "revenue_cents": int}. Events can arrive out of order; multiple clicks may share the same ad_request_id.
Rules:
1) Count exactly one impression per ad_request_id (if multiple impression events appear, count once).
2) Count at most one charged click per ad_request_id using the earliest click’s revenue; ignore later clicks for that ad_request_id.
3) For each (dt, campaign_id): CTR = charged_clicks / impressions; RPM = (sum(revenue_cents)/100) / impressions * 1000.
4) Memory constraints: process in O(U) additional memory where U is the number of active ad_request_ids seen in the current 24-hour window; do not materialize the full stream.
5) Time complexity: O(n) over n events.
6) Handle missing/invalid fields robustly (skip invalid records but continue), and support that an impression could arrive after a click for the same ad_request_id.
Output: A mapping {(dt, campaign_id): {"impressions": int, "charged_clicks": int, "ctr": float, "rpm": float}}.
Deliverables: the function with type hints and a docstring, plus two unit tests: (a) multiple clicks per request id; (b) out-of-order impression and click spanning two dates.
Quick Answer: This question evaluates competence in streaming event processing and deduplication, including stateful aggregation, time-window management, handling out-of-order and duplicate events, and computing CTR/RPM metrics.