Implement streaming CTR with deduplication
Company: Roblox
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: Medium
Interview Round: Technical Screen
Quick Answer: This question evaluates streaming data-processing and algorithmic design skills, specifically time-windowed aggregation, click deduplication, late-event handling, state management, and selection of data structures for per-campaign CTR computation.
Constraints
- Timestamps are valid ISO 8601 strings, sorted non-decreasing by ts within each stream.
- An event is in-window iff window_start < ts <= query_ts.
- At most one click is counted per imp_id (earliest wins).
- Clicks whose imp_id never appears in impressions are ignored.
- CTR is rounded to 4 decimal places; 0.0 when impressions_in_window == 0.
- Result is sorted ascending by campaign_id.
Examples
Input: ([{"imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:00:00+00:00"}, {"imp_id": "i2", "campaign_id": "A", "ts": "2026-01-01T11:00:00+00:00"}, {"imp_id": "i3", "campaign_id": "B", "ts": "2026-01-01T12:00:00+00:00"}], [{"click_id": "c1", "imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:05:00+00:00"}, {"click_id": "c2", "imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:06:00+00:00"}], "2026-01-01T13:00:00+00:00")
Expected Output: [('A', 2, 1, 0.5), ('B', 1, 0, 0.0)]
Explanation: Campaign A has 2 impressions and 2 clicks on i1; dedup keeps the earliest click only -> 1 click, CTR 1/2 = 0.5. Campaign B has 1 impression, no clicks -> CTR 0.0. All events fall within the 24h window ending 13:00.
Input: ([], [], "2026-01-01T13:00:00+00:00")
Expected Output: []
Explanation: No events at all -> empty result.
Input: ([{"imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:00:00+00:00"}], [{"click_id": "c1", "imp_id": "iX", "campaign_id": "A", "ts": "2026-01-01T10:05:00+00:00"}], "2026-01-01T11:00:00+00:00")
Expected Output: [('A', 1, 0, 0.0)]
Explanation: The click references imp_id 'iX' which was never seen as an impression, so it is ignored. Campaign A has 1 impression and 0 valid clicks -> CTR 0.0.
Input: ([{"imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:00:00+00:00"}, {"imp_id": "i2", "campaign_id": "A", "ts": "2026-01-03T10:00:00+00:00"}], [{"click_id": "c1", "imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:05:00+00:00"}, {"click_id": "c2", "imp_id": "i2", "campaign_id": "A", "ts": "2026-01-03T10:05:00+00:00"}], "2026-01-03T11:00:00+00:00")
Expected Output: [('A', 1, 1, 1.0)]
Explanation: Querying at 2026-01-03 11:00, the 24h window starts 2026-01-02 11:00. The Jan-01 impression/click are expired out of the window; only i2 and its click remain -> 1 impression, 1 click, CTR 1.0.
Input: ([{"imp_id": "i1", "campaign_id": "A", "ts": "2026-01-01T10:00:00+00:00"}, {"imp_id": "i2", "campaign_id": "A", "ts": "2026-01-01T10:10:00+00:00"}, {"imp_id": "i3", "campaign_id": "A", "ts": "2026-01-01T10:20:00+00:00"}], [{"click_id": "c1", "imp_id": "i2", "campaign_id": "A", "ts": "2026-01-01T10:15:00+00:00"}, {"click_id": "c2", "imp_id": "i3", "campaign_id": "A", "ts": "2026-01-01T10:25:00+00:00"}, {"click_id": "c3", "imp_id": "i3", "campaign_id": "A", "ts": "2026-01-01T10:22:00+00:00"}], "2026-01-01T11:00:00+00:00")
Expected Output: [('A', 3, 2, 0.6667)]
Explanation: 3 impressions on A. i2 has one click; i3 has two clicks (a later c2 and an earlier, late-arriving c3) -> dedup keeps the earliest (10:22) so it counts once. 2 dedup clicks / 3 impressions = 0.6667.
Hints
- Build a map imp_id -> campaign_id from ALL impressions so you can validate and attribute clicks, but only count impressions whose ts falls in the window.
- Deduplicate clicks by tracking the earliest timestamp per imp_id BEFORE applying the window filter to clicks.
- Decide a click's window membership by its earliest deduped timestamp, not by any later duplicate.
- A campaign appears in the output if it has any in-window impression OR any in-window dedup click; guard division by zero for CTR.