Design a scalable job to compute rolling 15-minute meeting concurrency and room utilization
You are given a massive log of meeting events with fields:
- user_id
- start_time (UTC)
- end_time (UTC)
- room_id
- office_id
Compute, for each rolling 15-minute window (assume slide = 1 minute unless specified otherwise):
- Concurrent meeting counts per office (report the peak concurrent meetings within each 15-minute window).
- Rooms whose utilization exceeds 80% within the window, where utilization = occupied_time_in_window / 900 seconds; a worked example follows this list.
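For concreteness, here is a minimal sketch (plain Python, illustrative numbers) of the per-window overlap and utilization check:

```python
# Minimal sketch: how many seconds of one meeting fall inside one window.
# Times are epoch seconds; a 15-minute window is 900 seconds long.
def overlap_seconds(m_start: int, m_end: int, w_start: int, w_end: int) -> int:
    """Seconds of [m_start, m_end) that overlap [w_start, w_end)."""
    return max(0, min(m_end, w_end) - max(m_start, w_start))

# A meeting running 10:02-10:20 against the 10:00-10:15 window occupies 13 of
# 15 minutes: 780 / 900 ~ 0.867 > 0.80, so the room is flagged for that window.
occupied = overlap_seconds(m_start=120, m_end=1200, w_start=0, w_end=900)
print(occupied / 900)  # 0.8666...

# Note: when a room has overlapping or back-to-back bookings, occupied time is
# the union of the intervals (capped at 900 s), not their sum.
```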
Design a MapReduce or Spark job that:
- Specifies map keys/values and how you transform intervals into windowed aggregates (e.g., a line-sweep with +1/−1 deltas at interval boundaries vs. splitting each interval across the windows it touches); a map-stage sketch follows the prompt.
- Describes the shuffle/partition strategy that minimizes skew and bounds per-reducer work (see the salting sketch below).
- Explains the handling of late and duplicated events (event-time watermarks, dedup keys, update/retraction strategy); see the dedup sketch below.
- Proposes validation and correctness checks that hold at scale (see the invariant sketch below).
Make minimal, explicit assumptions if needed (e.g., presence of meeting_id for dedup, window slide granularity).
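One possible map stage, sketched in PySpark under stated assumptions: a `meetings` DataFrame holding the prompt's fields as epoch-second timestamps (the toy data, column names, and app name are illustrative, not prescribed by the prompt). Each interval maps to two delta events, and a per-office prefix sum recovers the concurrency level at each minute:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("concurrency-sketch").getOrCreate()

# Toy input in the prompt's schema, with epoch-second timestamps.
meetings = spark.createDataFrame(
    [("u1", 120, 1200, "r1", "sfo"), ("u2", 300, 600, "r2", "sfo")],
    "user_id string, start_time long, end_time long, room_id string, office_id string")

# Map: every interval becomes two delta events, +1 at its start minute and
# -1 at its end minute, matching the 1-minute slide granularity.
deltas = (
    meetings.select(
        "office_id",
        (F.floor(F.col("start_time") / 60) * 60).alias("minute"),
        F.lit(1).alias("delta"))
    .unionByName(meetings.select(
        "office_id",
        (F.ceil(F.col("end_time") / 60) * 60).alias("minute"),
        F.lit(-1).alias("delta"))))

# Reduce: net the deltas per (office, minute), prefix-sum them into a running
# concurrency level, then take a trailing 15-minute max for each window's peak.
per_office = Window.partitionBy("office_id").orderBy("minute")
per_minute = (
    deltas.groupBy("office_id", "minute").agg(F.sum("delta").alias("net"))
    .withColumn("concurrent",
                F.sum("net").over(per_office.rowsBetween(Window.unboundedPreceding, 0))))
peaks = per_minute.withColumn(
    "peak_15m", F.max("concurrent").over(per_office.rangeBetween(-14 * 60, 0)))
peaks.show()
# Caveat: minutes with no events produce no row here; a production job would
# first join against a dense minute spine so levels carry across quiet windows.
```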
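For the shuffle stage, one common skew-mitigation pattern is two-phase aggregation with a salt on hot keys. A sketch, reusing the `deltas` frame from the previous block (`N_SALT` is a hypothetical tuning knob):

```python
from pyspark.sql import functions as F

# Two-phase aggregation: salting fans a hot office across several reducers,
# then a second, much smaller aggregation merges the per-salt partials.
N_SALT = 8  # hypothetical fan-out; tune to the observed skew

partials = (
    deltas.withColumn("salt", (F.rand(seed=7) * N_SALT).cast("int"))
          .groupBy("office_id", "minute", "salt")
          .agg(F.sum("delta").alias("partial_net")))
net = (partials.groupBy("office_id", "minute")
               .agg(F.sum("partial_net").alias("net")))
# Note: plain sums already get map-side combine in Spark, so salting matters
# most for non-combinable aggregates; bounding per-reducer work additionally
# comes from keying the shuffle by (office_id, coarse time bucket), e.g. via
# deltas.repartitionByRange("office_id", "minute").
```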
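For dedup and late data, a batch-side sketch, assuming each event additionally carries a meeting_id (per the prompt's allowance) and a hypothetical ingested_at timestamp; it reuses the `spark` session from the map-stage block:

```python
from pyspark.sql import functions as F, Window

# Toy duplicated input: the same meeting_id ingested twice, the second row a
# later revision (meeting_id and ingested_at are assumed fields, see above).
raw_events = spark.createDataFrame(
    [("m1", "u1", 120, 1200, "r1", "sfo", 1300),
     ("m1", "u1", 120, 1260, "r1", "sfo", 1400)],
    "meeting_id string, user_id string, start_time long, end_time long, "
    "room_id string, office_id string, ingested_at long")

# Last-writer-wins dedup: keep only the freshest record per meeting_id.
latest_first = Window.partitionBy("meeting_id").orderBy(F.col("ingested_at").desc())
deduped = (raw_events
           .withColumn("rn", F.row_number().over(latest_first))
           .where("rn = 1")
           .drop("rn"))

# Late data: an event that lands after its window was published triggers a
# recompute of just the affected (office_id, window) keys, emitted downstream
# as an upsert or as a retraction followed by the corrected row. In Structured
# Streaming the analogous policy is a watermark plus keyed dedup, e.g.:
#   stream.withWatermark("end_time", "30 minutes").dropDuplicates(["meeting_id"])
```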
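Finally, a few cheap invariants that catch most correctness bugs at scale; a sketch over the hypothetical DataFrames built in the blocks above:

```python
from pyspark.sql import functions as F

# Invariant 1: line-sweep deltas must cancel out globally (every +1 has a -1).
assert deltas.agg(F.sum("delta")).first()[0] == 0

# Invariant 2: the running concurrency level is never negative (and should
# never exceed the count of distinct rooms in the office).
assert per_minute.where(F.col("concurrent") < 0).count() == 0

# Invariant 3: per-room utilization must land in [0, 1] after overlapping
# bookings are unioned; a value above 1 means an interval merge was missed.

# Spot-check: brute-force recompute one sampled office/day slice with the
# naive O(meetings x windows) algorithm and diff it against the job's output.
sample = meetings.where(F.col("office_id") == "sfo")  # hypothetical slice
```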