MapReduce Design: Common Availability From Busy Intervals
Context
You are given large-scale calendar data: each user has 0 or more busy intervals during the day. For a set of specified groups (each group is a set of user IDs), compute the common available time slots whose duration is at least d minutes. Assume all timestamps are normalized to UTC and intervals are half-open [start, end).
Input datasets:
-
Busy intervals: records (user_id, start_ts, end_ts)
-
Group membership: records (group_id, user_id)
-
Query parameter: minimum duration d (minutes)
Output:
-
For each (group_id, calendar_day), the list of common free intervals of length ≥ d.
Requirements
-
Define the Map outputs (keys/values), partitioning, and Reduce logic.
-
Explain how you discretize time (or avoid discretization) and the trade-offs.
-
Describe how you mitigate data skew (e.g., very large groups, rush-hour hotspots).
-
Explain how you validate correctness and performance at scale.