Distributed System Design: Global Median and Global Mode at Massive Scale
Context
You are designing a distributed analytics system that must compute the global median and the global mode over a dataset with hundreds of billions of records stored across many machines. The system should support both batch and streaming (including sliding windows) and be efficient, fault-tolerant, and scalable.
Assume records each contain a single value x (numeric for median; categorical or integer for mode). You may make minimal, explicit assumptions as needed.
Tasks
-
Specify data partitioning strategies (batch and streaming) to compute:
-
Global median
-
Global mode
-
Define local sketches/summaries computed on each worker.
-
Describe how reducers aggregate partial results.
-
Minimize network I/O (explain how and why).
-
Compare exact vs approximate approaches (quantile sketches, heavy-hitter sketches) with error/resource trade-offs.
-
Discuss handling data skew and hot keys.
-
Address fault tolerance and recovery.
-
Support incremental/streaming updates.
-
Provide time, space, and communication complexity.
-
Extend the design to a sliding-window median and mode over an unbounded stream.