This question evaluates skills in distributed systems and large-scale analytics, covering data partitioning, streaming and sliding-window processing, quantile and heavy-hitter summarization, fault tolerance, and trade-off analysis for time, space, and network I/O.

You are designing a distributed analytics system that must compute the global median and the global mode over a dataset with hundreds of billions of records stored across many machines. The system should support both batch and streaming (including sliding windows) and be efficient, fault-tolerant, and scalable.
Assume records each contain a single value x (numeric for median; categorical or integer for mode). You may make minimal, explicit assumptions as needed.
Login required