This question evaluates skills in distributed systems and large-scale analytics, covering data partitioning, streaming and sliding-window processing, quantile and heavy-hitter summarization, fault tolerance, and trade-off analysis for time, space, and network I/O.
You have a dataset with hundreds of billions of records spread across many machines. Design how to compute the global median and the global mode. Specify the data-partitioning strategy, local sketches or summaries, how reducers aggregate partial results, and how you minimize network I/O. Compare exact vs approximate approaches (e.g., quantile sketches, heavy-hitter sketches), discuss skew handling, fault tolerance, and incremental/streaming updates. Provide time, space, and communication complexity, and extend the design to a sliding-window median/mode over an unbounded stream.
Quick Answer: This question evaluates skills in distributed systems and large-scale analytics, covering data partitioning, streaming and sliding-window processing, quantile and heavy-hitter summarization, fault tolerance, and trade-off analysis for time, space, and network I/O.
Design a Distributed System for Global Median and Global Mode at Massive Scale
Context
You are designing a distributed analytics system that computes two statistics over a single dataset:
the
global median
of a numeric field, and
the
global mode
(most frequent value) of a categorical/integer field.
The dataset holds hundreds of billions of records spread across many machines. Each record contains a single value x — treat it as numeric when computing the median (ordered) and as categorical or integer when computing the mode (equality only).
The system must support both batch and streaming workloads (including sliding windows over an unbounded stream) and be efficient, fault-tolerant, and scalable. You may make minimal assumptions as long as you state them explicitly.
The central tension to keep in mind throughout: the median is an ordered statistic while the mode is an equality-only statistic, and at this scale the dominant cost is cross-machine communication, not local CPU. A strong design treats these as two distinct queries rather than forcing one pipeline onto both.
Constraints & Assumptions
These anchor the prompt; if the interviewer leaves them open, state your own and proceed.
Volume:N≫1011
records, partitioned across roughly
p≈103
worker machines.
Record shape:
each record is a single value
x
; numeric for the median (total order exists), categorical/integer for the mode (equality only).
Two regimes:
a one-shot
batch
computation over data at rest, and a
streaming
computation over an unbounded, possibly out-of-order stream — including a
sliding window
of the most recent
W
time units.
Cost model:
treat network bytes moved and the busiest single link as the bottleneck; local sequential scans are cheap and embarrassingly parallel.
Approximation is on the table:
small, bounded error is acceptable when the interviewer signals a latency or network budget, but you should be able to articulate exact methods too.
Clarifying Questions to Ask
Before designing, scope the problem with the interviewer:
Exact or approximate?
Is a bounded-error answer acceptable (and what error budget), or must the result be provably exact?
Latency / freshness SLA?
Sub-second dashboard refresh, or an offline batch job that can take minutes and several passes over the data?
Value domain?
Is the numeric domain small and known (e.g. ages, ratings) so a fixed histogram suffices, or unbounded/continuous? How many
distinct
values
D
for the mode — near
N
or far smaller?
Workload split?
Read:write ratio, append-only vs. updatable, and how skewed the data is (are there a few wildly hot values or value ranges?).
Window semantics?
For the streaming case: window size
W
, event-time vs. processing-time, and tolerated lateness.
Failure budget?
Effectively-once vs. at-least-once results, and how much state we can afford to checkpoint.
What to Design
Work through the following, treating median and mode as distinct queries (their ordered-vs-equality nature may call for different strategies):
Data partitioning strategy
— specify how data is partitioned to compute each statistic, for both
batch
and
streaming
modes:
Global
median
Global
mode
Local summaries
— define the sketches or summaries each worker computes over its local shard.
Reducer aggregation
— describe how reducers combine partial results into the final answer.
Network I/O
— explain how you
minimize network I/O
, and
why
this is the dominant cost at this scale.
Exact vs. approximate
— compare exact and approximate approaches (e.g.
quantile sketches
for the median,
heavy-hitter sketches
for the mode), with their
error and resource trade-offs
.
Skew and hot keys
— discuss how you detect and handle
data skew
and
hot keys
.
Fault tolerance and recovery
— address node/coordinator failures and recovery.
Incremental / streaming updates
— support continuous updates as new records arrive.
Complexity
— provide
time, space, and communication
complexity for your approaches.
Sliding window
— extend the design to a
sliding-window median and mode
over an unbounded stream.
What a Strong Answer Covers
The interviewer is listening for these dimensions, not a memorized recipe:
Treats median and mode as structurally different queries
— order-based vs. equality-based — and chooses a partitioning for each that matches its access pattern, rather than forcing one pipeline onto both.
Communication-first reasoning
— recognizes the shuffle, not CPU, is the bottleneck, and designs to move
O(summaries)
instead of
O(N)
raw data.
Local summaries chosen with intent
— a bounded-size, mergeable summary appropriate to each query, emitting
O(summary)
per worker rather than raw rows, with a clear combine step.
Correct aggregation primitives
— associative + commutative merges enabling tree aggregation, with awareness of additivity vs. idempotency.
Exact vs. approximate trade-off articulated with real guarantees
— names the sketches it relies on and states the
actual
error model each one provides (rather than assuming every sketch carries a formal bound), and knows exact mode must account for every distinct key.
Skew & hot-key handling
for both queries, with detection plus mitigation.
Fault tolerance
for batch and streaming, including the non-idempotent retry hazard.
Streaming + windowing
— handles out-of-order arrival and lateness, evicts expired data from non-deletable summaries, and reasons about the resulting error/cost implications.
Complexity stated on all three axes
(time, space, communication) plus error, ideally with a worked sizing sanity-check.
Follow-up Questions
After the main design, expect deeper probes:
What breaks first at 100× scale?
Which resource saturates — coordinator incast, hot-key reducers, sketch memory, or checkpoint overhead — and how do you mitigate it?
Exact streaming median is the expensive case
— walk through
why
re-imposing order on a live stream is costlier than the approximate path, and what you'd give up to make it affordable.
Tail vs. median accuracy
— if the team later needs p99 (not just p50), how does your sketch choice change, and why might one sketch be strong at the tails but weaker at the median?
Compounding error
— in the windowed mode path that ages counts, how do the windowed-count error and the frequency-sketch overcount interact, and how would you budget the two error parameters jointly?