PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Anthropic

Design distributed median and mode

Last updated: Apr 24, 2026

Quick Overview

This question evaluates skills in distributed systems and large-scale analytics, covering data partitioning, streaming and sliding-window processing, quantile and heavy-hitter summarization, fault tolerance, and trade-off analysis for time, space, and network I/O.

  • hard
  • Anthropic
  • System Design
  • Software Engineer

Design distributed median and mode

Company: Anthropic

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

You have a dataset with hundreds of billions of records spread across many machines. Design how to compute the global median and the global mode. Specify the data-partitioning strategy, local sketches or summaries, how reducers aggregate partial results, and how you minimize network I/O. Compare exact vs approximate approaches (e.g., quantile sketches, heavy-hitter sketches), discuss skew handling, fault tolerance, and incremental/streaming updates. Provide time, space, and communication complexity, and extend the design to a sliding-window median/mode over an unbounded stream.

Quick Answer: This question evaluates skills in distributed systems and large-scale analytics, covering data partitioning, streaming and sliding-window processing, quantile and heavy-hitter summarization, fault tolerance, and trade-off analysis for time, space, and network I/O.

Related Interview Questions

  • Design a One-on-One Chat Service - Anthropic (medium)
  • Design a prompt playground - Anthropic (hard)
  • Scale Duplicate File Detection - Anthropic (medium)
  • Design a one-to-one chat system - Anthropic (medium)
  • Design One-to-One Chat - Anthropic (medium)
|Home/System Design/Anthropic

Design distributed median and mode

Anthropic logo
Anthropic
Sep 6, 2025, 12:00 AM
hardSoftware EngineerOnsiteSystem Design
155
0

Design a Distributed System for Global Median and Global Mode at Massive Scale

Context

You are designing a distributed analytics system that computes two statistics over a single dataset:

  • the global median of a numeric field, and
  • the global mode (most frequent value) of a categorical/integer field.

The dataset holds hundreds of billions of records spread across many machines. Each record contains a single value x — treat it as numeric when computing the median (ordered) and as categorical or integer when computing the mode (equality only).

The system must support both batch and streaming workloads (including sliding windows over an unbounded stream) and be efficient, fault-tolerant, and scalable. You may make minimal assumptions as long as you state them explicitly.

The central tension to keep in mind throughout: the median is an ordered statistic while the mode is an equality-only statistic, and at this scale the dominant cost is cross-machine communication, not local CPU. A strong design treats these as two distinct queries rather than forcing one pipeline onto both.

Constraints & Assumptions

These anchor the prompt; if the interviewer leaves them open, state your own and proceed.

  • Volume: N≫1011N \gg 10^{11}N≫1011 records, partitioned across roughly p≈103p \approx 10^3p≈103 worker machines.
  • Record shape: each record is a single value x ; numeric for the median (total order exists), categorical/integer for the mode (equality only).
  • Two regimes: a one-shot batch computation over data at rest, and a streaming computation over an unbounded, possibly out-of-order stream — including a sliding window of the most recent WWW time units.
  • Cost model: treat network bytes moved and the busiest single link as the bottleneck; local sequential scans are cheap and embarrassingly parallel.
  • Approximation is on the table: small, bounded error is acceptable when the interviewer signals a latency or network budget, but you should be able to articulate exact methods too.

Clarifying Questions to Ask

Before designing, scope the problem with the interviewer:

  • Exact or approximate? Is a bounded-error answer acceptable (and what error budget), or must the result be provably exact?
  • Latency / freshness SLA? Sub-second dashboard refresh, or an offline batch job that can take minutes and several passes over the data?
  • Value domain? Is the numeric domain small and known (e.g. ages, ratings) so a fixed histogram suffices, or unbounded/continuous? How many distinct values DDD for the mode — near NNN or far smaller?
  • Workload split? Read:write ratio, append-only vs. updatable, and how skewed the data is (are there a few wildly hot values or value ranges?).
  • Window semantics? For the streaming case: window size WWW , event-time vs. processing-time, and tolerated lateness.
  • Failure budget? Effectively-once vs. at-least-once results, and how much state we can afford to checkpoint.

What to Design

Work through the following, treating median and mode as distinct queries (their ordered-vs-equality nature may call for different strategies):

  1. Data partitioning strategy — specify how data is partitioned to compute each statistic, for both batch and streaming modes:
    • Global median
    • Global mode
  1. Local summaries — define the sketches or summaries each worker computes over its local shard.
  1. Reducer aggregation — describe how reducers combine partial results into the final answer.
  1. Network I/O — explain how you minimize network I/O , and why this is the dominant cost at this scale.
  1. Exact vs. approximate — compare exact and approximate approaches (e.g. quantile sketches for the median, heavy-hitter sketches for the mode), with their error and resource trade-offs .
  1. Skew and hot keys — discuss how you detect and handle data skew and hot keys .
  1. Fault tolerance and recovery — address node/coordinator failures and recovery.
  1. Incremental / streaming updates — support continuous updates as new records arrive.
  1. Complexity — provide time, space, and communication complexity for your approaches.
  1. Sliding window — extend the design to a sliding-window median and mode over an unbounded stream.

What a Strong Answer Covers

The interviewer is listening for these dimensions, not a memorized recipe:

  • Treats median and mode as structurally different queries — order-based vs. equality-based — and chooses a partitioning for each that matches its access pattern, rather than forcing one pipeline onto both.
  • Communication-first reasoning — recognizes the shuffle, not CPU, is the bottleneck, and designs to move O(summaries)O(\text{summaries})O(summaries) instead of O(N)O(N)O(N) raw data.
  • Local summaries chosen with intent — a bounded-size, mergeable summary appropriate to each query, emitting O(summary)O(\text{summary})O(summary) per worker rather than raw rows, with a clear combine step.
  • Correct aggregation primitives — associative + commutative merges enabling tree aggregation, with awareness of additivity vs. idempotency.
  • Exact vs. approximate trade-off articulated with real guarantees — names the sketches it relies on and states the actual error model each one provides (rather than assuming every sketch carries a formal bound), and knows exact mode must account for every distinct key.
  • Skew & hot-key handling for both queries, with detection plus mitigation.
  • Fault tolerance for batch and streaming, including the non-idempotent retry hazard.
  • Streaming + windowing — handles out-of-order arrival and lateness, evicts expired data from non-deletable summaries, and reasons about the resulting error/cost implications.
  • Complexity stated on all three axes (time, space, communication) plus error, ideally with a worked sizing sanity-check.

Follow-up Questions

After the main design, expect deeper probes:

  • What breaks first at 100× scale? Which resource saturates — coordinator incast, hot-key reducers, sketch memory, or checkpoint overhead — and how do you mitigate it?
  • Exact streaming median is the expensive case — walk through why re-imposing order on a live stream is costlier than the approximate path, and what you'd give up to make it affordable.
  • Tail vs. median accuracy — if the team later needs p99 (not just p50), how does your sketch choice change, and why might one sketch be strong at the tails but weaker at the median?
  • Compounding error — in the windowed mode path that ages counts, how do the windowed-count error and the frequency-sketch overcount interact, and how would you budget the two error parameters jointly?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Anthropic•More Software Engineer•Anthropic Software Engineer•Anthropic System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.