Design Top K ranking system
System Design: Real-time Top-K from a Large/Streaming Dataset
Context
You receive a continuous, high-volume stream of events, each referencing an item (e.g., item_id). The system must continuously identify the Top K most frequent items and serve low-latency queries. Assume:
-
Data volume is large (potentially millions of events per second), item cardinality can be high, and K is small (e.g., 10–1,000).
-
Queries may request Top K for different time windows (e.g., last 1 minute, 1 hour, 1 day) and potentially by a grouping key (e.g., per region, per tenant).
-
Results should be near-real-time with bounded staleness.
Requirements
Design a system that:
-
Ingests a large/streaming dataset and continuously identifies the Top K elements.
-
Chooses suitable data structures for exact and approximate solutions.
-
Scales horizontally across shards/partitions.
-
Handles updates: insertions, deletions/expirations (e.g., sliding windows), out-of-order/late events.
-
Supports high-throughput, low-latency queries (read path), including caching/materialization.
-
Discusses consistency, fault tolerance, and operational considerations.
Provide the design, trade-offs, and key algorithms/data structures. Include complexity and accuracy considerations.
Constraints & Assumptions
-
Preserve the scope, facts, inputs, and requested outputs from the prompt above.
-
If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
-
Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
-
Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
-
State explicit assumptions before making sizing or architecture decisions.
-
Prioritize the functional path first, then address reliability, security, observability, and rollout.
What a Strong Answer Covers
-
A scoped requirements summary with concrete non-goals and success metrics.
-
API, data model, architecture, consistency, capacity, and operations.
-
Reasoned trade-offs among simple and scalable designs, including bottlenecks and failure modes.
-
A validation, monitoring, migration, and launch plan appropriate for the risk level.
Follow-up Questions
-
What breaks first at 10x traffic or data volume?
-
How would you degrade gracefully during dependency failures?
-
What metrics and alerts would prove the design is healthy after launch?