System Design: Distributed Word Frequency Counting (No MapReduce)
Context
You need to design a distributed system that computes word frequencies over terabytes of text data. The system must not use MapReduce but should still scale horizontally and produce both global counts and top‑K words. Assume the data can arrive as batch files or continuous streams.
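For scale reference, the single-node computation being distributed is just a token count. A minimal Python sketch (the lowercasing and `\w+` tokenization rule are illustrative assumptions, not part of the problem statement):

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count word frequencies in a chunk of text.

    Tokenization here (lowercase, \\w+ runs) is an assumed placeholder;
    a real design would pin down normalization rules explicitly.
    """
    return Counter(re.findall(r"\w+", text.lower()))

counts = word_counts("To be, or not to be")
# counts["to"] == 2 and counts["be"] == 2
```

The distributed design is essentially this function applied per chunk, with the resulting partial `Counter`s partitioned, aggregated, and served at scale.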
Requirements
Describe an end‑to‑end design that covers:
- Ingestion
  - How raw text enters the system (e.g., log/stream service).
  - Chunking, schema, and ordering assumptions.
- Partitioning/sharding
  - How tokens are partitioned across shards (e.g., consistent hashing).
  - Mitigating hot keys (e.g., salting/splitting heavy tokens).
- Aggregation
  - How partial counts are produced and aggregated.
  - How to combine salted partitions to produce global per‑word counts.
- Global results and top‑K
  - How to compute exact global counts and global top‑K across shards.
  - Clarify whether approximation is acceptable and which algorithm you'd use.
- Reliability and correctness
  - Fault tolerance and recovery from node failures.
  - Idempotency and handling of retries.
  - Exactly‑once vs. at‑least‑once delivery semantics (trade‑offs and how you'd achieve each).
  - Backpressure and flow control.
- Storage and serving
  - Where results are stored (per‑word counts, snapshots, metadata).
  - How they are exposed (APIs, latency, consistency expectations).
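To illustrate the salted-partition point above: a heavy token such as "the" can be spread over R sub-keys at write time so no single shard owns it, then re-summed at read time. A minimal Python sketch, assuming a `word#i` salt format, R = 4, and a known hot-key set (all of these are illustrative choices):

```python
import random
from collections import defaultdict

R = 4                # number of salt buckets per hot key (assumed)
HOT = {"the"}        # hot-key set, assumed known (e.g., from profiling)

def salted_key(word: str) -> str:
    """Spread a hot word across R sub-keys; cold words pass through."""
    if word in HOT:
        return f"{word}#{random.randrange(R)}"
    return word

def combine(partials: dict) -> dict:
    """Merge salted sub-keys back into global per-word counts."""
    totals = defaultdict(int)
    for key, count in partials.items():
        totals[key.split("#", 1)[0]] += count
    return dict(totals)

# Partial counts as they might land across shards:
partials = {"the#0": 10, "the#1": 12, "the#2": 9, "the#3": 11, "cat": 3}
print(combine(partials))  # {'the': 42, 'cat': 3}
```

The write path pays an extra fan-out for hot keys only; the read path folds the sub-keys back together with a single sum per word.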
Make reasonable assumptions explicit. Describe diagrams verbally if helpful, and include small worked examples to clarify top‑K computation and salted‑key combination.
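As one concrete shape such a top‑K example could take: with exact per-shard counts, a coordinator can sum the per-word partials from every shard and keep the K heaviest words, which `heapq.nlargest` does in O(n log K). A sketch (the shard data is invented):

```python
import heapq
from collections import Counter

def global_top_k(shard_counts, k):
    """Exact global top-K: sum per-shard partial counts, then take
    the k words with the largest totals.

    shard_counts: iterable of {word: count} dicts, one per shard.
    Returns a list of (word, total) pairs, heaviest first.
    """
    total = Counter()
    for partial in shard_counts:
        total.update(partial)  # Counter.update adds counts, not replaces
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

shards = [
    {"the": 42, "cat": 3, "sat": 5},
    {"the": 17, "mat": 5, "cat": 4},
]
print(global_top_k(shards, 2))  # [('the', 59), ('cat', 7)]
```

This is the exact path; an answer allowing approximation would instead discuss sketch-based summaries that trade a bounded error for much smaller per-shard state.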