This question evaluates a candidate's competency in designing scalable distributed systems and large-scale data processing pipelines, focusing on ingestion, partitioning/sharding, partial aggregation and merging, global top‑K computation, fault tolerance, idempotency, and storage/serving concerns.
You need to design a distributed system that computes word frequencies over terabytes of text data. The system must not use MapReduce but should still scale horizontally and produce both global counts and top‑K words. Assume the data can arrive as batch files or continuous streams.
Describe an end‑to‑end design that covers:
Make reasonable assumptions explicit. Describe diagrams verbally if helpful, and include small examples to clarify top‑K computation and how salted keys are recombined.
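As one illustration of the kind of small example the prompt asks for, the following is a minimal single-process sketch of salted keys and top‑K, not a reference answer. It assumes a hot word is spread over a fixed number of salt buckets (`NUM_SALTS`, a hypothetical parameter), that each worker keeps partial counts per salted key, and that a merge step strips the salt and sums the partials before top‑K is taken. All function names here are illustrative and do not come from the question.

```python
import heapq
import random
from collections import Counter

NUM_SALTS = 4  # assumed number of salt buckets per word (hypothetical tuning knob)


def salt_key(word, rng):
    """Append a random salt so one hot word spreads across several partitions."""
    return f"{word}#{rng.randrange(NUM_SALTS)}"


def unsalt_key(salted):
    """Strip the salt suffix; rsplit keeps any '#' inside the word itself intact."""
    return salted.rsplit("#", 1)[0]


def partial_counts(words, rng):
    """Worker-side partial aggregation over salted keys."""
    counts = Counter()
    for word in words:
        counts[salt_key(word, rng)] += 1
    return counts


def combine_salted(partials):
    """Merge step: remove salts and sum partial counts into global counts."""
    totals = Counter()
    for partial in partials:
        for salted, n in partial.items():
            totals[unsalt_key(salted)] += n
    return totals


def top_k(counts, k):
    """Top-K by count, ties broken alphabetically so the result is deterministic."""
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    words = ["the"] * 6 + ["cat"] * 3 + ["sat"] * 2 + ["mat"]
    rng = random.Random(0)
    # Two simulated workers, each seeing half of the input.
    partials = [partial_counts(words[0::2], rng), partial_counts(words[1::2], rng)]
    totals = combine_salted(partials)
    print(top_k(totals, 2))  # [('the', 6), ('cat', 3)]
```

In this sketch, top‑K is taken only after the salted partials are combined. With salting, a single word's count is split across buckets, so ranking the salted keys directly could underrate a frequent word; summing first avoids that.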