Design distributed word count without MapReduce

Q: Design distributed word count without MapReduce

This question evaluates a candidate's competency in designing scalable distributed systems and large-scale data processing pipelines, focusing on ingestion, partitioning/sharding, partial aggregation and merging, global top‑K computation, fault tolerance, idempotency, and storage/serving concerns.

Q: How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

Question

System Design: Distributed Word Frequency Counting (No MapReduce)

Context

You need to design a distributed system that computes word frequencies over terabytes of text data. The system must not use MapReduce but should still scale horizontally and produce both global counts and top‑K words. Assume the data can arrive as batch files or continuous streams.

Requirements

Describe an end‑to‑end design that covers:

Ingestion
- How raw text enters the system (e.g., log/stream service).
- Chunking, schema, and ordering assumptions.
Partitioning/sharding
- How tokens are partitioned across shards (e.g., consistent hashing).
- Mitigating hot keys (e.g., salting/splitting heavy tokens).
Aggregation
- How partial counts are produced and aggregated.
- How to combine salted partitions to produce global per‑word counts.
Global results and top‑K
- How to compute exact global counts and global top‑K across shards.
- Clarify if approximation is acceptable and which algorithm you’d use.
Reliability and correctness
- Fault tolerance and recovery from node failures.
- Idempotency and handling retries.
- Exactly‑once vs at‑least‑once delivery semantics (trade‑offs and how you’d achieve each).
- Backpressure and flow control.
Storage and serving
- Where results are stored (per‑word counts, snapshots, metadata).
- How they are exposed (APIs, latency, consistency expectations).

Make reasonable assumptions explicit. Provide diagrams verbally if helpful and include any small examples to clarify top‑K and salted key combination.

Design distributed word count without MapReduce

Overview

System Design: Distributed Word Frequency Counting (No MapReduce)

Context

Requirements

Comments (0)