System Design: Distributed Word-Frequency Computation (No MapReduce)
Context
Design a distributed system that computes word frequencies over terabytes of text data without using MapReduce. Your design should support both batch (historical backfills) and streaming (near real-time) modes and scale horizontally.
Requirements
Describe and justify your approach for the following:
-
Ingestion
-
Batch input from large files in object storage.
-
Streaming input from event/log pipelines.
-
Partitioning & Routing
-
Hash-by-token with consistent hashing and virtual nodes.
-
Handling hot keys and data skew (e.g., stop-word mitigation, heavy-hitter routing).
-
Intermediate Storage
-
Durable buffering between stages; topic/queue design; compaction vs retention.
-
Aggregation Topology
-
Combiner layer, key ownership, state storage (e.g., local state stores), and how updates flow.
-
Processing Semantics
-
Exactly-once or at-least-once guarantees; idempotency strategy.
-
Fault Tolerance & Replay
-
Checkpointing state and offsets, recovery, reprocessing.
-
Backpressure & Flow Control
-
How slow consumers protect the system; throttling strategies.
-
Horizontal Scaling & Rebalancing
-
Adding/removing nodes; consistent hashing and state migration.
-
Monitoring & Operations
-
Metrics, alerts, tracing, skew detection, capacity signals.
-
SLA/SLO Considerations
-
Latency/throughput targets for streaming; completion targets for batch; error budgets.
-
Batch vs Streaming Trade-offs
-
Contrast the two approaches and justify your design choices.