Design real-time top-K products service
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
##### Question
Design a real-time service that ingests a continuous stream of purchase events with the schema `(customer_id, item_id, timestamp)` and continuously computes and serves the **top-K most purchased items**. The service must support multiple rolling windows at once, for example:
- A short sliding window (e.g. last 5 minutes, or a configurable window `T` such as the last 1 hour), and
- A long window and **daily rolling totals** (e.g. last 24 hours, plus per-calendar-day totals with both a running "today" view and finalized prior days).
Work through the full design and address each of the following:
1. **Data model** for the ingested event and for the materialized top-K result records.
2. **Ingestion pipeline and partitioning** — durable log, partition key choice, and hot-key mitigation.
3. **Windowing and aggregation strategy** — sliding windows vs. tumbling micro-buckets with add/subtract, keyed windows, slide cadence, and how daily totals are finalized.
4. **Algorithms and data structures** for maintaining top-K at scale (e.g. count map + min-heap for exact, or approximate heavy-hitters such as Space-Saving / Misra-Gries and Count-Min Sketch + heap), including the two-stage local-then-global merge pattern.
5. **Event-time processing** with watermarks: handling **late, out-of-order, and duplicate** events; allowed-lateness budgets per window.
6. **State management and TTL** — embedded state store, checkpointing, per-window namespaces, and eviction.
7. **Fault tolerance and exactly-once vs. at-least-once** semantics.
8. **Horizontal scalability** — partitioning, key-splitting for skew, autoscaling.
9. **Serving and query APIs** — how clients fetch the current top-K per window/day, plus optional streaming push.
10. **Latency vs. accuracy trade-offs** and how you would tune them.
Quick Answer: An Amazon software-engineer system-design screen: design a real-time service that ingests a stream of purchase events (customer_id, item_id, timestamp) and continuously serves the top-K most purchased items over multiple rolling windows (e.g. last 5 minutes, last 1 hour, 24 hours, and daily totals). The answer covers ingestion and partitioning, event-time windowing with watermarks, exact min-heap vs. approximate Count-Min-Sketch / Space-Saving top-K, late/duplicate/out-of-order handling, state and TTL, exactly-once fault tolerance, horizontal scaling, serving APIs, and latency-vs-accuracy trade-offs.