System Design: Near-Real-Time Activity Counting Service
Context
Build a service that ingests high-throughput client events and provides near-real-time aggregations of activity counts per user, device, and region. The system must support time-windowed queries (tumbling and sliding), deduplication/idempotency, hot-key sharding, and privacy-by-design. It should be resilient, observable, and support backfill/reprocessing.
Assume the service is multi-tenant and globally deployed with regional data residency. Reads should be near-real-time (within seconds); writes are very high-throughput, and clients may be offline and sync later.
Requirements
- APIs
  - increment(key, timestamp)
  - getCount(key)
  - getCount(key, timeWindow)
  - getUniqueActors(key, timeWindow)
- System properties
  - High write throughput with near-real-time reads
  - Idempotency and deduplication for retries/replays
  - Discuss exactly-once vs at-least-once delivery tradeoffs
  - Time-windowed aggregations: tumbling and sliding
  - Hot-key sharding to avoid partition hotspots
  - Storage choices (e.g., write-optimized store + aggregation layers)
  - Offline client buffering and sync
  - Retention and TTL policies
  - Backfill and reprocessing strategy
  - Privacy considerations
  - Monitoring and alerting
  - Capacity estimate (state assumptions) and failure modes with mitigations
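As a starting point for discussion, the API surface and three of the properties above (idempotent writes, tumbling windows, hot-key sharding) can be sketched as a single in-memory class. This is a minimal illustration under stated assumptions, not a reference design. The `event_id` and `actor_id` parameters are hypothetical additions: the listed `increment(key, timestamp)` signature carries neither, yet deduplication needs a stable event identifier and `getUniqueActors` needs an actor. The 60-second window, the 4-way shard fan-out, and the `(start, end)` tuple used for `timeWindow` are likewise assumed values chosen for the example.

```python
import zlib
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size
SHARDS_PER_KEY = 4   # assumed fan-out used to spread writes for a hot key


class ActivityCounter:
    """In-memory stand-in for the write path plus the aggregation store."""

    def __init__(self):
        # Dedup state; a real system would bound this with a TTL.
        self.seen_event_ids = set()
        # (key, shard, window_start) -> count
        self.counts = defaultdict(int)
        # (key, window_start) -> set of actor ids (exact, not approximate)
        self.actors = defaultdict(set)

    @staticmethod
    def _window_start(timestamp):
        # Align an epoch-seconds timestamp to the start of its tumbling window.
        ts = int(timestamp)
        return ts - ts % WINDOW_SECONDS

    def increment(self, key, timestamp, event_id, actor_id):
        """Apply one event; returns False if event_id was already applied.

        event_id and actor_id are hypothetical parameters (see lead-in).
        """
        if event_id in self.seen_event_ids:
            return False  # retry or replay: at-least-once delivery made idempotent
        self.seen_event_ids.add(event_id)
        # Hash the event id so writes for one key spread over several shards.
        shard = zlib.crc32(event_id.encode("utf-8")) % SHARDS_PER_KEY
        window = self._window_start(timestamp)
        self.counts[(key, shard, window)] += 1
        self.actors[(key, window)].add(actor_id)
        return True

    def get_count(self, key, time_window=None):
        """Sum shard counts for key; time_window is (start, end), end exclusive."""
        total = 0
        for (k, _shard, window), n in self.counts.items():
            if k != key:
                continue
            if time_window is None or time_window[0] <= window < time_window[1]:
                total += n
        return total

    def get_unique_actors(self, key, time_window):
        """Count distinct actors for key across the windows in [start, end)."""
        start, end = time_window
        unique = set()
        for (k, window), ids in self.actors.items():
            if k == key and start <= window < end:
                unique |= ids
        return len(unique)
```

In this sketch a sliding-window query is answered by summing consecutive tumbling buckets, so its resolution is limited to `WINDOW_SECONDS` and the `(start, end)` bounds are expected to fall on bucket boundaries. Late events from offline clients land in the bucket of their event timestamp rather than their arrival time. The exact actor sets grow with cardinality; a production design might substitute an approximate structure such as HyperLogLog, trading accuracy for memory.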