System Design: Production-Ready Event-Timeout Detector
You are designing a production service that detects when an event fails to occur within a specified duration (a timeout). For example, after a Start event for event_id=X, if a matching Stop or Heartbeat does not arrive within T, emit a Timeout for X.
Define and evaluate a production-ready architecture that addresses the following:
External API and Data Model
- Specify request/response APIs to:
  - Create/schedule a timeout for an event_id
  - Cancel or complete the event
  - Extend or heartbeat the timeout
  - Query the current status
  - Receive timeout notifications (push or pull)
- Provide data schemas, including idempotency keys and dedup strategy.
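As an illustrative starting point rather than a required answer, the sketch below shows one possible shape for a schedule request and an idempotent handler. Every name here (`ScheduleRequest`, `TimerRecord`, `TimeoutStore`, `idempotency_key`, the status strings) is hypothetical, and an in-memory dict stands in for whatever durable store the design chooses.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ScheduleRequest:
    event_id: str
    timeout_ms: int
    idempotency_key: str  # client-generated; the same key is sent on every retry


@dataclass
class TimerRecord:
    event_id: str
    deadline_ms: int
    status: str = "PENDING"  # PENDING | COMPLETED | TIMED_OUT (hypothetical states)
    version: int = 1         # bumped on each state change


class TimeoutStore:
    """In-memory stand-in for a durable store (assumption for illustration)."""

    def __init__(self) -> None:
        self._timers: Dict[str, TimerRecord] = {}
        self._seen: Dict[str, TimerRecord] = {}  # idempotency_key -> first result

    def schedule(self, req: ScheduleRequest, now_ms: int) -> TimerRecord:
        # A retried request returns the original record instead of re-arming the timer.
        if req.idempotency_key in self._seen:
            return self._seen[req.idempotency_key]
        rec = TimerRecord(req.event_id, now_ms + req.timeout_ms)
        self._timers[req.event_id] = rec
        self._seen[req.idempotency_key] = rec
        return rec

    def complete(self, event_id: str) -> Optional[TimerRecord]:
        # Completing an already-completed event is a no-op, so retries are safe.
        rec = self._timers.get(event_id)
        if rec is not None and rec.status == "PENDING":
            rec.status = "COMPLETED"
            rec.version += 1
        return rec

    def status(self, event_id: str) -> Optional[str]:
        rec = self._timers.get(event_id)
        return rec.status if rec else None
```

In this sketch a retry carrying the same `idempotency_key` cannot push the deadline forward, which is one way to keep client retries from silently extending a timeout. A real design would also need to bound how long keys are retained.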
Time Semantics and Clock Skew
- Define whether timeouts are evaluated in event time or processing time, and the rationale.
- Describe watermarking/allowed lateness if you choose event time.
- Explain how to handle client/server clock skew, including guardrails.
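One possible guardrail for clock skew, sketched here under assumptions, is to treat the client timestamp as a hint and clamp it to a window around the server clock before computing the deadline. The constant `MAX_SKEW_MS` and both function names are hypothetical, and the 5-second tolerance is an arbitrary placeholder.

```python
MAX_SKEW_MS = 5_000  # hypothetical tolerance; tune per deployment


def effective_start_ms(client_ts_ms: int, server_now_ms: int,
                       max_skew_ms: int = MAX_SKEW_MS) -> int:
    """Clamp a client-supplied timestamp to within max_skew_ms of server time."""
    lo = server_now_ms - max_skew_ms
    hi = server_now_ms + max_skew_ms
    return max(lo, min(client_ts_ms, hi))


def deadline_ms(client_ts_ms: int, server_now_ms: int, timeout_ms: int) -> int:
    """Deadline derived from the clamped start time plus the requested timeout."""
    return effective_start_ms(client_ts_ms, server_now_ms) + timeout_ms
```

For example, with server time 100,000 ms and a client claiming 90,000 ms, the start is clamped to 95,000 ms, so a 30,000 ms timeout yields a deadline of 125,000 ms. An alternative guardrail is to reject out-of-window timestamps outright instead of clamping them.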
Partitioning, State, Consistency, Recovery
- Sharding strategy by event_id, partition ownership, and rebalancing.
- State storage choice (in-memory vs local persistent vs remote KV) and what is stored.
- Consistency model for reads (linearizable vs eventual) and for emitted timeouts.
- Failure recovery: restarts, replays, snapshots, and exactly-once/at-least-once guarantees.
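For the sharding item above, a minimal sketch of stable partition assignment by `event_id` might look like the following. The function name `partition_for` is hypothetical, and plain modulo hashing is only one option.

```python
import hashlib


def partition_for(event_id: str, num_partitions: int) -> int:
    """Map an event_id to a partition using a hash that is stable across processes."""
    # Python's built-in hash() is salted per process, so a cryptographic digest
    # is used here purely for its stability, not for security.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note that modulo hashing remaps most keys when `num_partitions` changes. A design that expects frequent rebalancing may prefer consistent hashing or a fixed number of virtual partitions assigned to owners.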
Performance and Resilience
- Data structures for timers (timer wheels vs min-heap vs multi-level wheels), batching, and LRU/near-term caching.
- Backpressure strategy across API and ingestion pipeline.
- Memory limits and spill-to-disk strategy.
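To make the timer data-structure comparison concrete, here is a minimal sketch of the min-heap option with lazy invalidation, where cancel and heartbeat do not search the heap. The class `HeapTimers` and its method names are hypothetical, and a timer wheel would trade this structure's O(log n) insert for O(1) insert at coarser granularity.

```python
import heapq
from typing import Dict, List, Tuple


class HeapTimers:
    """Min-heap of (deadline_ms, event_id) with lazy invalidation of stale entries."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, str]] = []
        self._deadline: Dict[str, int] = {}  # current deadline per live event

    def schedule(self, event_id: str, deadline_ms: int) -> None:
        # Also used for heartbeat/extend: any older heap entry becomes stale.
        self._deadline[event_id] = deadline_ms
        heapq.heappush(self._heap, (deadline_ms, event_id))

    def cancel(self, event_id: str) -> None:
        # O(1): the heap entry stays behind and is discarded when popped.
        self._deadline.pop(event_id, None)

    def pop_expired(self, now_ms: int) -> List[str]:
        expired: List[str] = []
        while self._heap and self._heap[0][0] <= now_ms:
            deadline, event_id = heapq.heappop(self._heap)
            # Skip entries that were cancelled or superseded by a newer deadline.
            if self._deadline.get(event_id) == deadline:
                del self._deadline[event_id]
                expired.append(event_id)
        return expired
```

A short usage example:

```python
timers = HeapTimers()
timers.schedule("a", 100)
timers.schedule("b", 200)
timers.schedule("b", 400)   # heartbeat extends b
print(timers.pop_expired(250))  # ['a']
```

The cost of lazy invalidation is that stale entries occupy memory until their original deadline passes, which matters for the memory-limit item when heartbeats are frequent.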
Production Edge Cases
- Late, dropped, duplicated, and reordered messages
- Very large timeouts (hours/days), long silences/inactivity
- Service restarts and partition moves
- Idempotency and side-effect delivery (e.g., notifications)
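For duplicated and reordered messages, one commonly used technique is a per-event sequence check, sketched below. It assumes producers attach a monotonically increasing sequence number to each message for a given event_id, which the prompt does not specify. The class name `SequenceGate` is hypothetical.

```python
from typing import Dict


class SequenceGate:
    """Accept a message only if its sequence number is newer than the last applied one."""

    def __init__(self) -> None:
        self._last_seq: Dict[str, int] = {}

    def accept(self, event_id: str, seq: int) -> bool:
        if seq <= self._last_seq.get(event_id, -1):
            return False  # duplicate or stale (reordered) message
        self._last_seq[event_id] = seq
        return True
```

This drops late messages rather than buffering them, so it suits heartbeats (where only the newest matters) better than flows where every message must be applied in order.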
Testing, Monitoring, and Alerting
- How you would test (unit, property, integration, chaos, load/soak) and what to validate.
- Key metrics, logs, traces, dashboards, and alert thresholds.