Discuss productionizing event-timeout detector

This is a System Design interview question from Applied Intuition for Software Engineer roles.

Question

System Design: Production-Ready Event-Timeout Detector

You are designing a production service that detects when an event fails to occur within a specified duration (a timeout). For example, after a Start event for event_id=X, if a matching Stop or Heartbeat does not arrive within T, emit a Timeout for X.
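
To make the Start/Stop/Heartbeat example concrete, here is a minimal single-process sketch of the detection rule. It assumes processing time supplied by the caller as `now`, one fixed timeout T per detector, and a min-heap with lazy invalidation. The class and method names (`TimeoutDetector`, `start`, `heartbeat`, `stop`, `poll`) are illustrative and not part of the question.

```python
import heapq
from typing import Dict, List, Tuple


class TimeoutDetector:
    """Single-process sketch: one deadline per event_id, lazily invalidated heap."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.deadlines: Dict[str, float] = {}    # event_id -> current deadline
        self.heap: List[Tuple[float, str]] = []  # (deadline, event_id); may hold stale entries

    def start(self, event_id: str, now: float) -> None:
        self._arm(event_id, now)

    def heartbeat(self, event_id: str, now: float) -> None:
        # Heartbeats for unknown or already-finished events are ignored.
        if event_id in self.deadlines:
            self._arm(event_id, now)

    def stop(self, event_id: str) -> None:
        # The heap entry stays behind and is skipped when it is popped.
        self.deadlines.pop(event_id, None)

    def _arm(self, event_id: str, now: float) -> None:
        deadline = now + self.timeout_s
        self.deadlines[event_id] = deadline
        heapq.heappush(self.heap, (deadline, event_id))

    def poll(self, now: float) -> List[str]:
        """Return event_ids whose deadline has passed, each at most once."""
        fired = []
        while self.heap and self.heap[0][0] <= now:
            deadline, event_id = heapq.heappop(self.heap)
            if self.deadlines.get(event_id) == deadline:  # skip stale entries
                del self.deadlines[event_id]
                fired.append(event_id)
        return fired
```

A production design would still need durable state, partitioning, and a delivery guarantee for the emitted Timeout. This sketch only pins down the semantics being asked about.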

Define and evaluate a production-ready architecture that addresses the following:

External API and Data Model

Specify request/response APIs to:
1. Create/schedule a timeout for an event_id
2. Cancel or complete the event
3. Extend or heartbeat the timeout
4. Query the current status
5. Receive timeout notifications (push or pull)
Provide data schemas, including idempotency keys and dedup strategy.
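
One possible shape for the data model and dedup strategy is sketched below, assuming an in-memory store and idempotency keys scoped per event_id. All names here (`TimerRecord`, `TimerStore`, `Status`, the field names) are hypothetical. A real service would persist both the record and the applied keys, and expire old keys.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set, Tuple


class Status(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    TIMED_OUT = "TIMED_OUT"


@dataclass
class TimerRecord:
    event_id: str
    deadline_ms: int
    status: Status = Status.PENDING
    version: int = 0  # bumped on every accepted mutation


class TimerStore:
    def __init__(self) -> None:
        self.timers: Dict[str, TimerRecord] = {}
        self.applied: Set[Tuple[str, str]] = set()  # (event_id, idempotency_key)

    def create(self, event_id: str, now_ms: int, timeout_ms: int, key: str) -> TimerRecord:
        if (event_id, key) not in self.applied:
            self.applied.add((event_id, key))
            # First writer wins: a second create for an existing event_id is a no-op.
            self.timers.setdefault(event_id, TimerRecord(event_id, now_ms + timeout_ms))
        return self.timers[event_id]

    def extend(self, event_id: str, now_ms: int, timeout_ms: int, key: str) -> TimerRecord:
        rec = self.timers[event_id]
        if (event_id, key) not in self.applied and rec.status is Status.PENDING:
            self.applied.add((event_id, key))
            rec.deadline_ms = now_ms + timeout_ms
            rec.version += 1
        return rec

    def complete(self, event_id: str, key: str) -> TimerRecord:
        rec = self.timers[event_id]
        if (event_id, key) not in self.applied and rec.status is Status.PENDING:
            self.applied.add((event_id, key))
            rec.status = Status.COMPLETED
            rec.version += 1
        return rec
```

A retried request with the same key returns the current record without re-applying the mutation, so client retries are safe.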

Time Semantics and Clock Skew

Define whether timeouts are evaluated in event time or processing time, and the rationale.
Describe watermarking/allowed lateness if you choose event time.
Explain how to handle client/server clock skew, including guardrails.
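
As one example of a skew guardrail, the server can accept a client timestamp only when it falls within a tolerance of the server's receive time, and otherwise clamp it and flag the request. The function name and the 5-second default below are assumptions chosen for illustration.

```python
from typing import Tuple


def effective_start_ms(
    client_ts_ms: int, server_recv_ms: int, max_skew_ms: int = 5000
) -> Tuple[int, bool]:
    """Return (timestamp to use, whether it was clamped).

    A client timestamp within max_skew_ms of the server receive time is trusted.
    Anything further away is clamped to the edge of the tolerance window.
    """
    skew = client_ts_ms - server_recv_ms
    if abs(skew) <= max_skew_ms:
        return client_ts_ms, False
    if skew > 0:
        return server_recv_ms + max_skew_ms, True  # client clock runs ahead
    return server_recv_ms - max_skew_ms, True      # client clock runs behind


print(effective_start_ms(1_000_000, 1_002_000))  # within tolerance, used as-is
print(effective_start_ms(1_060_000, 1_000_000))  # one minute ahead, clamped
```

An alternative that avoids the problem is to have clients send relative durations only and anchor every deadline on server time.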

Partitioning, State, Consistency, Recovery

Sharding strategy by event_id, partition ownership, and rebalancing.
State storage choice (in-memory vs local persistent vs remote KV) and what is stored.
Consistency model for reads (linearizable vs eventual) and for emitted timeouts.
Failure recovery: restarts, replays, snapshots, and exactly-once/at-least-once guarantees.
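
For sharding by event_id, a stable hash is needed so that every node maps the same event_id to the same partition. The sketch below assumes a fixed partition count and uses SHA-256 from the standard library. `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(event_id: str, num_partitions: int) -> int:
    """Map an event_id to a partition, identically on every process and restart."""
    # Python's built-in hash() is randomized per process for strings,
    # so a cryptographic or other fixed hash is used instead.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Plain modulo remaps most keys when `num_partitions` changes. Consistent hashing, or a fixed number of virtual partitions that are reassigned between nodes, limits how much state moves during rebalancing.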

Performance and Resilience

Data structures for timers (timer wheels vs min-heap vs multi-level wheels), batching, and LRU/near-term caching.
Backpressure strategy across API and ingestion pipeline.
Memory limits and spill-to-disk strategy.
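
To illustrate the timer-wheel option, here is a minimal single-level hashed wheel, assuming the caller drives it with one `advance()` call per tick. Timers longer than one revolution carry a remaining-rounds counter. The names are illustrative, and a multi-level wheel would cascade entries between wheels instead of counting rounds.

```python
from typing import Dict, List


class TimerWheel:
    def __init__(self, tick_ms: int, slots: int) -> None:
        self.tick_ms = tick_ms
        # Each slot maps event_id -> remaining full revolutions before firing.
        self.slots: List[Dict[str, int]] = [dict() for _ in range(slots)]
        self.location: Dict[str, int] = {}  # event_id -> slot index, for O(1) cancel
        self.current_tick = 0

    def schedule(self, event_id: str, delay_ms: int) -> None:
        self.cancel(event_id)  # rescheduling replaces any existing timer
        ticks = max(1, -(-delay_ms // self.tick_ms))  # ceiling division, at least one tick
        slot = (self.current_tick + ticks) % len(self.slots)
        self.slots[slot][event_id] = (ticks - 1) // len(self.slots)
        self.location[event_id] = slot

    def cancel(self, event_id: str) -> None:
        slot = self.location.pop(event_id, None)
        if slot is not None:
            del self.slots[slot][event_id]

    def advance(self) -> List[str]:
        """Move one tick forward and return the event_ids that expire on it."""
        self.current_tick += 1
        bucket = self.slots[self.current_tick % len(self.slots)]
        fired = [e for e, rounds in bucket.items() if rounds == 0]
        for e in fired:
            del bucket[e]
            del self.location[e]
        for e in bucket:
            bucket[e] -= 1  # survivors wait one more revolution
        return fired
```

Schedule and cancel are O(1), at the cost of deadline precision limited to `tick_ms`. A min-heap gives exact ordering with O(log n) inserts.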

Production Edge Cases

Late, dropped, duplicated, and reordered messages
Very large timeouts (hours/days), long silences/inactivity
Service restarts and partition moves
Idempotency and side-effect delivery (e.g., notifications)

Testing, Monitoring, and Alerting

How you would test (unit, property, integration, chaos, load/soak) and what to validate.
Key metrics, logs, traces, dashboards, and alert thresholds.