Design a Delayed Job Scheduler (LLD)
Design a service that schedules a job to execute X seconds in the future with second-level accuracy. Produce a low-level design that covers APIs, data structures, persistence, execution model, and operational concerns.
Assume:
- Second-level precision is sufficient.
- The system must survive process restarts without losing scheduled jobs.
- At-least-once delivery is acceptable by default; discuss options for exactly-once.
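Under the at-least-once assumption, a handler may receive the same job more than once (for example after a lease expires and the job is redelivered). One common way to tolerate this is to deduplicate on the job ID. The sketch below is illustrative only: `IdempotentExecutor` and `deliver` are hypothetical names, and the in-memory dict stands in for what would need to be a durable store.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Hypothetical sketch: run a handler at most once per job ID.

    Assumption: the dict below stands in for a durable dedup store.
    """

    def __init__(self) -> None:
        # job_id -> result of the first successful execution
        self._results: Dict[str, Any] = {}

    def deliver(self, job_id: str, handler: Callable[[], Any]) -> Any:
        # A duplicate delivery returns the recorded result without re-running.
        if job_id in self._results:
            return self._results[job_id]
        result = handler()
        # If the process crashed between handler() and this write, the handler
        # would run again on redelivery; see the note after the block.
        self._results[job_id] = result
        return result


calls = []
executor = IdempotentExecutor()
first = executor.deliver("job-1", lambda: calls.append("ran") or "done")
second = executor.deliver("job-1", lambda: calls.append("ran") or "done")
```

This gives effectively-once behavior only if the side effect and the dedup record are committed atomically (for instance in the same database transaction). Otherwise it narrows, but does not close, the duplicate window.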
Requirements
- APIs
  - schedule(job, delaySeconds) -> jobId
  - scheduleAt(job, epochMillis) -> jobId
  - cancel(jobId) -> success/failure
  - getStatus(jobId) -> job metadata
  - Optional: reschedule(jobId, newDelaySeconds)
- Data structures: propose and justify (e.g., min-heap by due time, timing wheel, or hybrid) including time/space complexity.
- Persistence and durability: how jobs are durably stored and recovered after restarts/crashes.
- Worker execution model: how jobs are dispatched and executed; leasing/ack model.
- Restarts and clock drift: recovery logic and timekeeping choices.
- Delivery semantics: idempotency, at-least-once vs exactly-once trade-offs.
- Concurrency limits: global and per-tenant limits; fairness.
- Time complexity: for schedule, cancel, due job retrieval.
- Class diagrams: main classes, relationships, and key methods.
- Testing: strategy across unit, integration, reliability, and performance.
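As one possible starting point for the data-structure and API requirements above, here is a minimal in-memory sketch of the min-heap option with lazy cancellation. It is not a complete answer: persistence, leasing, and per-tenant limits are omitted. The class and method names (`InMemoryDelayedScheduler`, `pop_due`, `JobRecord`) are hypothetical, and the injectable `clock` is an assumption made so the timing logic can be tested deterministically.

```python
import heapq
import itertools
import math
import time
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Optional, Tuple


class JobStatus(Enum):
    SCHEDULED = "scheduled"
    CANCELLED = "cancelled"
    DISPATCHED = "dispatched"


@dataclass
class JobRecord:
    job_id: str
    payload: Any
    due_epoch_s: int
    status: JobStatus = JobStatus.SCHEDULED


class InMemoryDelayedScheduler:
    """Hypothetical sketch: min-heap keyed by due time, lazy cancellation."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._heap: List[Tuple[int, int, str]] = []  # (due, seq, job_id)
        self._jobs: Dict[str, JobRecord] = {}
        self._seq = itertools.count()  # tie-breaker: FIFO among equal due times

    def schedule(self, job: Any, delay_seconds: float) -> str:
        # Round up so a job never fires before its requested delay has elapsed.
        return self._enqueue(job, math.ceil(self._clock() + delay_seconds))

    def schedule_at(self, job: Any, epoch_millis: int) -> str:
        return self._enqueue(job, math.ceil(epoch_millis / 1000))

    def _enqueue(self, job: Any, due_epoch_s: int) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = JobRecord(job_id, job, due_epoch_s)
        heapq.heappush(self._heap, (due_epoch_s, next(self._seq), job_id))  # O(log n)
        return job_id

    def cancel(self, job_id: str) -> bool:
        # O(1): mark the record; the stale heap entry is skipped when popped.
        record = self._jobs.get(job_id)
        if record is None or record.status is not JobStatus.SCHEDULED:
            return False
        record.status = JobStatus.CANCELLED
        return True

    def get_status(self, job_id: str) -> Optional[JobRecord]:
        return self._jobs.get(job_id)

    def pop_due(self) -> List[JobRecord]:
        # Pops every entry whose due time has passed; O(log n) per popped entry.
        now = self._clock()
        due: List[JobRecord] = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job_id = heapq.heappop(self._heap)
            record = self._jobs[job_id]
            if record.status is JobStatus.SCHEDULED:  # skip cancelled entries
                record.status = JobStatus.DISPATCHED
                due.append(record)
        return due
```

With this layout, schedule is O(log n), cancel is O(1) at the cost of cancelled entries lingering in the heap until their due time, and due-job retrieval is O(k log n) for k popped entries. A timing wheel would trade this for O(1) insertion at the cost of a bounded horizon or a hierarchical structure.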