Design a job scheduler for a small startup, then explain how you would evolve it to support roughly 100x more scheduled tasks.
The system should:
- Allow users or internal services to create one-time and recurring jobs.
- Run jobs at the scheduled time with reasonable accuracy.
- Support retries for failed jobs.
- Support cancellation and rescheduling.
- Track job states such as pending, dispatched, running, succeeded, and failed.
- Provide basic observability for operators.
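
To make the state-tracking requirement concrete, the sketch below shows one possible shape for the job lifecycle. It is only an illustration under stated assumptions: the `cancelled` state, the `ALLOWED` transition table, and the `transition` helper are hypothetical and are not part of the requirements above.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    DISPATCHED = "dispatched"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"  # assumption: not named in the requirements


# Assumed transition table; a real design may allow different edges.
ALLOWED = {
    # A pending job is handed to a worker, or cancelled before it runs.
    JobState.PENDING: {JobState.DISPATCHED, JobState.CANCELLED},
    # A dispatched job starts running, returns to pending if the worker
    # never picks it up, or is cancelled.
    JobState.DISPATCHED: {JobState.RUNNING, JobState.PENDING, JobState.CANCELLED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    # A failed job goes back to pending when it is retried.
    JobState.FAILED: {JobState.PENDING},
    # Terminal states have no outgoing transitions.
    JobState.SUCCEEDED: set(),
    JobState.CANCELLED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Under these assumptions, a retry is modeled as `failed -> pending`, and rescheduling can be treated as an update to a pending job's run time rather than as a separate state.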
Discuss:
- API design and core data model.
- How jobs are stored and selected for execution.
- How workers execute jobs safely.
- How to handle duplicate execution, retries, and idempotency.
- Failure handling when the scheduler node or worker crashes.
- How the initial design for a low-scale startup would work.
- What architectural changes you would make to scale to 100x more jobs.
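
As a possible starting point for the selection and crash-handling topics, the sketch below claims due jobs from an in-memory store using a time-limited lease. It is a minimal illustration, not a recommended design: the `Job` fields, the `claim_due_jobs` function, and the 30-second default lease are all assumptions, and a real system would perform the claim atomically in a durable store.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    run_at: float                      # epoch seconds when the job becomes due
    state: str = "pending"
    attempts: int = 0                  # number of times the job was claimed
    lease_expires_at: Optional[float] = None


def claim_due_jobs(
    jobs: Dict[str, Job],
    now: float,
    lease_seconds: float = 30.0,       # assumed default, tune per workload
    limit: int = 10,
) -> List[Job]:
    """Claim up to `limit` jobs that are due or whose lease has expired."""
    claimed: List[Job] = []
    # Earliest run_at first, so the most overdue jobs are claimed first.
    for job in sorted(jobs.values(), key=lambda j: j.run_at):
        if len(claimed) >= limit:
            break
        due = job.state == "pending" and job.run_at <= now
        # An expired lease suggests the worker crashed or stalled.
        lease_expired = (
            job.state in ("dispatched", "running")
            and job.lease_expires_at is not None
            and job.lease_expires_at <= now
        )
        if due or lease_expired:
            job.state = "dispatched"
            job.attempts += 1
            job.lease_expires_at = now + lease_seconds
            claimed.append(job)
    return claimed


# Usage: job "a" is due at t=20, job "b" is not.
jobs = {"a": Job("a", run_at=10.0), "b": Job("b", run_at=100.0)}
first = claim_due_jobs(jobs, now=20.0)
```

Because an expired lease makes a job claimable again, the same job may run more than once if the original worker was only slow rather than dead, which is why the idempotency question above matters.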