Design a distributed job scheduler system that can run background jobs at specific times or on recurring schedules (similar to cron but scalable and fault-tolerant).
The system should support:
- One-time jobs scheduled to run at a specific timestamp.
- Recurring jobs (e.g., "run every 5 minutes", "run every day at 1 AM").
- Reliable execution so that each job runs at least once, and preferably exactly once where possible.
- Horizontal scalability and high availability.
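To make the two job types concrete, here is a minimal sketch of a job record and a next-run calculation. The names `JobDefinition`, `run_at`, `interval`, and `next_run_after` are hypothetical and are not part of the problem statement. The sketch assumes all times are in UTC and that a recurring schedule can be expressed as a fixed interval, so "run every day at 1 AM" becomes a first run at 01:00 UTC with a one-day interval. Cron expressions, time zones, and daylight saving time are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class JobDefinition:
    """Hypothetical job record; field names are illustrative only."""
    job_id: str
    run_at: datetime                      # first (or only) scheduled time, in UTC
    interval: Optional[timedelta] = None  # None means a one-time job


def next_run_after(job: JobDefinition, now: datetime) -> Optional[datetime]:
    """Return the first scheduled time strictly after `now`, or None if none remains."""
    if job.run_at > now:
        return job.run_at
    if job.interval is None:
        return None  # one-time job whose timestamp has already passed
    # Skip the whole intervals that have elapsed since the first run.
    periods = (now - job.run_at) // job.interval + 1
    return job.run_at + periods * job.interval


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
every_five = JobDefinition("cleanup", run_at=start, interval=timedelta(minutes=5))
one_time = JobDefinition("report", run_at=start)
```

Computing the next run from the original start time, rather than from the last actual execution, is one way to keep a recurring schedule from drifting when a run starts late.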
Assume clients (internal services or users) can:
- Create, update, and delete jobs.
- Query job status and execution history.
Design the system end-to-end. Cover:
- High-level architecture and main components.
- How you store job definitions and schedules.
- How scheduling (deciding when a job should run) is done in a distributed setting.
- How workers pick up and execute jobs.
- How to ensure fault tolerance and avoid duplicate executions as much as possible.
- How to scale the system as the number of jobs and execution frequency grow.
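One possible way to approach the duplicate-execution point is a time-limited lease keyed by job and scheduled time. The sketch below is an in-memory stand-in, not a prescribed design: `LeaseTable`, `try_claim`, and `mark_done` are hypothetical names, and a real deployment would need a shared store that supports atomic conditional writes in place of the local lock. It gives at-least-once execution, since a job whose lease expires mid-run can be claimed again, so job handlers would still need to be idempotent.

```python
import threading
from typing import Dict, Set, Tuple

# (job_id, scheduled time as an ISO-8601 string) identifies one execution.
ExecutionKey = Tuple[str, str]


class LeaseTable:
    """In-memory stand-in for a shared store with atomic conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._leases: Dict[ExecutionKey, Tuple[str, float]] = {}  # key -> (worker, expiry)
        self._done: Set[ExecutionKey] = set()

    def try_claim(self, key: ExecutionKey, worker_id: str, now: float, ttl: float) -> bool:
        """Claim an execution unless it is finished or leased to another live worker."""
        with self._lock:
            if key in self._done:
                return False
            holder = self._leases.get(key)
            if holder is not None and holder[0] != worker_id and holder[1] > now:
                return False  # another worker holds an unexpired lease
            self._leases[key] = (worker_id, now + ttl)
            return True

    def mark_done(self, key: ExecutionKey, worker_id: str) -> bool:
        """Record completion only if this worker still owns the lease."""
        with self._lock:
            holder = self._leases.get(key)
            if holder is None or holder[0] != worker_id:
                return False  # lease was taken over; completion is not recorded
            self._done.add(key)
            del self._leases[key]
            return True


table = LeaseTable()
key = ("job-1", "2024-01-01T01:00:00Z")
first = table.try_claim(key, "worker-a", now=0.0, ttl=30.0)      # worker-a wins
blocked = table.try_claim(key, "worker-b", now=10.0, ttl=30.0)   # lease still live
takeover = table.try_claim(key, "worker-b", now=31.0, ttl=30.0)  # lease expired
```

Passing `now` explicitly keeps the sketch deterministic. In practice the expiry would come from the store's clock, and a long-running worker would renew its lease by calling `try_claim` again before the TTL elapses.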