Design a scalable, fault-tolerant job scheduling system.
The system should allow clients to schedule background jobs (for example, sending emails or running batch computations) to be executed at specific future times, and possibly on a recurring basis.
Then, as a follow-up, design how to efficiently query jobs scheduled in the next N hours in order to power a near real-time dashboard that shows upcoming jobs.
Describe:
- Functional requirements (e.g., create/update/cancel jobs, one-time vs recurring jobs, job execution guarantees, etc.).
- Non-functional requirements (e.g., scale, latency, reliability, availability, consistency expectations).
- A high-level architecture for the job scheduler:
  - Main components (API layer, scheduler, workers, storage, queues, etc.).
  - How jobs are stored, assigned to workers, and executed at (approximately) the right time.
  - How you ensure reliability (no job lost, minimal duplicates) and fault tolerance.
- A data model for jobs (what fields you store, how you index them).
- An API and storage/query design to efficiently fetch all jobs scheduled between now and now + N hours for a dashboard. Assume:
  - There can be a very large number of jobs.
  - The dashboard needs low-latency, high-QPS reads.
  - N might vary per request (e.g., 1 hour, 6 hours, 24 hours).
Explain your trade-offs, including:
- How you partition or index data to support both scheduling and time-window queries.
- How you would avoid full table scans when querying the next N hours.
- How you would handle hot partitions (e.g., many jobs around the same time) and horizontal scaling.
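
To make the partitioning trade-offs above concrete, here is a minimal sketch of one possible approach, not a prescribed answer: jobs are keyed by a coarse time bucket plus a hash-derived shard, so a "next N hours" query touches only the buckets that overlap the window, and a burst of jobs at the same time is spread across several shards. Everything here is an illustrative assumption: the names `Job`, `partition_key`, and `TimeBucketIndex`, the one-hour bucket width, the shard count, and the in-memory dictionary that stands in for a partitioned store.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass

BUCKET_SECONDS = 3600  # assumed bucket width: one hour
NUM_SHARDS = 4         # assumed shards per bucket, to spread hot time slots


@dataclass(frozen=True)
class Job:
    job_id: str
    run_at: int  # scheduled execution time, epoch seconds


def partition_key(job):
    """Return (time bucket, shard) for a job."""
    bucket = job.run_at // BUCKET_SECONDS
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    shard = zlib.crc32(job.job_id.encode()) % NUM_SHARDS
    return bucket, shard


class TimeBucketIndex:
    """In-memory stand-in for a store partitioned by (bucket, shard)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def add(self, job):
        self._partitions[partition_key(job)].append(job)

    def upcoming(self, now, hours):
        """Return jobs with now <= run_at < now + hours, ordered by run_at."""
        end = now + hours * 3600
        first_bucket = now // BUCKET_SECONDS
        last_bucket = end // BUCKET_SECONDS
        jobs = []
        # Only partitions overlapping the window are read: no full scan.
        for bucket in range(first_bucket, last_bucket + 1):
            for shard in range(NUM_SHARDS):
                for job in self._partitions.get((bucket, shard), ()):
                    # Edge buckets may hold jobs outside the window.
                    if now <= job.run_at < end:
                        jobs.append(job)
        return sorted(jobs, key=lambda j: (j.run_at, j.job_id))
```

With these assumptions, a query for N hours reads at most (N + 1) × `NUM_SHARDS` partitions regardless of the total number of jobs. The cost of sharding is read fan-out: more shards relieve hot buckets on the write path but multiply the partitions each dashboard query must merge.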