Design a Job Scheduling Service
You are designing a multi-tenant job scheduling service that runs one-off and recurring background jobs at scale. The service should expose APIs to manage jobs and reliably execute them via a distributed scheduler/worker architecture.
Assume:
- Jobs can be internal tasks or HTTP callbacks.
- At-least-once execution semantics are acceptable; idempotency is required for correctness.
- The system must support millions of scheduled jobs and high throughput dispatch.
Requirements
- Data model/schema for jobs (one-off and recurring), including retry policy, priority, time zone, and execution metadata.
- APIs to create, update, pause, resume, and cancel jobs.
- Execution architecture: scheduler, dispatcher, workers; leases and state transitions.
- Reliability: idempotency, retries/backoff, deduplication, timeouts, dead-lettering.
- Time handling: time zones, DST, and clock skew.
- Scaling/sharding and fairness across tenants/priorities.
- Monitoring and operability.
- Optimize for efficiently retrieving all jobs scheduled to run in the next five minutes: indexing/partitioning strategies, example queries, and handling high throughput.
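
For orientation on the last requirement, here is one possible shape of a "due in the next five minutes" lookup. It is a minimal sketch, not a reference answer. Everything in it is an assumption made for illustration: the table and column names (jobs, status, next_run_at), the status values, storing next_run_at as UTC epoch seconds, and the use of an in-memory SQLite database as a stand-in for whatever store a real design would choose. The idea shown is a composite index with the equality column (status) first and the time column second, so the lookup becomes a single index range scan.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: names and status values are illustrative assumptions.
SCHEMA = """
CREATE TABLE jobs (
    job_id      TEXT PRIMARY KEY,
    tenant_id   TEXT NOT NULL,
    status      TEXT NOT NULL,     -- e.g. 'scheduled', 'paused', 'running'
    priority    INTEGER NOT NULL,
    next_run_at INTEGER NOT NULL   -- next fire time, UTC epoch seconds
);
-- Equality column first, range column second.
CREATE INDEX idx_jobs_due ON jobs (status, next_run_at);
"""

DUE_QUERY = """
SELECT job_id, tenant_id, priority, next_run_at
FROM jobs
WHERE status = 'scheduled'
  AND next_run_at >= ?
  AND next_run_at < ?
ORDER BY next_run_at
LIMIT ?
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def due_jobs(conn, now, window=timedelta(minutes=5), limit=1000):
    """Return scheduled jobs with now <= next_run_at < now + window.

    `now` should be a timezone-aware datetime so the epoch conversion
    does not depend on the local time zone of the machine.
    """
    start = int(now.timestamp())
    end = int((now + window).timestamp())
    return conn.execute(DUE_QUERY, (start, end, limit)).fetchall()


if __name__ == "__main__":
    conn = open_db()
    now = datetime.now(timezone.utc)
    soon = int((now + timedelta(minutes=2)).timestamp())
    later = int((now + timedelta(minutes=30)).timestamp())
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            ("job-1", "tenant-a", "scheduled", 1, soon),
            ("job-2", "tenant-a", "scheduled", 1, later),
            ("job-3", "tenant-b", "paused", 1, soon),
        ],
    )
    for row in due_jobs(conn, now):
        print(row)
```

The LIMIT keeps each poll bounded under high throughput. The sketch leaves out overdue jobs (next_run_at earlier than now), leases, and partitioning by tenant or time bucket; a complete answer would need to address each of these.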