Design a distributed job scheduling system for internal engineering teams. The system should support:
- one-time jobs scheduled for a future timestamp,
- recurring jobs with cron-like schedules,
- job priority,
- retries with exponential backoff,
- pause, resume, and cancel operations,
- job status queries,
- worker heartbeats and failure recovery,
- execution logs and audit history.
Assume the system must handle millions of jobs per day, trigger near-term jobs with low latency, scale horizontally, and remain highly reliable. Explain the APIs, the data model, how due jobs are selected and dispatched, how workers safely claim jobs, how to recover from crashes without losing jobs, and how to store and search logs without overloading the primary transactional database.
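
As a point of reference for the "retries with exponential backoff" requirement, one possible reading is sketched below. It is a minimal sketch, not a prescribed design: the function name `backoff_delay`, the 2-second base, the 300-second cap, and the use of full jitter are all illustrative assumptions that the requirements above do not specify.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  jitter: bool = True) -> float:
    """Return the seconds to wait before retry number `attempt` (1-based).

    All parameter names and defaults are hypothetical; they are chosen only
    to illustrate the idea.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # The delay doubles with each attempt and is clamped to `cap`.
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Full jitter: pick a random point in [0, delay] so that many
        # failing jobs do not all retry at the same instant.
        return random.uniform(0, delay)
    return delay


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, backoff_delay(n, jitter=False))
```

Under these assumptions, a scheduler would store `now + backoff_delay(attempt)` as the job's next run time after a failed attempt, and stop retrying once a configured maximum number of attempts is reached.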