Scenario
Design a distributed job scheduler that runs scheduled (time-based) jobs only (e.g., cron/interval). Each job execution has an SLA: if the run does not finish within the SLA window, the system must mark it as timed out and raise an error/alert. Users must be able to inspect execution logs.
Core requirements
- Create/update/delete scheduled jobs.
- Trigger job runs at the correct times (support cron-like schedules or fixed intervals).
- Execute jobs on a fleet of workers.
- Track job run state: PENDING/RUNNING/SUCCEEDED/FAILED/TIMED_OUT/CANCELED.
- SLA enforcement per run: detect overruns and surface them as an error/alert.
- Provide log viewing per job run.
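As a starting point for discussion, the run states and per-run SLA check could be sketched as below. This is a minimal sketch, not a prescribed design: the `JobRun` class, the `ALLOWED` transition table, and `enforce_sla` are hypothetical names, and it assumes the SLA window starts at the scheduled fire time, so a run stuck in PENDING can also time out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class RunState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELED = "CANCELED"


# Assumed transition table; terminal states have no outgoing edges.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING, RunState.CANCELED, RunState.TIMED_OUT},
    RunState.RUNNING: {
        RunState.SUCCEEDED,
        RunState.FAILED,
        RunState.TIMED_OUT,
        RunState.CANCELED,
    },
}


@dataclass
class JobRun:
    job_id: str
    scheduled_at: datetime  # fire time computed by the scheduler (UTC)
    sla: timedelta          # assumption: measured from scheduled_at
    state: RunState = RunState.PENDING

    def transition(self, new_state: RunState) -> None:
        """Move to new_state, rejecting transitions out of terminal states."""
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state

    def deadline(self) -> datetime:
        return self.scheduled_at + self.sla

    def enforce_sla(self, now: datetime) -> bool:
        """Mark an unfinished run TIMED_OUT once now is past the deadline.

        Returns True only when this call performed the transition, so the
        caller can raise exactly one alert per run.
        """
        unfinished = self.state in (RunState.PENDING, RunState.RUNNING)
        if unfinished and now > self.deadline():
            self.transition(RunState.TIMED_OUT)
            return True
        return False


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    run = JobRun("nightly-report", start, timedelta(minutes=5))
    run.transition(RunState.RUNNING)
    if run.enforce_sla(start + timedelta(minutes=6)):
        print(f"ALERT: {run.job_id} exceeded its SLA")
```

In a real deployment the check would likely run in a periodic sweeper that queries unfinished runs by deadline, with the state change done as a conditional update in the database so that a late SUCCEEDED report from a worker cannot overwrite TIMED_OUT.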
Non-functional requirements (clarify in discussion)
- Scale: potentially many jobs and high execution rate.
- Reliability: tolerate worker crashes, scheduler crashes, network partitions.
- Correctness: avoid missed schedules; minimize duplicate runs (define acceptable semantics).
- Observability: metrics, tracing, auditing.
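One common way to make the duplicate-run semantics concrete is at-least-once triggering combined with a uniqueness constraint on (job, scheduled fire time). The sketch below illustrates that idea with an in-memory SQLite table. The table name `job_runs`, its columns, and the helper names `open_store` and `try_create_run` are assumptions for illustration; a production system would use its own durable store.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one row per (job, scheduled fire time)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE job_runs (
            job_id       TEXT NOT NULL,
            scheduled_at TEXT NOT NULL,  -- ISO-8601 UTC fire time
            state        TEXT NOT NULL DEFAULT 'PENDING',
            PRIMARY KEY (job_id, scheduled_at)
        )
        """
    )
    return conn


def try_create_run(conn: sqlite3.Connection, job_id: str, scheduled_at: str) -> bool:
    """Insert the run if absent.

    Returns True if this caller created the run and should dispatch it,
    False if another scheduler instance had already created it.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO job_runs (job_id, scheduled_at) VALUES (?, ?)",
        (job_id, scheduled_at),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    store = open_store()
    # Two scheduler replicas race to fire the same tick; only one wins.
    print(try_create_run(store, "nightly-report", "2024-01-01T00:00:00Z"))
    print(try_create_run(store, "nightly-report", "2024-01-01T00:00:00Z"))
```

With this shape, a scheduler that restarts and re-scans recent fire times cannot create a second run for the same tick, which trades a small amount of write contention for protection against both missed and duplicated schedules.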
Follow-ups to address
- How to handle failures at each stage (scheduler failure, worker failure, log pipeline failure, DB outages).
- Retries/backoff, dead-lettering, and idempotency.
- Handling long-running jobs and cancellation.
- Handling clock skew, time zones, and daylight saving time (DST).
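For the retries/backoff and dead-lettering follow-up, one possible policy is capped exponential backoff with full jitter, handing the run to a dead-letter queue once a maximum attempt count is reached. The sketch below is only an illustration of that policy: the function name `next_retry_delay` and the default values for `base`, `cap`, and `max_attempts` are assumptions, not requirements from the prompt.

```python
import random
from typing import Optional


def next_retry_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    max_attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Return seconds to wait before the next retry, or None to dead-letter.

    attempt is the 1-based number of attempts that have already failed.
    The delay is drawn uniformly from [0, min(cap, base * 2**(attempt - 1))]
    ("full jitter"), which spreads retries out so that many failed runs do
    not all hit a recovering dependency at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt >= max_attempts:
        return None  # retries exhausted: caller should dead-letter the run
    generator = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return generator.uniform(0.0, ceiling)


if __name__ == "__main__":
    for failed in range(1, 7):
        delay = next_retry_delay(failed)
        if delay is None:
            print(f"attempt {failed}: send to dead-letter queue")
        else:
            print(f"attempt {failed}: retry in {delay:.2f}s")
```

Retries are only safe if the job itself is idempotent or the worker passes a stable idempotency key (for example, job ID plus scheduled fire time) to downstream systems, so that a retried run does not repeat side effects.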