System Design: Distributed Job Scheduler
Context
Design a horizontally scalable, multi-tenant distributed job scheduler that supports cron-based recurring jobs and ad‑hoc one-off jobs. The system must enqueue tasks, dispatch them to workers, and provide at‑least‑once execution. Jobs must be idempotent, support priorities and dependencies, and run reliably across failures and regions.
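As a rough illustration of why at‑least‑once execution pairs with idempotent jobs, the sketch below records a result per idempotency key so that a redelivered task does not repeat its side effect. The names `IdempotentExecutor`, `execute`, and the key format are hypothetical, and the in-memory dict stands in for whatever durable store a real design would choose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentExecutor:
    """Skips a handler whose idempotency key already has a recorded result."""

    # Assumption: an in-memory dict; a real scheduler would need durable storage.
    results: Dict[str, str] = field(default_factory=dict)

    def execute(self, idempotency_key: str, handler: Callable[[], str]) -> str:
        # A redelivered task finds the stored result and does not rerun the handler.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        result = handler()
        # Note: the check and the write are not atomic here; a crash between
        # handler() and this line would still cause a second execution.
        self.results[idempotency_key] = result
        return result


calls = []


def send_invoice() -> str:
    calls.append("sent")  # stands in for an external side effect
    return "invoice-1"


executor = IdempotentExecutor()
first = executor.execute("tenant-a:job-42:run-7", send_invoice)
second = executor.execute("tenant-a:job-42:run-7", send_invoice)  # simulated redelivery
```

The second call returns the stored result without invoking `send_invoice` again, which is the behavior the scheduler relies on when a task is delivered more than once.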
Requirements
- Architecture
  - Coordinator/scheduler, durable queue, workers
  - Leader election and shard ownership
  - Time source, timezone handling, cron expression parsing
  - Job metadata storage and persistence
- Execution semantics
  - At‑least‑once execution with idempotency guarantees
  - Retries with exponential backoff (and jitter)
  - Priorities and fair multi-tenant scheduling
  - Deduplication and idempotency keys
  - Task dependencies (DAG) and run orchestration
- Operations and reliability
  - Observability: metrics, logs, traces; dashboards and alerts
  - Failure recovery for coordinator, queue, worker, and storage failures
  - Horizontal scalability and partitioning
  - Multi-tenant isolation (quotas, RBAC)
  - Multi‑region operation (failover/active-active) and time consistency
- Interfaces
  - Define public APIs (create/update/pause/resume/delete jobs, trigger ad‑hoc runs, list runs)
  - Define worker protocol (poll, lease/heartbeat, ack/nack, extend visibility)
  - Data model for Jobs, Runs, Tasks, Attempts, Leases, Tenants
- Flows
  - End-to-end flows for cron scheduling, ad‑hoc triggering, dispatch, retry, completion, and status tracking.
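
One possible shape for the dispatch, retry, and completion flow is sketched below, combining the poll, lease/heartbeat, and ack/nack operations of the worker protocol with retries that use exponential backoff and jitter. This is a minimal in-memory sketch, not a prescribed design: `LeaseQueue`, `Task`, `backoff_delay`, the state names, and all default values are hypothetical, and time is passed in explicitly as `now` so the behavior is easy to follow.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


@dataclass
class Task:
    task_id: str
    attempts: int = 0
    visible_at: float = 0.0          # earliest time the task may be leased
    lease_owner: Optional[str] = None
    state: str = "pending"           # pending | succeeded | dead


class LeaseQueue:
    """In-memory stand-in for a durable queue with visibility timeouts."""

    def __init__(self, lease_seconds: float = 30.0, max_attempts: int = 5):
        self.tasks: Dict[str, Task] = {}
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts

    def enqueue(self, task_id: str, now: float) -> None:
        self.tasks[task_id] = Task(task_id, visible_at=now)

    def poll(self, worker: str, now: float) -> Optional[Task]:
        # A task whose lease expired becomes visible again, so a crashed
        # worker's task is redelivered (at-least-once delivery).
        for task in self.tasks.values():
            if task.state == "pending" and task.visible_at <= now:
                task.lease_owner = worker
                task.attempts += 1
                task.visible_at = now + self.lease_seconds
                return task
        return None

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        # Extends visibility only while the caller still holds a live lease.
        task = self.tasks[task_id]
        if task.state != "pending" or task.lease_owner != worker or task.visible_at <= now:
            return False
        task.visible_at = now + self.lease_seconds
        return True

    def ack(self, task_id: str, worker: str) -> bool:
        task = self.tasks[task_id]
        if task.lease_owner != worker:
            return False  # lease was taken over by another worker
        task.state = "succeeded"
        return True

    def nack(self, task_id: str, worker: str, now: float) -> bool:
        task = self.tasks[task_id]
        if task.lease_owner != worker:
            return False
        task.lease_owner = None
        if task.attempts >= self.max_attempts:
            task.state = "dead"  # a real system might move it to a dead-letter queue
        else:
            task.visible_at = now + backoff_delay(task.attempts)
        return True
```

In this sketch a worker that stops heartbeating simply loses its lease once `visible_at` passes, and a later `ack` from it is rejected if another worker has since taken the task. Priorities, tenant fairness, and persistence are deliberately left out.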