System Design: Distributed Cron Job Scheduler
Context
You are designing a multi-tenant, internet-scale cron job scheduler that triggers tasks according to cron expressions across different time zones. The system must support millions of scheduled jobs, be highly available, and ensure reliable dispatch and execution across a fleet of workers.
Functional Requirements
-
Cron expressions
-
Support standard cron syntax and IANA time zones, including daylight saving time (DST) correctness.
-
Execution semantics
-
Exactly-once or at-least-once delivery semantics (explain trade-offs).
-
Idempotency and deduplication across retries and failovers.
-
Misfire handling and catch-up policies (e.g., skip, fire-latest, fire-all, bounded catch-up).
-
Retries with backoff and jitter.
-
Pause/resume jobs; optional start/end time windows.
-
Job dependencies (e.g., run B after A completes; DAG support is a plus).
-
Worker lifecycle
-
Worker registration, capability discovery, and heartbeats.
-
Safe re-assignment on worker failure.
-
Multi-tenancy
-
Tenant isolation, quotas, priorities, and fair sharing.
-
Security and governance
-
Authentication and authorization (RBAC or ABAC).
-
Auditing of API changes and executions.
-
Observability
-
Metrics, logs, traces, and alerting for SLOs and failure detection.
Non-Functional Requirements
-
High availability with leader election and failover (no single point of failure).
-
Horizontal scalability to millions of jobs and tens of thousands of triggers per second.
-
Low and predictable scheduling latency (e.g., p99 < a few seconds) and bounded scheduler lag.
-
Strong durability for job metadata and execution records.
What to Deliver
Describe the following:
-
Architecture and components (metadata store, scheduler shards, dispatcher, workers, timing wheels or hierarchical timers, message bus, coordination service).
-
Data models (jobs, schedules, executions, attempts, dependencies, workers, tenants, quotas, audit logs).
-
Scheduling and dispatch algorithms (cron parsing, time zone/DST handling, timer data structures, sharding strategy).
-
Consistency choices and delivery semantics (exactly-once vs at-least-once, idempotency, dedup).
-
Failure modes and mitigations (clock skew, network partition, leader/scheduler/worker crashes).
-
Capacity planning (orders of magnitude, partitioning, indexes, throughput estimates).
-
Testing strategies (unit, property, integration, chaos, time-travel, scale testing).