Design a distributed job scheduler service
Company: Amazon
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Onsite
##### Question
Design a horizontally scalable, multi-tenant job scheduler service that supports both one-off / ad-hoc jobs and recurring (cron and fixed-rate) jobs. It enqueues tasks, dispatches them to workers, and provides at-least-once execution with idempotency. Address the following:
1. **Requirements & assumptions.** Functional (cron, ad-hoc, fixed-rate; pause/resume/cancel; priorities; multi-tenant isolation; multi-region) and non-functional (availability, scale to millions of jobs and 10^5+ tasks/min, low scheduler skew, durability).
2. **High-level architecture.** Core components: ingress/API, scheduler/coordinator, durable queue/broker, stateless workers, persistence, and timers. Show how a request flows from creation to execution.
3. **Data model / database schema.** Tables for jobs, schedules, executions/runs, workers, and dead-letters — e.g. id, tenant_id, payload, schedule type/cron/timezone, next_run_at, status, priority, retry policy, dedupe key, shard key, lease fields, version, updated_at. Specify the indexes and partitioning strategy.
4. **Public APIs.** Create, update, pause, resume, cancel, and query jobs; list upcoming jobs; request an immediate run; query executions. Include idempotency keys and optimistic-concurrency (ETag/version) semantics.
5. **Leader election & sharding.** How coordinators partition the job space, elect a per-shard leader (leases + fencing tokens), and avoid split-brain or double-enqueue.
6. **Time source & cron parsing.** UTC internally, IANA timezone per job, DST handling (spring-forward / fall-back policy), and catch-up / backfill guardrails.
7. **Efficiently finding all jobs due in the next 5 minutes at scale.** Indexing on time-ordered keys, range scans with keyset pagination, in-memory min-heaps / hierarchical timing wheels, optional Redis ZSET or per-minute "ready buckets," and contention control (FOR UPDATE SKIP LOCKED, optimistic version CAS).
8. **Reliability.** At-least-once delivery with idempotent handlers, retries with exponential backoff + jitter, deduplication (idempotency keys + dedupe windows), visibility timeouts / leases with heartbeats, and dead-lettering of poison tasks.
9. **Task dependencies (DAG).** Optionally, jobs composed of dependent tasks: track in-degree per run, enqueue dependents as predecessors succeed, and define a failure policy (fail-fast vs continue).
10. **Scaling, consistency, fairness.** Horizontal scaling of each tier; at-least-once vs exactly-once trade-offs and effective-exactly-once patterns; per-tenant quotas, weighted fairness, and starvation avoidance.
11. **Multi-region operation & clock skew.** Region-local execution with shard ownership and failover; clock-skew mitigation (NTP/PTP, monotonic clocks, epsilon windows, generous lease margins).
12. **Observability & failure recovery.** Metrics (scheduler lag, queue depth, retry/DLQ rates, lease loss), tracing by run_id/task_id, alerting, and a recovery playbook for coordinator/worker/queue/DB failures.
Provide trade-offs, example queries, and example flows.
Quick Answer: A complete walkthrough of designing a distributed, multi-tenant job scheduler for one-off, cron, and fixed-rate jobs — covering the database schema and indexes, public APIs, scheduler/queue/worker architecture, leader election, at-least-once execution with idempotency and retries, and multi-region and clock-skew handling. It dives deep into the classic interview sub-problem of efficiently finding every job due in the next five minutes using time-ordered indexes, timing wheels, and FOR UPDATE SKIP LOCKED leasing.