##### Question Design a horizontally scalable, multi-tenant job scheduler service that supports both one-off / ad-hoc jobs and recurring (cron and fixed-rate) jobs. It enqueues tasks, dispatches them to workers, and provides at-least-once execution with idempotency. Address the following: 1. **Requirements & assumptions.** Functional (cron, ad-hoc, fixed-rate; pause/resume/cancel; priorities; multi-tenant isolation; multi-region) and non-functional (availability, scale to millions of jobs and 10^5+ tasks/min, low scheduler skew, durability). 2. **High-level architecture.** Core components: ingress/API, scheduler/coordinator, durable queue/broker, stateless workers, persistence, and timers. Show how a request flows from creation to execution. 3. **Data model / database schema.** Tables for jobs, schedules, executions/runs, workers, and dead-letters — e.g. id, tenant_id, payload, schedule type/cron/timezone, next_run_at, status, priority, retry policy, dedupe key, shard key, lease fields, version, updated_at. Specify the indexes and partitioning strategy. 4. **Public APIs.** Create, update, pause, resume, cancel, and query jobs; list upcoming jobs; request an immediate run; query executions. Include idempotency keys and optimistic-concurrency (ETag/version) semantics. 5. **Leader election & sharding.** How coordinators partition the job space, elect a per-shard leader (leases + fencing tokens), and avoid split-brain or double-enqueue. 6. **Time source & cron parsing.** UTC internally, IANA timezone per job, DST handling (spring-forward / fall-back policy), and catch-up / backfill guardrails. 7. **Efficiently finding all jobs due in the next 5 minutes at scale.** Indexing on time-ordered keys, range scans with keyset pagination, in-memory min-heaps / hierarchical timing wheels, optional Redis ZSET or per-minute "ready buckets," and contention control (FOR UPDATE SKIP LOCKED, optimistic version CAS). 8. **Reliability.** At-least-once delivery with idempotent handlers, retries with exponential backoff + jitter, deduplication (idempotency keys + dedupe windows), visibility timeouts / leases with heartbeats, and dead-lettering of poison tasks. 9. **Task dependencies (DAG).** Optionally, jobs composed of dependent tasks: track in-degree per run, enqueue dependents as predecessors succeed, and define a failure policy (fail-fast vs continue). 10. **Scaling, consistency, fairness.** Horizontal scaling of each tier; at-least-once vs exactly-once trade-offs and effective-exactly-once patterns; per-tenant quotas, weighted fairness, and starvation avoidance. 11. **Multi-region operation & clock skew.** Region-local execution with shard ownership and failover; clock-skew mitigation (NTP/PTP, monotonic clocks, epsilon windows, generous lease margins). 12. **Observability & failure recovery.** Metrics (scheduler lag, queue depth, retry/DLQ rates, lease loss), tracing by run_id/task_id, alerting, and a recovery playbook for coordinator/worker/queue/DB failures. Provide trade-offs, example queries, and example flows.

This interview question evaluates requirements, scale assumptions, API/data design, architecture, trade-offs, failure modes, and rollout in a realistic interview setting. A strong answer for Design a distributed job scheduler service states assumptions, handles edge cases, explains trade-offs, and shows how to validate the result clearly.

How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Onsite rounds at Amazon.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Amazon during technical interviews.

Design a distributed job scheduler service

Design a horizontally scalable, multi-tenant job scheduler service that supports both one-off / ad-hoc jobs and recurring (cron and fixed-rate) jobs. It enqueues tasks, dispatches them to workers, and provides at-least-once execution with idempotency. Address the following:

Requirements & assumptions. Functional (cron, ad-hoc, fixed-rate; pause/resume/cancel; priorities; multi-tenant isolation; multi-region) and non-functional (availability, scale to millions of jobs and 10^5+ tasks/min, low scheduler skew, durability).
High-level architecture. Core components: ingress/API, scheduler/coordinator, durable queue/broker, stateless workers, persistence, and timers. Show how a request flows from creation to execution.
Data model / database schema. Tables for jobs, schedules, executions/runs, workers, and dead-letters — e.g. id, tenant_id, payload, schedule type/cron/timezone, next_run_at, status, priority, retry policy, dedupe key, shard key, lease fields, version, updated_at. Specify the indexes and partitioning strategy.
Public APIs. Create, update, pause, resume, cancel, and query jobs; list upcoming jobs; request an immediate run; query executions. Include idempotency keys and optimistic-concurrency (ETag/version) semantics.
Leader election & sharding. How coordinators partition the job space, elect a per-shard leader (leases + fencing tokens), and avoid split-brain or double-enqueue.
Time source & cron parsing. UTC internally, IANA timezone per job, DST handling (spring-forward / fall-back policy), and catch-up / backfill guardrails.
Efficiently finding all jobs due in the next 5 minutes at scale. Indexing on time-ordered keys, range scans with keyset pagination, in-memory min-heaps / hierarchical timing wheels, optional Redis ZSET or per-minute "ready buckets," and contention control (FOR UPDATE SKIP LOCKED, optimistic version CAS).
Reliability. At-least-once delivery with idempotent handlers, retries with exponential backoff + jitter, deduplication (idempotency keys + dedupe windows), visibility timeouts / leases with heartbeats, and dead-lettering of poison tasks.
Task dependencies (DAG). Optionally, jobs composed of dependent tasks: track in-degree per run, enqueue dependents as predecessors succeed, and define a failure policy (fail-fast vs continue).
Scaling, consistency, fairness. Horizontal scaling of each tier; at-least-once vs exactly-once trade-offs and effective-exactly-once patterns; per-tenant quotas, weighted fairness, and starvation avoidance.
Multi-region operation & clock skew. Region-local execution with shard ownership and failover; clock-skew mitigation (NTP/PTP, monotonic clocks, epsilon windows, generous lease margins).
Observability & failure recovery. Metrics (scheduler lag, queue depth, retry/DLQ rates, lease loss), tracing by run_id/task_id, alerting, and a recovery playbook for coordinator/worker/queue/DB failures.

Provide trade-offs, example queries, and example flows.

Constraints & Assumptions

Preserve the scope, facts, inputs, and requested outputs from the prompt above.
If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.

Clarifying Questions to Ask

Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
State explicit assumptions before making sizing or architecture decisions.
Prioritize the functional path first, then address reliability, security, observability, and rollout.

What a Strong Answer Covers

A scoped requirements summary with concrete non-goals and success metrics.
API, data model, architecture, consistency, capacity, and operations.
Reasoned trade-offs among simple and scalable designs, including bottlenecks and failure modes.
A validation, monitoring, migration, and launch plan appropriate for the risk level.

Follow-up Questions

What breaks first at 10x traffic or data volume?
How would you degrade gracefully during dependency failures?
What metrics and alerts would prove the design is healthy after launch?