PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Amazon

Design a distributed job scheduler service

Last updated: Jun 15, 2026

Quick Overview

A complete walkthrough of designing a distributed, multi-tenant job scheduler for one-off, cron, and fixed-rate jobs — covering the database schema and indexes, public APIs, scheduler/queue/worker architecture, leader election, at-least-once execution with idempotency and retries, and multi-region and clock-skew handling. It dives deep into the classic interview sub-problem of efficiently finding every job due in the next five minutes using time-ordered indexes, timing wheels, and FOR UPDATE SKIP LOCKED leasing.

  • hard
  • Amazon
  • System Design
  • Software Engineer

Design a distributed job scheduler service

Company: Amazon

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

##### Question Design a horizontally scalable, multi-tenant job scheduler service that supports both one-off / ad-hoc jobs and recurring (cron and fixed-rate) jobs. It enqueues tasks, dispatches them to workers, and provides at-least-once execution with idempotency. Address the following: 1. **Requirements & assumptions.** Functional (cron, ad-hoc, fixed-rate; pause/resume/cancel; priorities; multi-tenant isolation; multi-region) and non-functional (availability, scale to millions of jobs and 10^5+ tasks/min, low scheduler skew, durability). 2. **High-level architecture.** Core components: ingress/API, scheduler/coordinator, durable queue/broker, stateless workers, persistence, and timers. Show how a request flows from creation to execution. 3. **Data model / database schema.** Tables for jobs, schedules, executions/runs, workers, and dead-letters — e.g. id, tenant_id, payload, schedule type/cron/timezone, next_run_at, status, priority, retry policy, dedupe key, shard key, lease fields, version, updated_at. Specify the indexes and partitioning strategy. 4. **Public APIs.** Create, update, pause, resume, cancel, and query jobs; list upcoming jobs; request an immediate run; query executions. Include idempotency keys and optimistic-concurrency (ETag/version) semantics. 5. **Leader election & sharding.** How coordinators partition the job space, elect a per-shard leader (leases + fencing tokens), and avoid split-brain or double-enqueue. 6. **Time source & cron parsing.** UTC internally, IANA timezone per job, DST handling (spring-forward / fall-back policy), and catch-up / backfill guardrails. 7. **Efficiently finding all jobs due in the next 5 minutes at scale.** Indexing on time-ordered keys, range scans with keyset pagination, in-memory min-heaps / hierarchical timing wheels, optional Redis ZSET or per-minute "ready buckets," and contention control (FOR UPDATE SKIP LOCKED, optimistic version CAS). 8. **Reliability.** At-least-once delivery with idempotent handlers, retries with exponential backoff + jitter, deduplication (idempotency keys + dedupe windows), visibility timeouts / leases with heartbeats, and dead-lettering of poison tasks. 9. **Task dependencies (DAG).** Optionally, jobs composed of dependent tasks: track in-degree per run, enqueue dependents as predecessors succeed, and define a failure policy (fail-fast vs continue). 10. **Scaling, consistency, fairness.** Horizontal scaling of each tier; at-least-once vs exactly-once trade-offs and effective-exactly-once patterns; per-tenant quotas, weighted fairness, and starvation avoidance. 11. **Multi-region operation & clock skew.** Region-local execution with shard ownership and failover; clock-skew mitigation (NTP/PTP, monotonic clocks, epsilon windows, generous lease margins). 12. **Observability & failure recovery.** Metrics (scheduler lag, queue depth, retry/DLQ rates, lease loss), tracing by run_id/task_id, alerting, and a recovery playbook for coordinator/worker/queue/DB failures. Provide trade-offs, example queries, and example flows.

Quick Answer: A complete walkthrough of designing a distributed, multi-tenant job scheduler for one-off, cron, and fixed-rate jobs — covering the database schema and indexes, public APIs, scheduler/queue/worker architecture, leader election, at-least-once execution with idempotency and retries, and multi-region and clock-skew handling. It dives deep into the classic interview sub-problem of efficiently finding every job due in the next five minutes using time-ordered indexes, timing wheels, and FOR UPDATE SKIP LOCKED leasing.

Related Interview Questions

  • Design a Log Collection System - Amazon (medium)
  • Design Human Avoidance for Warehouse Robots - Amazon (medium)
  • Design a High-Availability Load Balancer - Amazon (hard)
  • Design a Ride-Hailing Matching System - Amazon (medium)
  • Design a cloud database write path and recovery - Amazon (hard)
Amazon logo
Amazon
Aug 10, 2025, 12:00 AM
Software Engineer
Onsite
System Design
2
0
Question

Design a horizontally scalable, multi-tenant job scheduler service that supports both one-off / ad-hoc jobs and recurring (cron and fixed-rate) jobs. It enqueues tasks, dispatches them to workers, and provides at-least-once execution with idempotency. Address the following:

  1. Requirements & assumptions. Functional (cron, ad-hoc, fixed-rate; pause/resume/cancel; priorities; multi-tenant isolation; multi-region) and non-functional (availability, scale to millions of jobs and 10^5+ tasks/min, low scheduler skew, durability).
  2. High-level architecture. Core components: ingress/API, scheduler/coordinator, durable queue/broker, stateless workers, persistence, and timers. Show how a request flows from creation to execution.
  3. Data model / database schema. Tables for jobs, schedules, executions/runs, workers, and dead-letters — e.g. id, tenant_id, payload, schedule type/cron/timezone, next_run_at, status, priority, retry policy, dedupe key, shard key, lease fields, version, updated_at. Specify the indexes and partitioning strategy.
  4. Public APIs. Create, update, pause, resume, cancel, and query jobs; list upcoming jobs; request an immediate run; query executions. Include idempotency keys and optimistic-concurrency (ETag/version) semantics.
  5. Leader election & sharding. How coordinators partition the job space, elect a per-shard leader (leases + fencing tokens), and avoid split-brain or double-enqueue.
  6. Time source & cron parsing. UTC internally, IANA timezone per job, DST handling (spring-forward / fall-back policy), and catch-up / backfill guardrails.
  7. Efficiently finding all jobs due in the next 5 minutes at scale. Indexing on time-ordered keys, range scans with keyset pagination, in-memory min-heaps / hierarchical timing wheels, optional Redis ZSET or per-minute "ready buckets," and contention control (FOR UPDATE SKIP LOCKED, optimistic version CAS).
  8. Reliability. At-least-once delivery with idempotent handlers, retries with exponential backoff + jitter, deduplication (idempotency keys + dedupe windows), visibility timeouts / leases with heartbeats, and dead-lettering of poison tasks.
  9. Task dependencies (DAG). Optionally, jobs composed of dependent tasks: track in-degree per run, enqueue dependents as predecessors succeed, and define a failure policy (fail-fast vs continue).
  10. Scaling, consistency, fairness. Horizontal scaling of each tier; at-least-once vs exactly-once trade-offs and effective-exactly-once patterns; per-tenant quotas, weighted fairness, and starvation avoidance.
  11. Multi-region operation & clock skew. Region-local execution with shard ownership and failover; clock-skew mitigation (NTP/PTP, monotonic clocks, epsilon windows, generous lease margins).
  12. Observability & failure recovery. Metrics (scheduler lag, queue depth, retry/DLQ rates, lease loss), tracing by run_id/task_id, alerting, and a recovery playbook for coordinator/worker/queue/DB failures.

Provide trade-offs, example queries, and example flows.

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Amazon•More Software Engineer•Amazon Software Engineer•Amazon System Design•Software Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.