PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/System Design/Amazon

Design delayed job scheduler (LLD)

Last updated: Jun 26, 2026

Quick Overview

This question evaluates system design and distributed-systems competencies, including scheduling algorithms, durable persistence and recovery, concurrency control, delivery semantics, and operational concerns, and it belongs to the System Design category.

  • hard
  • Amazon
  • System Design
  • Software Engineer

Design delayed job scheduler (LLD)

Company: Amazon

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Onsite

Design a system that schedules a job to execute X seconds in the future (LLD). Define APIs (e.g., schedule(job, delaySeconds) -> jobId, cancel(jobId)), data structures (e.g., time wheel, min-heap by due time), persistence for durability, worker execution model, handling of restarts and clock drift, idempotency/exactly-once vs at-least-once semantics, concurrency limits, and time complexity. Provide class diagrams and discuss how you would test it.

Quick Answer: This question evaluates system design and distributed-systems competencies, including scheduling algorithms, durable persistence and recovery, concurrency control, delivery semantics, and operational concerns, and it belongs to the System Design category.

Related Interview Questions

  • Design a Library Management System (API and Schema) - Amazon (hard)
  • Design a Perishable-Goods Inventory and Location Tracking System - Amazon (medium)
  • Design a Log Collection System - Amazon (medium)
  • Design Human Avoidance for Warehouse Robots - Amazon (medium)
  • Design a High-Availability Load Balancer - Amazon (hard)
|Home/System Design/Amazon

Design delayed job scheduler (LLD)

Amazon logo
Amazon
Aug 7, 2025, 12:00 AM
hardSoftware EngineerOnsiteSystem Design
24
0

Design a Delayed Job Scheduler (Low-Level Design)

Design a service that schedules a job to execute X seconds in the future with second-level accuracy. Produce a low-level design covering the public APIs, the in-memory and on-disk data structures, persistence, the execution model, and the operational concerns enumerated below.

This is an LLD prompt: the interviewer wants concrete APIs, data structures with their time/space complexity, a durable schema, an explicit state machine, an execution/lease protocol, and class responsibilities — not a boxes-and-arrows HLD diagram.

Constraints & Assumptions

  • Second-level precision is sufficient; sub-second (tens-of-ms) jitter is acceptable. The system should never fire a job early relative to its due time — lateness is the only acceptable error.
  • The system must survive process restarts and crashes without losing scheduled jobs: a schedule() call that returns success must be durable even if the process dies one millisecond later.
  • At-least-once delivery is acceptable by default; you should also discuss what it would take to move toward exactly-once.
  • Assume a target of up to roughly 10610^6106 – 10710^7107 pending jobs with low-thousands schedule QPS, and that multiple scheduler instances run for high availability.

Clarifying Questions to Ask

Before designing, scope the problem with the interviewer:

  • What is the acceptable firing accuracy and is it one-sided (never early vs. bounded lateness)?
  • What delivery semantics are required — at-least-once, or must we attempt exactly-once?
  • How do jobs execute — an in-process handler, an HTTP callback, or a message onto a queue? Are handlers idempotent?
  • What is the expected scale : number of pending jobs, schedule QPS, and maximum delay horizon (seconds vs. months)?
  • Is this single-tenant or multi-tenant ? Do we need per-tenant fairness and quotas?
  • What are the availability expectations (single node vs. HA across instances)?

What to Design

Part 1 — Public APIs

Define and justify the public interface, including at least:

  • schedule(job, delaySeconds) -> jobId
  • scheduleAt(job, epochMillis) -> jobId
  • cancel(jobId) -> success/failure
  • getStatus(jobId) -> job metadata
  • Optional: reschedule(jobId, newDelaySeconds)

Specify the options each call accepts, what getStatus returns, and how you validate inputs.

What This Part Should Cover Premium

Part 2 — Data structures

Propose and justify your in-memory scheduling structure. Compare candidates such as a min-heap by due time, a timing wheel, or a hybrid, giving the time and space complexity of each and explaining why you chose yours.

What This Part Should Cover Premium

Part 3 — Persistence and durability

Describe how jobs are durably stored and recovered after restarts or crashes. Give a concrete schema (columns, indexes) and state exactly when the client is acknowledged relative to the durable write.

What This Part Should Cover Premium

Part 4 — Worker execution model

Explain how due jobs are dispatched and executed, including the leasing / ack protocol: how a worker claims a job, what happens on success, on handler failure, and on worker crash.

What This Part Should Cover Premium

Part 5 — Restarts and clock drift

Describe the recovery logic on startup, and your timekeeping choices (wall-clock vs. monotonic) — including how you handle clock drift and NTP steps (forward and backward).

What This Part Should Cover Premium

Part 6 — Delivery semantics

Discuss idempotency and the at-least-once vs. exactly-once trade-offs. State your default and what (if anything) makes true exactly-once achievable.

What This Part Should Cover Premium

Part 7 — Concurrency limits

Describe global and per-tenant limits, plus how you ensure fairness across tenants under backlog.

What This Part Should Cover Premium

Part 8 — Time complexity

Give the time complexity for schedule, cancel, and due-job retrieval, distinguishing the in-memory path from the database path.

What This Part Should Cover Premium

Part 9 — Class diagram

Lay out the main classes, their relationships, and key methods (e.g., the store, the per-shard timer, the dispatcher, the lease manager, the worker pool, the concurrency limiter, the clock).

What This Part Should Cover Premium

Part 10 — Testing

Provide a testing strategy spanning unit, integration, reliability/durability, and performance tiers.

What This Part Should Cover Premium

What a Strong Answer Covers Premium

Follow-up Questions

Be ready for the interviewer to push further:

  • How does the design change at 100x scale (e.g. 10910^9109 pending jobs or a 10x QPS spike)? What breaks first — the DB index, the loader scan, or worker throughput?
  • After a long outage , the due-time scan returns a massive overdue backlog at once (Part 5). How do you drain it without an OOM or a thundering herd on downstreams?
  • During an owner failover , two scheduler instances briefly target the same shard. What guarantees a given job still executes at most-once-ish (no double-execution beyond the at-least-once budget of Part 6)?
  • A downstream endpoint starts timing out . How do retries, backoff, per-endpoint concurrency caps (Part 7), and the dead-letter path interact to avoid amplifying the outage?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Amazon•More Software Engineer•Amazon Software Engineer•Amazon System Design•Software Engineer System Design

Your design canvas — auto-saved

PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.