This question evaluates system design and distributed-systems competencies, including scheduling algorithms, durable persistence and recovery, concurrency control, delivery semantics, and operational concerns, and it belongs to the System Design category.
Design a system that schedules a job to execute X seconds in the future (LLD). Define APIs (e.g., schedule(job, delaySeconds) -> jobId, cancel(jobId)), data structures (e.g., time wheel, min-heap by due time), persistence for durability, worker execution model, handling of restarts and clock drift, idempotency/exactly-once vs at-least-once semantics, concurrency limits, and time complexity. Provide class diagrams and discuss how you would test it.
Quick Answer: This question evaluates system design and distributed-systems competencies, including scheduling algorithms, durable persistence and recovery, concurrency control, delivery semantics, and operational concerns, and it belongs to the System Design category.
Design a service that schedules a job to execute X seconds in the future with second-level accuracy. Produce a low-level design covering the public APIs, the in-memory and on-disk data structures, persistence, the execution model, and the operational concerns enumerated below.
This is an LLD prompt: the interviewer wants concrete APIs, data structures with their time/space complexity, a durable schema, an explicit state machine, an execution/lease protocol, and class responsibilities — not a boxes-and-arrows HLD diagram.
Constraints & Assumptions
Second-level precision
is sufficient; sub-second (tens-of-ms) jitter is acceptable. The system should never fire a job
early
relative to its due time — lateness is the only acceptable error.
The system
must survive process restarts and crashes
without losing scheduled jobs: a
schedule()
call that returns success must be durable even if the process dies one millisecond later.
At-least-once delivery
is acceptable by default; you should also discuss what it would take to move toward exactly-once.
Assume a target of up to roughly
106
–
107
pending jobs with low-thousands schedule QPS, and that
multiple scheduler instances
run for high availability.
Clarifying Questions to Ask
Before designing, scope the problem with the interviewer:
What is the acceptable
firing accuracy
and is it one-sided (never early vs. bounded lateness)?
What
delivery semantics
are required — at-least-once, or must we attempt exactly-once?
How do jobs
execute
— an in-process handler, an HTTP callback, or a message onto a queue? Are handlers idempotent?
What is the expected
scale
: number of pending jobs, schedule QPS, and maximum delay horizon (seconds vs. months)?
Is this
single-tenant or multi-tenant
? Do we need per-tenant fairness and quotas?
What are the
availability
expectations (single node vs. HA across instances)?
What to Design
Part 1 — Public APIs
Define and justify the public interface, including at least:
schedule(job, delaySeconds) -> jobId
scheduleAt(job, epochMillis) -> jobId
cancel(jobId) -> success/failure
getStatus(jobId) -> job metadata
Optional:reschedule(jobId, newDelaySeconds)
Specify the options each call accepts, what getStatus returns, and how you validate inputs.
What This Part Should Cover Premium
Part 2 — Data structures
Propose and justify your in-memory scheduling structure. Compare candidates such as a min-heap by due time, a timing wheel, or a hybrid, giving the time and space complexity of each and explaining why you chose yours.
What This Part Should Cover Premium
Part 3 — Persistence and durability
Describe how jobs are durably stored and recovered after restarts or crashes. Give a concrete schema (columns, indexes) and state exactly when the client is acknowledged relative to the durable write.
What This Part Should Cover Premium
Part 4 — Worker execution model
Explain how due jobs are dispatched and executed, including the leasing / ack protocol: how a worker claims a job, what happens on success, on handler failure, and on worker crash.
What This Part Should Cover Premium
Part 5 — Restarts and clock drift
Describe the recovery logic on startup, and your timekeeping choices (wall-clock vs. monotonic) — including how you handle clock drift and NTP steps (forward and backward).
What This Part Should Cover Premium
Part 6 — Delivery semantics
Discuss idempotency and the at-least-once vs. exactly-once trade-offs. State your default and what (if anything) makes true exactly-once achievable.
What This Part Should Cover Premium
Part 7 — Concurrency limits
Describe global and per-tenant limits, plus how you ensure fairness across tenants under backlog.
What This Part Should Cover Premium
Part 8 — Time complexity
Give the time complexity for schedule, cancel, and due-job retrieval, distinguishing the in-memory path from the database path.
What This Part Should Cover Premium
Part 9 — Class diagram
Lay out the main classes, their relationships, and key methods (e.g., the store, the per-shard timer, the dispatcher, the lease manager, the worker pool, the concurrency limiter, the clock).
What This Part Should Cover Premium
Part 10 — Testing
Provide a testing strategy spanning unit, integration, reliability/durability, and performance tiers.
What This Part Should Cover Premium
What a Strong Answer Covers Premium
Follow-up Questions
Be ready for the interviewer to push further:
How does the design change at
100x scale
(e.g.
109
pending jobs or a 10x QPS spike)? What breaks first — the DB index, the loader scan, or worker throughput?
After a
long outage
, the due-time scan returns a massive overdue backlog at once (Part 5). How do you drain it without an OOM or a thundering herd on downstreams?
During an
owner failover
, two scheduler instances briefly target the same shard. What guarantees a given job still executes at most-once-ish (no double-execution beyond the at-least-once budget of Part 6)?
A downstream endpoint starts
timing out
. How do retries, backoff, per-endpoint concurrency caps (Part 7), and the dead-letter path interact to avoid amplifying the outage?