Design a distributed job scheduler
Company: Robinhood
Role: Software Engineer
Category: System Design
Interview Round: Onsite
Design a **distributed job scheduling system** for internal engineering teams (think "cron as a service"). Teams register jobs through an API; the platform reliably triggers them at the right time, runs them on a pool of workers, and lets owners observe, retry, and manage them.
The system must support:
- **One-time jobs** scheduled for a specific future timestamp.
- **Recurring jobs** with cron-like schedules.
- **Per-job priority**.
- **Retries with exponential backoff** and a dead-letter path after max attempts.
- **Pause, resume, and cancel** operations on a job.
- **Job status and run-history queries**.
- **Worker heartbeats and failure recovery** (a crashed worker's job must not be lost).
- **Execution logs and audit history** that owners can search.
Walk through the **APIs**, the **data model**, how **due jobs are selected and dispatched**, how **workers safely claim jobs**, how the system **recovers from scheduler and worker crashes without losing jobs**, and how to **store and search execution logs** without overloading the primary transactional database.
```hint Where to start — what do you persist?
A recurring "every minute" job runs forever, so you clearly can't store a row per future occurrence. What is the *minimal* durable record that still guarantees no run is ever lost? And as a run actually executes, what transient state shows up — which worker has it, which attempt this is, did it succeed — that doesn't belong on the long-lived record? Drawing this line cleanly is what makes recurring jobs cheap and crash recovery tractable; getting it wrong forces you to either duplicate definitions or lose per-run history.
```
```hint Finding what's due — at a few hundred/sec
Suppose you scan the whole job table every second asking "what's due now?" As the table grows, what fraction of every scan is wasted re-reading jobs scheduled for next week or next year, and what breaks first? Push on three pressures: how to stop re-reading cold far-future rows, how to spread the scanning across many scheduler instances without two of them firing the same job, and how to get sub-second trigger precision once you've narrowed down to the handful of jobs due in the next few seconds.
```
```hint Handing a job to a worker — and getting it back if the worker dies
A worker grabs a job and then its process vanishes mid-run. A permanently-held lock would strand that job forever, but blindly handing it to a second worker risks running it twice at once. What mechanism lets ownership *expire* and be re-claimed by someone else only after the original owner has clearly gone silent — and how does a still-healthy long-running worker signal "I'm alive, don't take this from me"? Once you admit that a job can be re-delivered after a crash, what does that force you to promise about delivery, and what must the job's handler then tolerate?
```
```hint Two failure modes that bite later
(1) Creating a run involves two stores: you record it durably, then you push it onto a queue for a worker. What happens if the process dies *between* those two steps? Reason about how to make "record it" and "enqueue it" recoverable so you never end up with a lost trigger or a phantom one. (2) Execution logs are orders of magnitude larger than the job metadata. If you put them in the transactional database, it bloats fast — so where do the bulk logs live instead, what tiny reference stays behind so owners can still find them, and how do you keep them *searchable* by job, tenant, and error?
```
### Constraints & Assumptions
- Scale: **millions of jobs triggered per day** (assume a few hundred to low thousands of triggers/second at peak), tens of thousands of distinct job definitions, and a worker fleet in the hundreds.
- **Low trigger latency** for near-term jobs: a job due at time $T$ should start within a small bounded delay (e.g. a couple of seconds), not minutes.
- **Durability is non-negotiable**: once the API acknowledges a job, it must never be silently dropped, even across scheduler/worker/DB restarts.
- The platform is **horizontally scalable** and multi-tenant (jobs belong to teams/`tenant_id`); one noisy team must not starve others.
- Job payloads are small (a command, container ref, or HTTP target + args); the work itself runs inside the worker, not the scheduler.
- Execution logs (stdout/stderr/structured events) can be **orders of magnitude larger** than the job metadata.
### Clarifying Questions to Ask
- What delivery semantics are acceptable — at-least-once with idempotent handlers, or is exactly-once a hard requirement?
- How precise must trigger timing be (sub-second, single-digit seconds, best-effort minutes)?
- Are payloads bounded in size, and does the scheduler execute the work or only dispatch it to workers?
- What is the retention requirement for execution logs and audit history?
- Is the system multi-tenant, and do we need fairness/isolation guarantees between teams?
- How long can the longest-running job run, and do we need to support job timeouts and cancellation of in-flight runs?
### What a Strong Answer Covers
- A principled split between the **durable job record** and the **per-run record**, and why that split keeps recurring jobs cheap.
- A **scheduling strategy** that avoids full-table scans and explains the latency/cost trade-off versus naive polling.
- A concrete **data model** with the fields that drive scheduling, claiming, retries, and status.
- The **read/write path**: durable write before acknowledging creation; how due jobs reach workers; how recurring jobs compute the next occurrence.
- **Worker safety**: how a job is claimed, how a crashed worker's job is recovered, and an explicit, defended **delivery-guarantee** argument.
- **Failure recovery** for scheduler, worker, queue, and DB — including how scheduler work is divided and the write-to-two-stores consistency problem.
- **Scaling levers**: partitioned work, priority without starvation, cold/hot job separation, per-tenant fairness, worker autoscaling.
- **Log/observability design** that keeps bulk logs out of the OLTP database while remaining searchable, plus the key metrics to alert on.
### Follow-up Questions
- A team's recurring job runs every minute but each run takes 90 seconds. How do you handle overlapping runs, and what policy do you expose (skip, queue, allow-concurrent)?
- A bug causes thousands of jobs to fail and retry simultaneously, creating a retry storm. How do you prevent it from overwhelming the workers and the downstream services they call?
- How do you guarantee high-priority jobs are favored without permanently starving low-priority jobs?
- How would you add support for **job dependencies / DAGs** (job B runs only after job A succeeds) on top of this design?
Quick Answer: This question evaluates understanding of distributed systems, reliable job scheduling and dispatch, fault-tolerant worker coordination (heartbeats and recovery), durable data modeling for recurring and one-time jobs, API design, prioritization and retry semantics, and scalable observability for execution logs and audit history.