Design a Cron Job Scheduler
Company: Snowflake
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
Design a **distributed cron job scheduler** — a service that triggers user-defined jobs on recurring, cron-style schedules. Assume **worker (execution) capacity is effectively unlimited**, so the design should focus on *correct, reliable scheduling and state management* rather than on autoscaling the compute that runs the jobs.
The system must support:
- **Recurring jobs** defined by cron-style schedules (e.g. `0 */6 * * *` = every 6 hours).
- **`pause(job_id)`** — stop scheduling *new* runs for that job, but let any already-running executions finish normally.
- **`resume(job_id)`** — allow future scheduled runs to resume.
Produce an end-to-end design covering: **(a)** the public API, **(b)** the data model, **(c)** the core scheduler loop that finds and fires due jobs, **(d)** correct `pause`/`resume` semantics including the race against in-flight scheduling, **(e)** safe operation with multiple scheduler instances, and **(f)** crash recovery and reliability (no lost or silently-dropped triggers).
```hint Where to start
With unlimited workers, the hard part isn't *running* the jobs — it's deciding when each run is due, recording that decision durably, and handing it off, all while replicas race and crash. Ask yourself which of those concerns belong together and which must be kept apart, and where a job's authoritative state should live so the firing logic is just a loop over it.
```
```hint Representing "due"
The scheduler's hot loop has to answer "which jobs should fire right now?" cheaply, even with a million job definitions. What single piece of per-job state would turn that question into one indexed query? And once a job fires, what should the *next* fire time be computed relative to — so a loop that runs a few seconds late doesn't slowly drag the whole cadence off the cron grid?
```
```hint Multiple schedulers without double-firing
When *N* replicas all see the same due job at the same instant, what stops them from firing it *N* times? Pin down the one effect that must happen exactly once per tick, then ask what would let only a single replica "win" that step — and whether recording the run and advancing the schedule can drift apart if they aren't tied together.
```
```hint Don't lose a trigger across a crash
The decision to fire lives in one system; the work to be done ends up in another. A crash can land *between* them. Trace it: if a scheduler dies after it has decided a run is due but before that run is safely handed off, is the trigger gone forever? What property would those two steps need so a crash can never leave one done without the other — and what does that force you to assume about how the receiving side handles a message it might see twice?
```
```hint The pause race
`pause` is just a metadata flip; it must never reach into a running worker. The subtle case is timing: what happens if the scheduler has *already* picked up this job for the current tick at the exact moment `pause` commits? There's more than one defensible answer here — your job is to notice the window exists, decide what you want to happen in it, and say so, rather than assume it can't occur.
```
### Constraints & Assumptions
- **Worker capacity is effectively unlimited** — never gate the design on "not enough workers." The hard problems live in scheduling, not execution.
- **Scale (assume, state your own if different):** on the order of $10^6$ job definitions; tens of thousands of due jobs in the busiest minute. Schedules are mostly minute-granularity cron expressions.
- **Correctness bar:** no *lost* triggers (a due, active job must run), and at-least-once delivery with strong effort to minimize duplicates. Jobs are not assumed idempotent by default, so call out where you rely on idempotency.
- **Availability:** the scheduler must keep firing through single-instance crashes; multiple scheduler replicas run for HA.
- Treat the actual job *body* as an opaque payload (e.g. an HTTP call or a queued task); you are designing the scheduler, not the job logic.
### Clarifying Questions to Ask
- What schedule granularity must we support — minute-level cron only, or down to seconds? Are arbitrary timezones / DST in scope, or is UTC-only acceptable?
- On `resume` after a long pause (or after downtime), should we **backfill** the missed runs, or just resume from the next future occurrence?
- Is **at-least-once** acceptable (rely on idempotent jobs / dedup) or do we need at-most-once for some job classes?
- What's the expected mix: many jobs each firing rarely, or a few jobs firing very frequently? What is the busiest-minute fan-out?
- What does a "run" hand off to — an internal task queue, an HTTP webhook, a Kubernetes Job? What retry and timeout policy is expected on failure?
- How long must run history / audit data be retained?
### What a Strong Answer Covers
- **Clean plane separation:** API → metadata store → scheduler/dispatcher → durable queue → workers, and *why* a queue still helps even with unlimited workers (decoupling, buffering, reliability).
- **Data model** with the fields that actually drive correctness: per-job `status`, `next_run_at`, `last_scheduled_at`, a concurrency token (`version` or lease), and a separate run-history table with a unique idempotency key.
- **A concrete scheduler loop:** query due jobs → claim safely → emit one durable run record → advance `next_run_at` — and the indexing that makes the due-scan cheap at $10^6$ jobs.
- **Correct multi-instance behavior:** the exact mechanism that prevents two replicas from double-firing the same tick.
- **Precise `pause`/`resume` semantics**, including an explicit treatment of the pause-vs-claim race and the backfill-on-resume decision.
- **Crash recovery & no-lost-trigger reasoning:** a durable hand-off that survives a crash between the DB write and the enqueue, stale-`RUNNING` detection via heartbeat/lease timeout, retries, and where at-least-once forces worker idempotency.
- **Scaling levers and bottlenecks** given unlimited workers (due-scan efficiency, DB write throughput, queue publish rate; partitioning / time-bucketing).
- **Observability:** scheduler lag, runs created/min, duplicate-fire count, failure rate, queue depth, and audit logs for pause/resume.
### Follow-up Questions
- Walk through exactly what happens if a scheduler instance crashes **after** inserting the run record but **before** the run reaches the queue. How does your design guarantee the trigger is neither lost nor double-delivered?
- A single job is configured `* * * * *` (every minute) but each run takes ~5 minutes. How do you handle overlap — skip, queue, or allow concurrent runs — and how does the model express that policy?
- The metadata store becomes the throughput bottleneck at $10^6$ jobs. How do you partition or shard scheduling so replicas don't contend, while preserving the no-double-fire guarantee?
- How would you support per-job timezones and survive a daylight-saving transition without dropping or doubling the 2 a.m. run?
Quick Answer: This question evaluates skills in designing reliable distributed schedulers, including durable state modeling, concurrency and race-condition reasoning, idempotency, and fault-tolerant handoff semantics, and is commonly asked to assess a candidate's ability to guarantee correctness and coordination when multiple replicas must manage recurring job triggers. Categorized under system design and distributed systems with overlap into data modeling and consistency, it emphasizes practical application of architectural and operational trade-offs rather than purely theoretical concepts.