Design a distributed job scheduler
Company: Meta
Role: Software Engineer
Category: System Design
Difficulty: Medium
Interview Round: Onsite
##### Question
Design a distributed job scheduler that can run background jobs at specific times or on recurring schedules (similar to cron but scalable and fault-tolerant). Design the system end-to-end.
**Functional requirements**
1. Support **one-time jobs** scheduled to run at a specific timestamp, **immediate (run-now) jobs**, and **recurring jobs** (e.g. "run every 5 minutes", "run every day at 1 AM", cron expressions).
2. Execute jobs on a horizontally scalable **worker fleet** (e.g. via HTTP callbacks, internal RPCs, or messages to another system).
3. Guarantee **at-least-once execution** for every job, with **optional exactly-once** semantics for jobs that need it.
4. Support **retries with backoff** and a **dead-letter queue (DLQ)** for jobs that exhaust their retries.
5. Let clients **create, update, delete, pause/resume, and trigger-now** jobs, and **query job status, execution history, and logs**.
**Non-functional requirements**
1. Horizontal scalability and high availability.
2. Reliability and fault tolerance: minimize duplicate executions and guarantee that every job eventually runs, even when scheduler or worker instances fail.
**Out of scope:** a full workflow / DAG engine with task dependencies (e.g. Airflow).
**In your answer, cover:**
1. High-level architecture and the main components.
2. How you store job definitions and schedules (data model).
3. How **distributed scheduling** (deciding *when* a job should run) is coordinated across multiple scheduler instances without collisions or missed jobs.
4. How workers pick up jobs (leasing / heartbeats) and execute them.
5. How you ensure fault tolerance, retries/backoff/DLQ, and limit duplicate executions (at-least-once vs exactly-once).
6. How the system scales as the number of jobs and execution frequency grow.
7. Monitoring, observability, and operational concerns.
Quick Answer: A Meta software-engineer onsite system-design question: design a scalable, fault-tolerant distributed job scheduler for one-time, immediate, and recurring (cron) jobs. It probes control-plane vs data-plane separation, distributed scheduling without double-runs, worker leasing/heartbeats, retries with backoff and a dead-letter queue, and at-least-once vs exactly-once execution guarantees.
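The leasing, retry/backoff, and dead-letter ideas named in the quick answer can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration, not a prescribed design: the `Job` and `InMemoryJobStore` names, the field names, and the default lease and backoff values are all hypothetical. A real system would back this with a durable store and perform the claim as an atomic conditional update, so that two workers cannot claim the same job.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    run_at: float                      # next due time, epoch seconds
    attempts: int = 0                  # failed attempts so far
    lease_owner: Optional[str] = None  # worker currently holding the lease
    lease_expires_at: float = 0.0
    status: str = "pending"            # pending | done | dead


class InMemoryJobStore:
    """Hypothetical stand-in for a durable job table (single process only)."""

    def __init__(self, max_attempts: int = 3,
                 base_backoff: float = 2.0, max_backoff: float = 60.0):
        self.jobs: Dict[str, Job] = {}
        self.dead_letter: List[str] = []   # ids of jobs that exhausted retries
        self.max_attempts = max_attempts
        self.base_backoff = base_backoff
        self.max_backoff = max_backoff

    def add(self, job: Job) -> None:
        self.jobs[job.job_id] = job

    def claim_due(self, worker_id: str, now: float,
                  lease_seconds: float = 30.0) -> Optional[Job]:
        """Lease one due job whose lease is free or has expired."""
        for job in self.jobs.values():
            lease_free = job.lease_owner is None or job.lease_expires_at <= now
            if job.status == "pending" and job.run_at <= now and lease_free:
                job.lease_owner = worker_id
                job.lease_expires_at = now + lease_seconds
                return job
        return None

    def heartbeat(self, job_id: str, worker_id: str, now: float,
                  lease_seconds: float = 30.0) -> bool:
        """Extend the lease; fails if the worker no longer holds a live lease."""
        job = self.jobs[job_id]
        if job.lease_owner != worker_id or job.lease_expires_at <= now:
            return False
        job.lease_expires_at = now + lease_seconds
        return True

    def complete(self, job_id: str, worker_id: str) -> bool:
        """Mark the job done, but only for the current lease owner."""
        job = self.jobs[job_id]
        if job.lease_owner != worker_id:
            return False
        job.status = "done"
        job.lease_owner = None
        return True

    def fail(self, job_id: str, worker_id: str, now: float) -> bool:
        """Record a failure: reschedule with exponential backoff or dead-letter."""
        job = self.jobs[job_id]
        if job.lease_owner != worker_id:
            return False
        job.attempts += 1
        job.lease_owner = None
        if job.attempts >= self.max_attempts:
            job.status = "dead"
            self.dead_letter.append(job_id)
        else:
            delay = min(self.max_backoff,
                        self.base_backoff * 2 ** (job.attempts - 1))
            job.run_at = now + delay
        return True
```

In this sketch a worker that stops heartbeating loses its lease once it expires, so another worker can reclaim the job. That is what gives at-least-once behavior: the original worker may still have executed the job, so handlers should be idempotent if duplicates matter. The ownership check in `complete` and `fail` keeps a stale worker from overwriting the state set by whichever worker took over.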