PracHub

Design a distributed job scheduler

Last updated: Jun 15, 2026

Quick Overview

A Meta software-engineer onsite system-design question: design a scalable, fault-tolerant distributed job scheduler for one-time, immediate, and recurring (cron) jobs. It probes control-plane vs data-plane separation, distributed scheduling without double-runs, worker leasing/heartbeats, retries with backoff and a dead-letter queue, and at-least-once vs exactly-once execution guarantees.


Company: Meta

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite



Dec 8, 2025, 6:32 PM
Question

Design a distributed job scheduler that can run background jobs at specific times or on recurring schedules (similar to cron but scalable and fault-tolerant). Design the system end-to-end.

Functional requirements

  1. Support one-time jobs scheduled to run at a specific timestamp, immediate (run-now) jobs, and recurring jobs (e.g. "run every 5 minutes", "run every day at 1 AM", cron expressions).
  2. Execute jobs on a horizontally scalable worker fleet (e.g. via HTTP callbacks, internal RPCs, or messages to another system).
  3. Provide at-least-once execution by default, with optional exactly-once semantics for jobs that need it.
  4. Support retries with backoff and a dead-letter queue (DLQ) for jobs that exhaust their retries.
  5. Let clients create, update, delete, pause/resume, and trigger-now jobs, and query job status, execution history, and logs.

Non-functional requirements

  6. Horizontal scalability and high availability.
  7. Reliability and fault tolerance: avoid duplicate executions as much as possible and guarantee eventual execution even under instance failures.

Out of scope: a full workflow / DAG engine with task dependencies (e.g. Airflow).
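
To make requirement 4 concrete, here is a minimal sketch of one way retries with backoff and a dead-letter queue could fit together. It is an illustration under stated assumptions, not a reference answer: the names `RetryPolicy`, `Execution`, and `on_failure`, the default values, and the choice of exponential backoff with full jitter are all hypothetical, and plain Python lists stand in for a durable retry queue and DLQ.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RetryPolicy:
    # All defaults are hypothetical; a real system would set them per job.
    max_attempts: int = 5        # total attempts, including the first run
    base_delay_s: float = 1.0    # delay cap before the first retry
    max_delay_s: float = 300.0   # upper bound on any single delay

    def delay_for(self, attempt: int) -> float:
        """Full-jitter delay after failed attempt number `attempt` (1-based)."""
        cap = min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 1))
        return random.uniform(0, cap)


@dataclass
class Execution:
    job_id: str
    attempt: int = 0  # number of failed attempts so far


def on_failure(
    execution: Execution,
    policy: RetryPolicy,
    retry_queue: List[Tuple[float, Execution]],
    dead_letter_queue: List[Execution],
    now: float,
) -> Optional[float]:
    """Record a failed attempt; reschedule it or move it to the DLQ.

    Returns the next run time, or None if retries are exhausted.
    """
    execution.attempt += 1
    if execution.attempt >= policy.max_attempts:
        dead_letter_queue.append(execution)  # parked for manual inspection
        return None
    run_at = now + policy.delay_for(execution.attempt)
    retry_queue.append((run_at, execution))
    return run_at
```

Jitter spreads retries out so that many jobs failing against the same downstream dependency do not all retry at the same instant. Capping the delay keeps a long-failing job from being pushed arbitrarily far into the future.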

In your answer, cover:

  1. High-level architecture and the main components.
  2. How you store job definitions and schedules (data model).
  3. How distributed scheduling (deciding when a job should run) is coordinated across multiple scheduler instances without collisions or missed jobs.
  4. How workers pick up jobs (leasing / heartbeats) and execute them.
  5. How you ensure fault tolerance, retries/backoff/DLQ, and limit duplicate executions (at-least-once vs exactly-once).
  6. How the system scales as the number of jobs and execution frequency grows.
  7. Monitoring, observability, and operational concerns.
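
As a starting point for item 4, the sketch below shows one possible shape for worker leasing with heartbeats. It is an assumption-laden illustration rather than a prescribed design: `LeaseTable`, its method names, and the 30-second TTL are hypothetical, and an in-memory dictionary stands in for a store that supports atomic conditional updates. The integer token is a fencing token, so a worker that lost its lease can be rejected when it reports back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str] = None
    expires_at: float = 0.0
    token: int = 0  # fencing token, bumped on every new acquisition


class LeaseTable:
    """In-memory stand-in for a store with atomic conditional updates."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s  # hypothetical lease duration
        self._leases = {}

    def acquire(self, execution_id: str, worker_id: str, now: float) -> Optional[int]:
        """Claim an execution if it is unleased or its lease has expired."""
        lease = self._leases.setdefault(execution_id, Lease())
        if lease.owner is not None and lease.expires_at > now:
            return None  # another worker holds a live lease
        lease.owner = worker_id
        lease.expires_at = now + self.ttl_s
        lease.token += 1
        return lease.token

    def heartbeat(self, execution_id: str, worker_id: str, token: int, now: float) -> bool:
        """Extend the lease; fails if it expired or was taken over."""
        lease = self._leases.get(execution_id)
        if (
            lease is None
            or lease.owner != worker_id
            or lease.token != token
            or lease.expires_at <= now
        ):
            return False
        lease.expires_at = now + self.ttl_s
        return True


# Usage: w1 claims the execution, keeps it alive, then stops heartbeating.
table = LeaseTable(ttl_s=10.0)
t1 = table.acquire("exec-1", "w1", now=0.0)
assert table.acquire("exec-1", "w2", now=5.0) is None  # still leased to w1
assert table.heartbeat("exec-1", "w1", t1, now=8.0)    # extended to t=18
t2 = table.acquire("exec-1", "w2", now=20.0)           # w1 went silent
assert not table.heartbeat("exec-1", "w1", t1, now=21.0)
```

A crashed worker simply stops heartbeating, so its lease expires and another worker can pick the execution up. That gives at-least-once behavior; the job itself may still run twice if the first worker was only slow, which is why jobs needing exactly-once semantics also need idempotent handling on the receiving side.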
