Design a distributed job scheduler

Q: Design a distributed job scheduler

This question evaluates a candidate's competency in distributed systems and system design, focusing on scheduling semantics, reliability guarantees (at-least-once vs exactly-once), fault tolerance, scalability, and state management for background job orchestration.

Question

Design a distributed job scheduler system that can run background jobs at specific times or on recurring schedules (similar to cron but scalable and fault-tolerant).

The system should support:

One-time jobs scheduled to run at a specific timestamp.
Recurring jobs (e.g., "run every 5 minutes", "run every day at 1 AM").
Reliable execution: each scheduled run executes at least once, and ideally exactly once.
Horizontal scalability and high availability.
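
To make the two schedule types concrete, here is a minimal sketch of how a scheduler might compute the next firing time. The class names (OneTime, Every, DailyAt) and the next_run function are hypothetical, chosen for illustration only. The sketch assumes all timestamps are timezone-aware UTC and ignores cron syntax, local time zones, and daylight saving.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone
from typing import Optional, Union


@dataclass(frozen=True)
class OneTime:
    run_at: datetime          # fire once at this timestamp


@dataclass(frozen=True)
class Every:
    interval: timedelta       # e.g. timedelta(minutes=5)
    anchor: datetime          # first firing time; later firings are anchor + k * interval


@dataclass(frozen=True)
class DailyAt:
    at: time                  # time of day, interpreted as UTC in this sketch


Schedule = Union[OneTime, Every, DailyAt]


def next_run(schedule: Schedule, after: datetime) -> Optional[datetime]:
    """Return the first firing time strictly after `after`, or None if none remains."""
    if isinstance(schedule, OneTime):
        return schedule.run_at if schedule.run_at > after else None
    if isinstance(schedule, Every):
        if after < schedule.anchor:
            return schedule.anchor
        # Whole intervals elapsed since the anchor, plus one to move past `after`.
        ticks = (after - schedule.anchor) // schedule.interval + 1
        return schedule.anchor + ticks * schedule.interval
    if isinstance(schedule, DailyAt):
        candidate = datetime.combine(after.date(), schedule.at, tzinfo=timezone.utc)
        if candidate <= after:
            candidate += timedelta(days=1)
        return candidate
    raise TypeError(f"unsupported schedule: {schedule!r}")
```

Computing each firing from a fixed anchor, rather than from the time the previous run finished, is one way to keep a recurring job from drifting when a run starts late.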

Assume clients (internal services or users) can:

Create, update, and delete jobs.
Query job status and execution history.
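
The client operations above can be sketched as a small in-memory store. Everything here is an assumption made for illustration: the names JobStore, Job, and Execution, the string-typed schedule field, and the choice to soft-delete jobs so their execution history stays queryable. A real design would back this with a durable database.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class JobState(Enum):
    ACTIVE = "active"
    DELETED = "deleted"


class ExecState(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Execution:
    scheduled_for: str        # ISO-8601 timestamp of the run this record describes
    state: ExecState


@dataclass
class Job:
    job_id: str
    schedule: str             # opaque here, e.g. "every 5 minutes"
    payload: dict
    state: JobState = JobState.ACTIVE
    version: int = 1          # bumped on every update
    history: List[Execution] = field(default_factory=list)


class JobStore:
    """In-memory stand-in for a durable job table (hypothetical)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self, schedule: str, payload: dict) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(job_id, schedule, payload)
        return job_id

    def update(self, job_id: str, *, schedule: Optional[str] = None,
               payload: Optional[dict] = None) -> None:
        job = self._jobs[job_id]
        if schedule is not None:
            job.schedule = schedule
        if payload is not None:
            job.payload = payload
        job.version += 1

    def delete(self, job_id: str) -> None:
        # Soft delete: the job stops being scheduled but its history remains.
        self._jobs[job_id].state = JobState.DELETED

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def record_execution(self, job_id: str, scheduled_for: str, state: ExecState) -> None:
        self._jobs[job_id].history.append(Execution(scheduled_for, state))

    def history(self, job_id: str) -> List[Execution]:
        return list(self._jobs[job_id].history)
```

The version counter is included because a scheduler that has already planned a run needs some way to notice that the job definition changed underneath it.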

Design the system end-to-end. Cover:

High-level architecture and main components.
How you store job definitions and schedules.
How scheduling (deciding when a job should run) is done in a distributed setting.
How workers pick up and execute jobs.
How to ensure fault tolerance and avoid duplicate executions as much as possible.
How to scale the system as the number of jobs and the execution frequency grow.
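
One common way to approach the worker pickup and duplicate-avoidance points above is a lease: a worker claims a due run for a limited time, and the run becomes claimable again if the lease expires before completion. The sketch below is an assumption-laden illustration, not a prescribed answer. RunQueue and its methods are hypothetical names, the threading.Lock stands in for an atomic conditional update in a real database, and the run ID format ("job ID plus scheduled time") is one possible idempotency key.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Run:
    run_id: str                       # e.g. "<job_id>:<scheduled_time>", used as idempotency key
    due_at: float
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0
    done: bool = False


class RunQueue:
    """In-memory stand-in for a table of pending runs (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stands in for an atomic compare-and-set in a database
        self._runs: Dict[str, Run] = {}

    def enqueue(self, run_id: str, due_at: float) -> bool:
        """Insert a run once; a second scheduler enqueueing the same run_id is a no-op."""
        with self._lock:
            if run_id in self._runs:
                return False
            self._runs[run_id] = Run(run_id, due_at)
            return True

    def claim(self, worker_id: str, now: float, lease_seconds: float) -> Optional[str]:
        """Lease the earliest due run that is unclaimed or whose lease has expired."""
        with self._lock:
            for run in sorted(self._runs.values(), key=lambda r: r.due_at):
                if run.done or run.due_at > now:
                    continue
                if run.lease_owner is None or run.lease_expires_at <= now:
                    run.lease_owner = worker_id
                    run.lease_expires_at = now + lease_seconds
                    return run.run_id
            return None

    def complete(self, run_id: str, worker_id: str, now: float) -> bool:
        """Mark a run done only if the caller still holds an unexpired lease."""
        with self._lock:
            run = self._runs[run_id]
            if run.done or run.lease_owner != worker_id or run.lease_expires_at <= now:
                return False
            run.done = True
            return True
```

This gives at-least-once behavior rather than exactly-once: a worker that stalls past its lease may still finish the job's side effects after another worker has reclaimed the run. That is why the job handler itself would usually need to be idempotent, for example by deduplicating on the run ID.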
