Design a distributed cron job scheduler. Requirements: support cron expressions and time zones (including DST), high availability with leader election and failover, horizontal scalability to millions of scheduled jobs, exactly-once or at-least-once execution semantics, idempotency and deduplication, misfire handling and catch-up policies, retry with backoff, pause/resume, job dependencies, worker registration and heartbeats, multi-tenant isolation and quotas, authentication/authorization, auditing, and observability (metrics, logs, traces, alerts). Describe components (metadata store, scheduler, dispatcher, workers, timing wheels or hierarchical timers, message bus), data models, scheduling algorithms, and consistency choices. Discuss failure modes (clock skew, partition, worker crash), capacity planning, and testing strategies.

This question evaluates competence in distributed systems architecture, scalable scheduler design, consistency and delivery semantics, fault tolerance, multitenancy, and operational observability.

How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Onsite rounds at DoorDash.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at DoorDash during technical interviews.

Design a distributed cron job scheduler | DoorDash Interview Question

System Design: Distributed Cron Job Scheduler

Context

You are designing a multi-tenant, internet-scale cron job scheduler that triggers tasks according to cron expressions across different time zones. The system must support millions of scheduled jobs, be highly available, and ensure reliable dispatch and execution across a fleet of workers.

Functional Requirements

Cron expressions
- Support standard cron syntax and IANA time zones, including daylight saving time (DST) correctness.
Execution semantics
- Exactly-once or at-least-once delivery semantics (explain trade-offs).
- Idempotency and deduplication across retries and failovers.
- Misfire handling and catch-up policies (e.g., skip, fire-latest, fire-all, bounded catch-up).
- Retries with backoff and jitter.
- Pause/resume jobs; optional start/end time windows.
- Job dependencies (e.g., run B after A completes; DAG support is a plus).
Worker lifecycle
- Worker registration, capability discovery, and heartbeats.
- Safe re-assignment on worker failure.
Multi-tenancy
- Tenant isolation, quotas, priorities, and fair sharing.
Security and governance
- Authentication and authorization (RBAC or ABAC).
- Auditing of API changes and executions.
Observability
- Metrics, logs, traces, and alerting for SLOs and failure detection.

Non-Functional Requirements

High availability with leader election and failover (no single point of failure).
Horizontal scalability to millions of jobs and tens of thousands of triggers per second.
Low and predictable scheduling latency (e.g., p99 < a few seconds) and bounded scheduler lag.
Strong durability for job metadata and execution records.

What to Deliver

Describe the following:

Architecture and components (metadata store, scheduler shards, dispatcher, workers, timing wheels or hierarchical timers, message bus, coordination service).
Data models (jobs, schedules, executions, attempts, dependencies, workers, tenants, quotas, audit logs).
Scheduling and dispatch algorithms (cron parsing, time zone/DST handling, timer data structures, sharding strategy).
Consistency choices and delivery semantics (exactly-once vs at-least-once, idempotency, dedup).
Failure modes and mitigations (clock skew, network partition, leader/scheduler/worker crashes).
Capacity planning (orders of magnitude, partitioning, indexes, throughput estimates).
Testing strategies (unit, property, integration, chaos, time-travel, scale testing).

Functional Requirements

Cron expressions

Support standard cron syntax and IANA time zones, including daylight saving time (DST) correctness.

Execution semantics

Exactly-once or at-least-once delivery semantics (explain trade-offs).
Idempotency and deduplication across retries and failovers.
Misfire handling and catch-up policies (e.g., skip, fire-latest, fire-all, bounded catch-up).
Retries with backoff and jitter.
Pause/resume jobs; optional start/end time windows.
Job dependencies (e.g., run B after A completes; DAG support is a plus).

Worker lifecycle

Worker registration, capability discovery, and heartbeats.
Safe re-assignment on worker failure.

Multi-tenancy

Tenant isolation, quotas, priorities, and fair sharing.

Security and governance

Authentication and authorization (RBAC or ABAC).
Auditing of API changes and executions.

Observability

Metrics, logs, traces, and alerting for SLOs and failure detection.

Non-Functional Requirements

High availability with leader election and failover (no single point of failure).

Horizontal scalability to millions of jobs and tens of thousands of triggers per second.

Low and predictable scheduling latency (e.g., p99 < a few seconds) and bounded scheduler lag.

Strong durability for job metadata and execution records.

What to Deliver

Describe the following:

Architecture and components (metadata store, scheduler shards, dispatcher, workers, timing wheels or hierarchical timers, message bus, coordination service).

Data models (jobs, schedules, executions, attempts, dependencies, workers, tenants, quotas, audit logs).

Scheduling and dispatch algorithms (cron parsing, time zone/DST handling, timer data structures, sharding strategy).

Consistency choices and delivery semantics (exactly-once vs at-least-once, idempotency, dedup).

Failure modes and mitigations (clock skew, network partition, leader/scheduler/worker crashes).

Capacity planning (orders of magnitude, partitioning, indexes, throughput estimates).

Testing strategies (unit, property, integration, chaos, time-travel, scale testing).

Design a distributed cron job scheduler

Quick Overview

System Design: Distributed Cron Job Scheduler

Context

Functional Requirements

Non-Functional Requirements

What to Deliver

Solution

Submit Your Answer

Design a distributed cron job scheduler

Quick Overview

System Design: Distributed Cron Job Scheduler

Context

Functional Requirements

Non-Functional Requirements

What to Deliver

Solution

Submit Your Answer