Design a distributed job scheduler service

Q: Design a distributed job scheduler service

This is a System Design interview question from Robinhood for Software Engineer roles. View the full question and solution on PracHub.

Q: How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

Question

Design a distributed job scheduling system (microservice-based, running in the cloud) with the following requirements:

Functional Requirements

Create scheduled jobs
- A client can create a job via an API.
- For each job, the client specifies:
  - Job identifier (optional; otherwise generated by the system).
  - A schedule (e.g., cron-style, or "run at time T", or "every X minutes").
  - The task to run (e.g., a script name, container image, or some executable description).
  - Resource requirements (e.g., CPU/memory tier or machine type) at a high level.
  - A timeout value.
  - A timeout handler (what to do if the job exceeds its timeout: kill, mark failed, retry, etc.).
Run jobs reliably at the designated time
- Each job should run as close as possible to its scheduled time.
- The system is distributed : jobs can run on many worker machines / microservices.
- The system must tolerate machine failures and restarts.
Handle jobs that don't finish on time
- If a job exceeds its declared timeout, the system must:
  - Detect this condition.
  - Execute the configured timeout behavior (e.g., kill the job, mark as failed, trigger a handler job, or reschedule, depending on your design).
Query past job status
- Provide an API to query:
  - The status of a specific job run (e.g., PENDING, RUNNING, SUCCESS, FAILED, TIMED_OUT).
  - The history of runs for a job (e.g., last N runs, with timestamps and outcomes).
Query job logs
- Provide an API to fetch logs for a past job run (e.g., stdout/stderr or structured logs).
Distributed microservice model
- Assume the interviewer wants a microservice-based , horizontally scalable architecture.
- Components should be cloud-hosted services (e.g., on containers/VMs) and able to scale independently.

Non-Functional Requirements (assume reasonable scale)

Support on the order of hundreds of thousands of active scheduled jobs.
Jobs may be scheduled with minute-level granularity (or better, if you choose).
System should be fault tolerant : if a scheduler instance or worker crashes, jobs should still eventually run.
Favor reliability over perfect exact timing (e.g., running a job a few seconds late is acceptable, but skipping it completely is not).

Expected Discussion

Describe and justify:

High-level architecture
- What microservices you would define (e.g., API service, job metadata service, scheduler, worker/runner service, log service, etc.).
- How these services communicate (HTTP/REST, message queues, etc.).
Data model and storage
- How you store job definitions and schedules.
- How you store job run history and statuses.
- How and where you store logs (e.g., blob/object storage vs database vs log store).
- Choice of databases (relational vs NoSQL) and why.
Scheduling logic
- How the system determines which jobs to run at what time .
- How you implement a distributed, fault-tolerant scheduler :
  - How many scheduler instances are running.
  - How they share work or elect a leader.
  - How you avoid double-scheduling or missing jobs.
Job execution
- How jobs are dispatched to worker machines.
- How you track job start/end, collect exit codes, and enforce timeouts .
- How you communicate logs back to the system.
Reliability and failure handling
- What happens if a scheduler instance crashes.
- What happens if a worker dies mid-job.
- How you avoid running the same job twice vs. accepting at-least-once execution with idempotent jobs.
- How you recover after a restart (e.g., replay job state from durable storage).
APIs for querying status and logs
- Example REST endpoints for:
  - Creating/updating/deleting jobs.
  - Fetching job configuration.
  - Fetching job run history.
  - Fetching logs for a specific run.
Scalability considerations
- How you scale horizontally when the number of jobs or workers increases.
- Any sharding or partitioning of job data or schedules.

Explain your design step by step, call out trade-offs, and explicitly mention any assumptions you make (e.g., exact vs approximate scheduling accuracy, at-least-once vs exactly-once execution semantics, log retention policies, etc.).

Design a distributed job scheduler service

Functional Requirements

Non-Functional Requirements (assume reasonable scale)

Expected Discussion

Solution

Comments (0)