Design a distributed job scheduling system (microservice-based, running in the cloud) with the following requirements:
Functional Requirements
-
Create scheduled jobs
-
A client can create a job via an API.
-
For each job, the client specifies:
-
Job identifier (optional; otherwise generated by the system).
-
A
schedule
(e.g., cron-style, or "run at time T", or "every X minutes").
-
The
task
to run (e.g., a script name, container image, or some executable description).
-
Resource requirements (e.g., CPU/memory tier or machine type) at a high level.
-
A
timeout
value.
-
A
timeout handler
(what to do if the job exceeds its timeout: kill, mark failed, retry, etc.).
-
Run jobs reliably at the designated time
-
Each job should run as close as possible to its scheduled time.
-
The system is
distributed
: jobs can run on many worker machines / microservices.
-
The system must tolerate machine failures and restarts.
-
Handle jobs that don't finish on time
-
If a job exceeds its declared timeout, the system must:
-
Detect this condition.
-
Execute the configured timeout behavior (e.g., kill the job, mark as failed, trigger a handler job, or reschedule, depending on your design).
-
Query past job status
-
Provide an API to query:
-
The status of a specific job run (e.g., PENDING, RUNNING, SUCCESS, FAILED, TIMED_OUT).
-
The history of runs for a job (e.g., last N runs, with timestamps and outcomes).
-
Query job logs
-
Provide an API to fetch
logs
for a past job run (e.g., stdout/stderr or structured logs).
-
Distributed microservice model
-
Assume the interviewer wants a
microservice-based
, horizontally scalable architecture.
-
Components should be cloud-hosted services (e.g., on containers/VMs) and able to scale independently.
Non-Functional Requirements (assume reasonable scale)
-
Support on the order of
hundreds of thousands
of active scheduled jobs.
-
Jobs may be scheduled with
minute-level granularity
(or better, if you choose).
-
System should be
fault tolerant
: if a scheduler instance or worker crashes, jobs should still eventually run.
-
Favor
reliability over perfect exact timing
(e.g., running a job a few seconds late is acceptable, but skipping it completely is not).
Expected Discussion
Describe and justify:
-
High-level architecture
-
What microservices you would define (e.g., API service, job metadata service, scheduler, worker/runner service, log service, etc.).
-
How these services communicate (HTTP/REST, message queues, etc.).
-
Data model and storage
-
How you store job definitions and schedules.
-
How you store job run history and statuses.
-
How and where you store logs (e.g., blob/object storage vs database vs log store).
-
Choice of databases (relational vs NoSQL) and why.
-
Scheduling logic
-
How the system determines
which jobs to run at what time
.
-
How you implement a
distributed, fault-tolerant scheduler
:
-
How many scheduler instances are running.
-
How they share work or elect a leader.
-
How you avoid double-scheduling or missing jobs.
-
Job execution
-
How jobs are dispatched to worker machines.
-
How you track job start/end, collect exit codes, and enforce
timeouts
.
-
How you communicate logs back to the system.
-
Reliability and failure handling
-
What happens if a scheduler instance crashes.
-
What happens if a worker dies mid-job.
-
How you avoid running the
same job twice
vs. accepting at-least-once execution with idempotent jobs.
-
How you recover after a restart (e.g., replay job state from durable storage).
-
APIs for querying status and logs
-
Example REST endpoints for:
-
Creating/updating/deleting jobs.
-
Fetching job configuration.
-
Fetching job run history.
-
Fetching logs for a specific run.
-
Scalability considerations
-
How you scale horizontally when the number of jobs or workers increases.
-
Any sharding or partitioning of job data or schedules.
Explain your design step by step, call out trade-offs, and explicitly mention any assumptions you make (e.g., exact vs approximate scheduling accuracy, at-least-once vs exactly-once execution semantics, log retention policies, etc.).