Design a job scheduler with SLA and logs

This is a System Design interview question from Robinhood for Software Engineer roles.

Question

Scenario

Design a distributed job scheduler that runs only scheduled (time-based) jobs, such as cron or fixed-interval jobs. Each job execution has an SLA: if the run does not finish within the SLA window, the system must mark it as timed out and raise an error or alert. Users must be able to inspect execution logs.
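The SLA rule in the scenario can be sketched as a pure function over timestamps. This is a minimal illustration, not a prescribed solution. It assumes the SLA window starts at the scheduled fire time; the question does not say whether it starts at fire time or at actual start, which is worth clarifying with the interviewer. The names `sla_deadline` and `is_timed_out` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sla_deadline(window_start: datetime, sla: timedelta) -> datetime:
    """Return the instant after which a run counts as overrunning its SLA."""
    return window_start + sla


def is_timed_out(window_start: datetime, sla: timedelta,
                 finished_at: Optional[datetime], now: datetime) -> bool:
    """True if the run finished late, or is still unfinished past the deadline.

    Assumption: window_start is the scheduled fire time, not the worker start time.
    """
    deadline = sla_deadline(window_start, sla)
    if finished_at is not None:
        return finished_at > deadline
    return now > deadline


if __name__ == "__main__":
    fire = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Unfinished run checked six minutes after a five-minute window opened.
    print(is_timed_out(fire, timedelta(minutes=5), None, fire + timedelta(minutes=6)))
```

A periodic sweeper could apply a check like this to every unfinished run, so detection does not depend on the worker that may have crashed.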

Core requirements

Create/update/delete scheduled jobs.
Trigger job runs at the correct times (support cron-like schedules or fixed intervals).
Execute jobs on a fleet of workers.
Track job run state: PENDING/RUNNING/SUCCEEDED/FAILED/TIMED_OUT/CANCELED.
SLA enforcement per run: detect overruns and surface them as an error/alert.
Provide log viewing per job run.
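One way to make the run states above concrete is a small state machine that rejects illegal transitions. The state names come from the requirements, but the transition table is an assumption for illustration: for example, it lets a PENDING run go straight to TIMED_OUT if no worker picks it up before the SLA deadline. `RunState`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELED = "CANCELED"


# Assumed transition table; states missing from it are terminal.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING, RunState.TIMED_OUT, RunState.CANCELED},
    RunState.RUNNING: {
        RunState.SUCCEEDED,
        RunState.FAILED,
        RunState.TIMED_OUT,
        RunState.CANCELED,
    },
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Treating terminal states as immutable means a late success report from a worker cannot overwrite a TIMED_OUT run, which keeps SLA alerts consistent with the stored state.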

Non-functional requirements (clarify in discussion)

Scale: potentially many jobs and high execution rate.
Reliability: tolerate worker crashes, scheduler crashes, network partitions.
Correctness: avoid missed schedules; minimize duplicate runs (define acceptable semantics).
Observability: metrics, tracing, auditing.
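For the correctness requirement, a common way to minimize duplicate runs is to derive a deterministic key from the job and its scheduled slot, then let only the first writer create the run. The sketch below is an assumption-laden illustration: `run_key` and `RunRegistry` are hypothetical names, and the in-memory set stands in for a unique constraint in a real database.

```python
import hashlib
from datetime import datetime, timezone


def run_key(job_id: str, scheduled_for: datetime) -> str:
    """Deterministic key for one (job, scheduled slot) pair."""
    if scheduled_for.tzinfo is None:
        raise ValueError("scheduled_for must be timezone-aware")
    # Normalize to UTC at second precision so every scheduler derives the same key.
    slot = scheduled_for.astimezone(timezone.utc).replace(microsecond=0).isoformat()
    return hashlib.sha256(f"{job_id}|{slot}".encode("utf-8")).hexdigest()


class RunRegistry:
    """In-memory stand-in for a runs table with a unique constraint on the key."""

    def __init__(self) -> None:
        self._keys: set[str] = set()

    def try_create(self, job_id: str, scheduled_for: datetime) -> bool:
        """Return True if this caller created the run, False if it already existed."""
        key = run_key(job_id, scheduled_for)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

With this shape, two scheduler replicas that both fire the same slot after a failover produce one run: the second insert is rejected instead of creating a duplicate. This gives at-most-once run creation, while execution itself may still need idempotent job logic.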

Follow-ups to address

How to handle failures at each stage (scheduler failure, worker failure, log pipeline failure, DB outages).
Retries/backoff, dead-lettering, and idempotency.
Handling long-running jobs and cancellation.
Handling clock skew, time zones, and daylight saving time.
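The retry, backoff, and dead-letter follow-up can be sketched with capped exponential backoff and random jitter. The defaults (1 s base, 300 s cap, 5 attempts) and the names `backoff_ceiling` and `next_retry_delay` are assumptions chosen for illustration, not values given in the question.

```python
import random
from typing import Optional


def backoff_ceiling(failed_attempts: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Upper bound in seconds for the delay after `failed_attempts` failures (>= 1)."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be >= 1")
    return min(cap, base * 2 ** (failed_attempts - 1))


def next_retry_delay(failed_attempts: int, max_attempts: int = 5,
                     rng: Optional[random.Random] = None) -> Optional[float]:
    """Return a jittered delay in seconds, or None when attempts are exhausted.

    None signals that the run should go to a dead-letter queue instead of retrying.
    """
    if failed_attempts >= max_attempts:
        return None
    rng = rng or random.Random()
    # Pick uniformly in [0, ceiling] so retries from many runs do not align.
    return rng.uniform(0.0, backoff_ceiling(failed_attempts))
```

Because a retry may re-execute work that partly completed, the job itself (or its side effects) should be idempotent; the backoff only controls when the retry happens.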
