Design a job scheduler with SLA and logs
Company: Robinhood
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
## Scenario
Design a **distributed job scheduler** that runs **scheduled (time-based) jobs only** (e.g., cron/interval). Each job execution has an **SLA**: if the run does not finish within the SLA window, the system must **mark it as timed out and raise an error/alert**. Users must be able to **inspect execution logs**.
## Core requirements
- Create/update/delete scheduled jobs.
- Trigger job runs at the correct times (support cron-like schedules or fixed intervals).
- Execute jobs on a fleet of workers.
- Track job run state: `PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, `TIMED_OUT`, `CANCELED`.
- **SLA enforcement** per run: detect overruns and surface as an error/alert.
- Provide **log viewing** per job run.
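
One way to reason about the run states and the SLA check together is a small state machine plus a periodic watchdog pass. The sketch below is illustrative only: the transition table, the `JobRun` fields, and the choice to measure the SLA window from the scheduled time (rather than the actual start time) are assumptions, not part of the prompt.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class RunState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELED = "CANCELED"


# Assumed transition table. States that do not appear as keys are terminal.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING, RunState.CANCELED, RunState.TIMED_OUT},
    RunState.RUNNING: {
        RunState.SUCCEEDED,
        RunState.FAILED,
        RunState.TIMED_OUT,
        RunState.CANCELED,
    },
}


@dataclass
class JobRun:
    run_id: str
    scheduled_at: datetime
    sla: timedelta  # assumption: window is measured from scheduled_at
    state: RunState = RunState.PENDING

    def transition(self, new_state: RunState) -> None:
        """Move to new_state, rejecting edges not in the assumed table."""
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def deadline(self) -> datetime:
        return self.scheduled_at + self.sla


def enforce_sla(run: JobRun, now: datetime) -> bool:
    """Mark a non-terminal run TIMED_OUT once its deadline has passed.

    Returns True when the caller should raise an alert for this run.
    """
    is_terminal = run.state not in ALLOWED
    if not is_terminal and now > run.deadline():
        run.transition(RunState.TIMED_OUT)
        return True
    return False
```

A watchdog could call `enforce_sla` for every open run on a timer. Because terminal runs are skipped, repeating the pass does not raise a second alert for the same run.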
## Non-functional requirements (clarify in discussion)
- Scale: potentially many jobs and high execution rate.
- Reliability: tolerate worker crashes, scheduler crashes, network partitions.
- Correctness: avoid missed schedules; minimize duplicate runs (define acceptable semantics, e.g., at-least-once vs. exactly-once execution).
- Observability: metrics, tracing, auditing.
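
The correctness goal above (no missed schedules, few duplicates) can be sketched as two steps: compute every fire time that has come due since the last checkpoint, then create each run behind a uniqueness constraint so that a second scheduler instance cannot insert the same run twice. This is a minimal sketch under stated assumptions: the `job_runs` table, the function names, and the use of an in-memory SQLite database as a stand-in for the real metadata store are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import List


def due_fire_times(
    anchor: datetime, interval: timedelta, last_fired: datetime, now: datetime
) -> List[datetime]:
    """Return every interval tick in (last_fired, now], aligned to anchor.

    Catch-up is unbounded here; a real scheduler would likely cap it.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # Index of the first tick strictly after last_fired (never before anchor).
    k = max((last_fired - anchor) // interval + 1, 0)
    ticks = []
    t = anchor + k * interval
    while t <= now:
        ticks.append(t)
        t += interval
    return ticks


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the scheduler's metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_runs ("
        " job_id TEXT NOT NULL,"
        " scheduled_at TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'PENDING',"
        " PRIMARY KEY (job_id, scheduled_at))"
    )
    return conn


def try_create_run(
    conn: sqlite3.Connection, job_id: str, scheduled_at: datetime
) -> bool:
    """Insert the run once; return False if it already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO job_runs (job_id, scheduled_at) VALUES (?, ?)",
        (job_id, scheduled_at.isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1
```

With the key `(job_id, scheduled_at)`, a scheduler that restarts and replays the same window creates no new rows, so dispatch can be at-least-once while run creation stays deduplicated.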
## Follow-ups to address
- How to handle failures at each stage (scheduler failure, worker failure, log pipeline failure, DB outages).
- Retries/backoff, dead-lettering, and idempotency.
- Handling long-running jobs and cancellation.
- Handling clock skew, time zones, and daylight saving time transitions.
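
For the retries/backoff and dead-lettering follow-up, one common approach is capped exponential backoff with full jitter, with a hand-off to a dead-letter queue once an attempt budget is exhausted. The sketch below assumes that policy. The function name `plan_retry` and the default values are hypothetical, chosen only for illustration.

```python
import random
from typing import Optional


def plan_retry(
    failures: int,
    max_attempts: int = 5,
    base_s: float = 1.0,
    cap_s: float = 60.0,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Decide what to do after a run has failed `failures` times.

    Returns the delay in seconds before the next attempt, or None when the
    attempt budget is spent and the run should be dead-lettered.
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    if failures >= max_attempts:
        return None  # caller routes the run to a dead-letter queue
    rng = rng if rng is not None else random.Random()
    # Exponential ceiling (base, 2*base, 4*base, ...) capped at cap_s.
    ceiling = min(cap_s, base_s * 2 ** (failures - 1))
    # Full jitter: pick uniformly below the ceiling to spread out retries.
    return rng.uniform(0.0, ceiling)
```

Retries only stay safe if the job itself is idempotent or the worker records a per-run idempotency key, since a retry may follow an attempt that partially completed.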
Quick Answer: This question evaluates understanding of distributed job scheduling, SLA enforcement, observability, and fault-tolerant execution across worker fleets, assessing competencies in scalability, reliability, state tracking, and operational logging within the System Design domain.