Design a distributed job scheduler service
Company: Robinhood
Role: Software Engineer
Category: System Design
Difficulty: easy
Interview Round: Technical Screen
Design a **distributed job scheduling system** (microservice-based, running in the cloud) with the following requirements:
### Functional Requirements
1. **Create scheduled jobs**
- A client can create a job via an API.
- For each job, the client specifies:
- Job identifier (optional; otherwise generated by the system).
- A **schedule** (e.g., cron-style, or "run at time T", or "every X minutes").
- The **task** to run (e.g., a script name, container image, or some executable description).
- Resource requirements (e.g., CPU/memory tier or machine type) at a high level.
- A **timeout** value.
- A **timeout handler** (what to do if the job exceeds its timeout: kill, mark failed, retry, etc.).
2. **Run jobs reliably at the designated time**
- Each job should run as close as possible to its scheduled time.
- The system is **distributed**: jobs can run on many worker machines / microservices.
- The system must tolerate machine failures and restarts.
3. **Handle jobs that don't finish on time**
- If a job exceeds its declared timeout, the system must:
- Detect this condition.
- Execute the configured timeout behavior (e.g., kill the job, mark as failed, trigger a handler job, or reschedule, depending on your design).
4. **Query past job status**
- Provide an API to query:
- The status of a specific job run (e.g., PENDING, RUNNING, SUCCESS, FAILED, TIMED_OUT).
- The history of runs for a job (e.g., last N runs, with timestamps and outcomes).
5. **Query job logs**
- Provide an API to fetch **logs** for a past job run (e.g., stdout/stderr or structured logs).
6. **Distributed microservice model**
- Assume the interviewer wants a **microservice-based**, horizontally scalable architecture.
- Components should be cloud-hosted services (e.g., on containers/VMs) and able to scale independently.
### Non-Functional Requirements (assume reasonable scale)
- Support on the order of **hundreds of thousands** of active scheduled jobs.
- Jobs may be scheduled with **minute-level granularity** (or better, if you choose).
- System should be **fault tolerant**: if a scheduler instance or worker crashes, jobs should still eventually run.
- Favor **reliability over perfect exact timing** (e.g., running a job a few seconds late is acceptable, but skipping it completely is not).
### Expected Discussion
Describe and justify:
1. **High-level architecture**
- What microservices you would define (e.g., API service, job metadata service, scheduler, worker/runner service, log service, etc.).
- How these services communicate (HTTP/REST, message queues, etc.).
2. **Data model and storage**
- How you store job definitions and schedules.
- How you store job run history and statuses.
- How and where you store logs (e.g., blob/object storage vs database vs log store).
- Choice of databases (relational vs NoSQL) and why.
3. **Scheduling logic**
- How the system determines **which jobs to run at what time**.
- How you implement a **distributed, fault-tolerant scheduler**:
- How many scheduler instances are running.
- How they share work or elect a leader.
- How you avoid double-scheduling or missing jobs.
4. **Job execution**
- How jobs are dispatched to worker machines.
- How you track job start/end, collect exit codes, and enforce **timeouts**.
- How you communicate logs back to the system.
5. **Reliability and failure handling**
- What happens if a scheduler instance crashes.
- What happens if a worker dies mid-job.
- How you avoid running the **same job twice** vs. accepting at-least-once execution with idempotent jobs.
- How you recover after a restart (e.g., replay job state from durable storage).
6. **APIs for querying status and logs**
- Example REST endpoints for:
- Creating/updating/deleting jobs.
- Fetching job configuration.
- Fetching job run history.
- Fetching logs for a specific run.
7. **Scalability considerations**
- How you scale horizontally when the number of jobs or workers increases.
- Any sharding or partitioning of job data or schedules.
Explain your design step by step, call out trade-offs, and explicitly mention any assumptions you make (e.g., exact vs approximate scheduling accuracy, at-least-once vs exactly-once execution semantics, log retention policies, etc.).
Quick Answer: This question evaluates understanding of distributed systems and microservice architecture, scheduling and timeout semantics, fault tolerance, scalability, and data modeling for job metadata, run history, and logs.