Design a distributed job scheduler service | Robinhood Interview Question

Company: Robinhood

Role: Software Engineer

Category: System Design

Difficulty: easy

Interview Round: Technical Screen

Design a **distributed job scheduling system**: a cloud-hosted, microservice-based service that lets clients register tasks to run on a schedule and runs them reliably across a fleet of worker machines.

A client creates a **job** through an API. Each job specifies:

- A **job identifier** (optional; the system generates one if omitted).
- A **schedule**: cron-style, "run once at time $T$", or "every $X$ minutes".
- The **task** to run: a script name, a container image, or some executable description.
- **Resource requirements** at a high level (e.g. a CPU/memory tier or machine class).
- A **timeout** value.
- A **timeout handler**: what to do if the run exceeds its timeout (kill, mark failed, retry, trigger another job, etc.).

The system must:

1. **Run each job reliably at (or near) its scheduled time** across many worker machines, tolerating machine failures and restarts.
2. **Handle jobs that overrun their timeout**: detect the overrun and apply the configured timeout behavior.
3. **Let clients query past run status**: the status of a specific run (`PENDING`, `RUNNING`, `SUCCESS`, `FAILED`, `TIMED_OUT`, …) and the run history for a job (e.g. the last $N$ runs with timestamps and outcomes).
4. **Let clients query logs** (stdout/stderr or structured logs) for a past run.

Design the architecture, data model, scheduling logic, execution path, failure handling, and query APIs. Explain your design step by step, justify each major choice, and call out trade-offs.

```hint Where to start
Split a latency-sensitive **control plane** (job CRUD, status/log queries) from a high-volume, asynchronous **data plane** (dispatching and running jobs). Sketch the services first (an API/metadata service, a scheduler, a durable queue, a worker/executor pool, a log path), then trace the data flow of a single job from "client creates it" to "logs are queryable."
```

```hint The scheduling invariant
The scheduler's hard problem is "never miss a run, never double-fire" without distributed consensus per job. Think about denormalizing a `next_run_at` onto each job row so the hot path is one indexed range scan, and making run-creation **idempotent**: insert the run row and advance `next_run_at` in the *same* transaction, with a natural key + `ON CONFLICT DO NOTHING` to absorb retries.
```

```hint Distributing the scheduler safely
To run more than one scheduler instance, consider **partitioning** the job space (hash into $P$ partitions, assign partitions to instances via a coordinator) instead of a single leader, and use `SELECT ... FOR UPDATE SKIP LOCKED` as a belt-and-suspenders guard so a brief ownership overlap can't process the same row twice.
```

```hint Execution semantics: be honest
True exactly-once across enqueue → dispatch → a side-effecting container is impossible without the task's cooperation. Aim for **at-least-once + idempotent tasks**: a durable queue (redelivery), a conditional `PENDING → RUNNING` claim to drop duplicate deliveries, a lease/heartbeat so a watchdog can reap dead workers, and a stable per-run idempotency key handed to the task.
```

```hint Logs and run history at scale
Estimate the bytes. Logs dominate (tens of KB per run × runs/sec); they belong in **object storage**, never the metadata DB, with a lifecycle/retention rule. Run history grows unbounded, so think about **time-partitioning** the runs table and dropping old partitions.
```
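
One way to make the scheduling-invariant hint concrete (insert the run row and advance `next_run_at` in the same transaction, with a natural key absorbing retries) is sketched below. This is a minimal illustration, not a reference design: it assumes fixed-interval jobs only, uses an in-process SQLite database as a stand-in for the metadata store, and the table names, column names, and the `tick` function are all hypothetical. SQLite has no `SELECT ... FOR UPDATE SKIP LOCKED`, so that guard is not shown.

```python
import sqlite3

# Hypothetical schema: (job_id, scheduled_for) is the natural key of a run.
SCHEDULER_SCHEMA = """
CREATE TABLE jobs (
    job_id      TEXT PRIMARY KEY,
    interval_s  INTEGER NOT NULL,
    next_run_at INTEGER NOT NULL          -- denormalized next fire time (epoch seconds)
);
CREATE INDEX jobs_due ON jobs (next_run_at);  -- makes the due-job scan an indexed range scan
CREATE TABLE runs (
    job_id        TEXT NOT NULL,
    scheduled_for INTEGER NOT NULL,
    status        TEXT NOT NULL DEFAULT 'PENDING',
    PRIMARY KEY (job_id, scheduled_for)
);
"""


def tick(conn: sqlite3.Connection, now: int) -> int:
    """Create one run per due job and advance next_run_at; return runs created."""
    created = 0
    with conn:  # single transaction: commits on success, rolls back on error
        due = conn.execute(
            "SELECT job_id, interval_s, next_run_at FROM jobs "
            "WHERE next_run_at <= ? ORDER BY next_run_at",
            (now,),
        ).fetchall()
        for job_id, interval_s, slot in due:
            # A replayed tick hits the primary key and inserts nothing.
            cur = conn.execute(
                "INSERT INTO runs (job_id, scheduled_for) VALUES (?, ?) "
                "ON CONFLICT (job_id, scheduled_for) DO NOTHING",
                (job_id, slot),
            )
            created += cur.rowcount
            # Advances one slot per tick; catch-up policy is a separate decision.
            conn.execute(
                "UPDATE jobs SET next_run_at = ? WHERE job_id = ?",
                (slot + interval_s, job_id),
            )
    return created
```

Because the insert and the `next_run_at` update commit together, a crash leaves either both or neither applied. If the same slot is ever replayed (for example after a restore that rewinds `next_run_at`), the primary key on `(job_id, scheduled_for)` turns the second insert into a no-op.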

Quick Answer: This question evaluates understanding of distributed systems and microservice architecture, scheduling and timeout semantics, fault tolerance, scalability, and data modeling for job metadata, run history, and logs.

### Constraints & Assumptions

- **Scale:** on the order of **hundreds of thousands** of active scheduled jobs.
- **Granularity:** at least **minute-level** scheduling (finer is acceptable if your design supports it).
- **Reliability over exact timing:** running a job a few seconds late is acceptable; **skipping a run entirely is a bug**. This priority should drive your delivery-semantics choice.
- **Fault tolerance:** if a scheduler instance or worker crashes or restarts, every due job must still eventually run.
- **Architecture:** microservice-based, cloud-hosted (containers/VMs), with components that scale independently.
- Treat anything not specified above (timezone/DST handling, log retention period, exactly-once vs at-least-once, missed-window catch-up policy, security/multi-tenancy depth) as an assumption you state explicitly.
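
The scale and granularity constraints can be turned into a rough upper bound with a few lines of arithmetic. The sketch below is illustrative only. Every input is an assumption chosen to fit the stated ranges: 300,000 active jobs, the worst case where every job fires once a minute, 20 KB of logs per run, and 10% of jobs sharing one fire instant. The `estimate` function and its parameter names are hypothetical.

```python
def estimate(active_jobs: int, mean_interval_s: int, log_bytes_per_run: int,
             burst_percent: int = 10) -> dict:
    """Back-of-envelope load figures from assumed inputs."""
    runs_per_sec = active_jobs / mean_interval_s
    runs_per_day = runs_per_sec * 86_400
    return {
        "runs_per_sec": runs_per_sec,                        # steady-state average
        "runs_per_day": runs_per_day,                        # run-history rows added per day
        "log_bytes_per_day": runs_per_day * log_bytes_per_run,
        "burst_runs": active_jobs * burst_percent // 100,    # runs due at one instant
    }


# Assumed inputs: 300,000 jobs, all on a 60-second interval, 20,000 bytes of logs each.
figures = estimate(active_jobs=300_000, mean_interval_s=60, log_bytes_per_run=20_000)
# runs_per_sec      -> 5000.0
# runs_per_day      -> 432,000,000
# log_bytes_per_day -> 8.64e12 (about 8.64 TB)
# burst_runs        -> 30,000
```

Under these assumed inputs the log volume is several orders of magnitude larger than the job metadata, which is the kind of result that motivates keeping logs out of the metadata database. Real inputs should come from the clarifying questions, not from this sketch.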

### Clarifying Questions to Ask

- What execution semantics are required: **exactly-once**, or is **at-least-once with idempotent tasks** acceptable? (This is the single biggest fork in the design.)
- On a **missed window** after an outage (job's `next_run_at` is now in the past), should we **fire once and skip ahead**, or **backfill every missed slot**? Is this per-job configurable?
- How accurate must timing be: is "within a few seconds" of the scheduled time good enough, or are there sub-second SLAs?
- What is the expected **task duration distribution** and **peak concurrency**? (This sizes the worker fleet independently of job count.)
- What are the **log retention** and **run-history retention** requirements?
- How rich are the **resource requirements**: a few coarse tiers, or fine-grained CPU/memory/GPU bin-packing?
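
The missed-window question is easier to discuss with the two policies written out. The sketch below assumes a fixed-interval job with times in epoch seconds. The function name `due_slots` and the policy labels `"backfill"` and `"fire_once"` are hypothetical. Running the most recent missed slot under `"fire_once"` is one possible interpretation of "fire once and skip ahead", not something the question specifies.

```python
from typing import List, Tuple


def due_slots(next_run_at: int, interval_s: int, now: int,
              policy: str) -> Tuple[List[int], int]:
    """Return (slots to run now, new next_run_at) for a fixed-interval job."""
    if next_run_at > now:
        return [], next_run_at  # nothing is due yet

    # Number of scheduled slots at or before `now`, including next_run_at itself.
    missed = (now - next_run_at) // interval_s + 1
    new_next_run_at = next_run_at + missed * interval_s

    if policy == "backfill":
        # One run per missed slot, oldest first.
        slots = [next_run_at + i * interval_s for i in range(missed)]
    elif policy == "fire_once":
        # A single run for the most recent missed slot; older slots are dropped.
        slots = [next_run_at + (missed - 1) * interval_s]
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    return slots, new_next_run_at
```

For a job due at second 0 with a 60-second interval, checked at second 200, `"backfill"` yields slots `[0, 60, 120, 180]` and `"fire_once"` yields `[180]`; both advance `next_run_at` to 240. A long outage makes the backfill list large, so a real design would probably cap it.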

### What a Strong Answer Covers

- **Requirements & scope discipline:** restates functional/non-functional requirements, states the execution-semantics decision up front, and explicitly sets aside non-goals (DAG orchestration, secrets pipeline, etc.).
- **Sizing that drives decisions:** back-of-envelope for runs/sec (including the **bursty minute-boundary** cluster), metadata write volume, run-history growth, and log throughput, used to justify batching, partitioning, and object-store-for-logs.
- **Clean architecture:** named services with clear responsibilities, a **control-plane vs data-plane** split, and synchronous (REST/gRPC) vs asynchronous (durable queue) communication choices justified.
- **Data model & storage choices:** relational (OLTP) for job + run state with the correctness primitives that motivate it (transactions, unique-key dedup, `SKIP LOCKED`); object storage for logs; a clearly defined **run identity** and the indexes that make the hot scheduling query fast.
- **Scheduling correctness:** the tick loop, the "insert run + advance `next_run_at` in one transaction" invariant, distributing the scheduler (partitioning vs single leader), and an explicit argument for **why neither double-scheduling nor missed runs occur**.
- **Execution & timeouts:** atomic claim of a run, lease/heartbeat, local timeout enforcement plus the four configurable timeout actions, and log capture.
- **Failure handling:** scheduler crash, worker death mid-run, lost queue message, duplicate delivery, DB failover, full restart, each with its recovery path; a **watchdog/reaper** as the self-healing backstop.
- **Query APIs:** concrete REST endpoints for job CRUD, run history (with pagination), single-run status, and log fetch (ideally via pre-signed URL).
- **Scalability & trade-offs:** how each tier scales, sharding/partitioning of job data, and an honest discussion of the central bottleneck and the at-least-once-vs-exactly-once trade-off.
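
The atomic claim, lease/heartbeat, and watchdog/reaper items above can be sketched as three conditional updates. This is a simplified illustration under stated assumptions: an in-process SQLite database stands in for the metadata store, times are epoch seconds passed in by the caller, and the schema and the functions `claim`, `heartbeat`, and `reap_expired` are hypothetical. Resetting an expired run straight back to `PENDING` is also a simplification; a fuller design might record a failed attempt and create a separate retry.

```python
import sqlite3

RUNS_SCHEMA = """
CREATE TABLE runs (
    run_id           TEXT PRIMARY KEY,
    status           TEXT NOT NULL DEFAULT 'PENDING',
    worker_id        TEXT,
    lease_expires_at INTEGER
);
"""


def claim(conn, run_id: str, worker_id: str, now: int, lease_s: int) -> bool:
    """Conditional PENDING -> RUNNING transition; a duplicate delivery loses."""
    with conn:
        cur = conn.execute(
            "UPDATE runs SET status = 'RUNNING', worker_id = ?, lease_expires_at = ? "
            "WHERE run_id = ? AND status = 'PENDING'",
            (worker_id, now + lease_s, run_id),
        )
    return cur.rowcount == 1


def heartbeat(conn, run_id: str, worker_id: str, now: int, lease_s: int) -> bool:
    """Extend the lease; returns False if this worker no longer owns the run."""
    with conn:
        cur = conn.execute(
            "UPDATE runs SET lease_expires_at = ? "
            "WHERE run_id = ? AND worker_id = ? AND status = 'RUNNING'",
            (now + lease_s, run_id, worker_id),
        )
    return cur.rowcount == 1


def reap_expired(conn, now: int) -> int:
    """Watchdog pass: return runs with lapsed leases to PENDING for redelivery."""
    with conn:
        cur = conn.execute(
            "UPDATE runs SET status = 'PENDING', worker_id = NULL, lease_expires_at = NULL "
            "WHERE status = 'RUNNING' AND lease_expires_at < ?",
            (now,),
        )
    return cur.rowcount
```

A worker whose `heartbeat` returns `False` has lost ownership and should stop. That does not by itself stop a partitioned worker that cannot reach the database, which is why the task still needs to be idempotent.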

### Follow-up Questions

- How exactly do you guarantee a job is **not double-run**, and where in your design can a duplicate still slip through? Walk through the specific crash/partition windows and what absorbs each.
- A **network-partitioned but still-alive** worker keeps running its container while its lease lapses. The watchdog re-enqueues a retry. How do you keep this concurrent double-run from corrupting external state?
- The minute-boundary burst (e.g. 10% of jobs fire at the top of the hour) creates a **thundering herd**. How do you keep that from overwhelming the metadata DB and the worker fleet?
- How would you support **per-job resource requirements** without building a full bin-packing scheduler? When would coarse tiers stop being enough?
- How would you extend this to **job dependencies / DAGs** (run B only after A succeeds), and why is that a meaningfully different system?
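
One possible answer to the thundering-herd question is deterministic per-job jitter, which spreads jobs that share a minute boundary across a short window. The sketch below is only one option among several (batching and queue-side rate limiting are others). The function names and the 30-second default window are assumptions. It relies on the stated constraint that running a few seconds late is acceptable.

```python
import hashlib


def jitter_offset_s(job_id: str, window_s: int) -> int:
    """Stable offset in [0, window_s) derived from the job id."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_s


def dispatch_time(scheduled_for: int, job_id: str, window_s: int = 30) -> int:
    """Time at which the run is actually enqueued, shifted by the job's jitter."""
    return scheduled_for + jitter_offset_s(job_id, window_s)
```

Hashing the job id instead of drawing a random number keeps the offset the same on every run and on every scheduler instance, so a job's cadence stays regular and a retried tick computes the same dispatch time. The run's identity should still use the unshifted `scheduled_for`, so jitter does not change the dedup key.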
