PracHub
QuestionsPremiumCoachesLearningGuidesInterview Prep
|Home/System Design/Robinhood

Design a distributed job scheduler

Last updated: Jun 17, 2026

Quick Overview

This question evaluates understanding of distributed systems, reliable job scheduling and dispatch, fault-tolerant worker coordination (heartbeats and recovery), durable data modeling for recurring and one-time jobs, API design, prioritization and retry semantics, and scalable observability for execution logs and audit history.

  • Robinhood
  • System Design
  • Software Engineer

Design a distributed job scheduler

Company: Robinhood

Role: Software Engineer

Category: System Design

Interview Round: Onsite

Design a **distributed job scheduling system** for internal engineering teams (think "cron as a service"). Teams register jobs through an API; the platform reliably triggers them at the right time, runs them on a pool of workers, and lets owners observe, retry, and manage them. The system must support: - **One-time jobs** scheduled for a specific future timestamp. - **Recurring jobs** with cron-like schedules. - **Per-job priority**. - **Retries with exponential backoff** and a dead-letter path after max attempts. - **Pause, resume, and cancel** operations on a job. - **Job status and run-history queries**. - **Worker heartbeats and failure recovery** (a crashed worker's job must not be lost). - **Execution logs and audit history** that owners can search. Walk through the **APIs**, the **data model**, how **due jobs are selected and dispatched**, how **workers safely claim jobs**, how the system **recovers from scheduler and worker crashes without losing jobs**, and how to **store and search execution logs** without overloading the primary transactional database. ```hint Where to start — what do you persist? A recurring "every minute" job runs forever, so you clearly can't store a row per future occurrence. What is the *minimal* durable record that still guarantees no run is ever lost? And as a run actually executes, what transient state shows up — which worker has it, which attempt this is, did it succeed — that doesn't belong on the long-lived record? Drawing this line cleanly is what makes recurring jobs cheap and crash recovery tractable; getting it wrong forces you to either duplicate definitions or lose per-run history. ``` ```hint Finding what's due — at a few hundred/sec Suppose you scan the whole job table every second asking "what's due now?" As the table grows, what fraction of every scan is wasted re-reading jobs scheduled for next week or next year, and what breaks first? Push on three pressures: how to stop re-reading cold far-future rows, how to spread the scanning across many scheduler instances without two of them firing the same job, and how to get sub-second trigger precision once you've narrowed down to the handful of jobs due in the next few seconds. ``` ```hint Handing a job to a worker — and getting it back if the worker dies A worker grabs a job and then its process vanishes mid-run. A permanently-held lock would strand that job forever, but blindly handing it to a second worker risks running it twice at once. What mechanism lets ownership *expire* and be re-claimed by someone else only after the original owner has clearly gone silent — and how does a still-healthy long-running worker signal "I'm alive, don't take this from me"? Once you admit that a job can be re-delivered after a crash, what does that force you to promise about delivery, and what must the job's handler then tolerate? ``` ```hint Two failure modes that bite later (1) Creating a run involves two stores: you record it durably, then you push it onto a queue for a worker. What happens if the process dies *between* those two steps? Reason about how to make "record it" and "enqueue it" recoverable so you never end up with a lost trigger or a phantom one. (2) Execution logs are orders of magnitude larger than the job metadata. If you put them in the transactional database, it bloats fast — so where do the bulk logs live instead, what tiny reference stays behind so owners can still find them, and how do you keep them *searchable* by job, tenant, and error? ``` ### Constraints & Assumptions - Scale: **millions of jobs triggered per day** (assume a few hundred to low thousands of triggers/second at peak), tens of thousands of distinct job definitions, and a worker fleet in the hundreds. - **Low trigger latency** for near-term jobs: a job due at time $T$ should start within a small bounded delay (e.g. a couple of seconds), not minutes. - **Durability is non-negotiable**: once the API acknowledges a job, it must never be silently dropped, even across scheduler/worker/DB restarts. - The platform is **horizontally scalable** and multi-tenant (jobs belong to teams/`tenant_id`); one noisy team must not starve others. - Job payloads are small (a command, container ref, or HTTP target + args); the work itself runs inside the worker, not the scheduler. - Execution logs (stdout/stderr/structured events) can be **orders of magnitude larger** than the job metadata. ### Clarifying Questions to Ask - What delivery semantics are acceptable — at-least-once with idempotent handlers, or is exactly-once a hard requirement? - How precise must trigger timing be (sub-second, single-digit seconds, best-effort minutes)? - Are payloads bounded in size, and does the scheduler execute the work or only dispatch it to workers? - What is the retention requirement for execution logs and audit history? - Is the system multi-tenant, and do we need fairness/isolation guarantees between teams? - How long can the longest-running job run, and do we need to support job timeouts and cancellation of in-flight runs? ### What a Strong Answer Covers - A principled split between the **durable job record** and the **per-run record**, and why that split keeps recurring jobs cheap. - A **scheduling strategy** that avoids full-table scans and explains the latency/cost trade-off versus naive polling. - A concrete **data model** with the fields that drive scheduling, claiming, retries, and status. - The **read/write path**: durable write before acknowledging creation; how due jobs reach workers; how recurring jobs compute the next occurrence. - **Worker safety**: how a job is claimed, how a crashed worker's job is recovered, and an explicit, defended **delivery-guarantee** argument. - **Failure recovery** for scheduler, worker, queue, and DB — including how scheduler work is divided and the write-to-two-stores consistency problem. - **Scaling levers**: partitioned work, priority without starvation, cold/hot job separation, per-tenant fairness, worker autoscaling. - **Log/observability design** that keeps bulk logs out of the OLTP database while remaining searchable, plus the key metrics to alert on. ### Follow-up Questions - A team's recurring job runs every minute but each run takes 90 seconds. How do you handle overlapping runs, and what policy do you expose (skip, queue, allow-concurrent)? - A bug causes thousands of jobs to fail and retry simultaneously, creating a retry storm. How do you prevent it from overwhelming the workers and the downstream services they call? - How do you guarantee high-priority jobs are favored without permanently starving low-priority jobs? - How would you add support for **job dependencies / DAGs** (job B runs only after job A succeeds) on top of this design?

Quick Answer: This question evaluates understanding of distributed systems, reliable job scheduling and dispatch, fault-tolerant worker coordination (heartbeats and recovery), durable data modeling for recurring and one-time jobs, API design, prioritization and retry semantics, and scalable observability for execution logs and audit history.

Related Interview Questions

  • Design a Photo Album App - Robinhood (medium)
  • Design a Photo Management Service - Robinhood (medium)
  • Design a job scheduler with SLA and logs - Robinhood (medium)
  • Design authorization and audit logging systems - Robinhood (medium)
  • Design an At-Most-Once Job Scheduler - Robinhood (medium)
Robinhood logo
Robinhood
Apr 3, 2026, 12:00 AM
Software Engineer
Onsite
System Design
11
0
Loading...

Design a distributed job scheduling system for internal engineering teams (think "cron as a service"). Teams register jobs through an API; the platform reliably triggers them at the right time, runs them on a pool of workers, and lets owners observe, retry, and manage them.

The system must support:

  • One-time jobs scheduled for a specific future timestamp.
  • Recurring jobs with cron-like schedules.
  • Per-job priority .
  • Retries with exponential backoff and a dead-letter path after max attempts.
  • Pause, resume, and cancel operations on a job.
  • Job status and run-history queries .
  • Worker heartbeats and failure recovery (a crashed worker's job must not be lost).
  • Execution logs and audit history that owners can search.

Walk through the APIs, the data model, how due jobs are selected and dispatched, how workers safely claim jobs, how the system recovers from scheduler and worker crashes without losing jobs, and how to store and search execution logs without overloading the primary transactional database.

Constraints & Assumptions

  • Scale: millions of jobs triggered per day (assume a few hundred to low thousands of triggers/second at peak), tens of thousands of distinct job definitions, and a worker fleet in the hundreds.
  • Low trigger latency for near-term jobs: a job due at time TTT should start within a small bounded delay (e.g. a couple of seconds), not minutes.
  • Durability is non-negotiable : once the API acknowledges a job, it must never be silently dropped, even across scheduler/worker/DB restarts.
  • The platform is horizontally scalable and multi-tenant (jobs belong to teams/ tenant_id ); one noisy team must not starve others.
  • Job payloads are small (a command, container ref, or HTTP target + args); the work itself runs inside the worker, not the scheduler.
  • Execution logs (stdout/stderr/structured events) can be orders of magnitude larger than the job metadata.

Clarifying Questions to Ask

  • What delivery semantics are acceptable — at-least-once with idempotent handlers, or is exactly-once a hard requirement?
  • How precise must trigger timing be (sub-second, single-digit seconds, best-effort minutes)?
  • Are payloads bounded in size, and does the scheduler execute the work or only dispatch it to workers?
  • What is the retention requirement for execution logs and audit history?
  • Is the system multi-tenant, and do we need fairness/isolation guarantees between teams?
  • How long can the longest-running job run, and do we need to support job timeouts and cancellation of in-flight runs?

What a Strong Answer Covers

  • A principled split between the durable job record and the per-run record , and why that split keeps recurring jobs cheap.
  • A scheduling strategy that avoids full-table scans and explains the latency/cost trade-off versus naive polling.
  • A concrete data model with the fields that drive scheduling, claiming, retries, and status.
  • The read/write path : durable write before acknowledging creation; how due jobs reach workers; how recurring jobs compute the next occurrence.
  • Worker safety : how a job is claimed, how a crashed worker's job is recovered, and an explicit, defended delivery-guarantee argument.
  • Failure recovery for scheduler, worker, queue, and DB — including how scheduler work is divided and the write-to-two-stores consistency problem.
  • Scaling levers : partitioned work, priority without starvation, cold/hot job separation, per-tenant fairness, worker autoscaling.
  • Log/observability design that keeps bulk logs out of the OLTP database while remaining searchable, plus the key metrics to alert on.

Follow-up Questions

  • A team's recurring job runs every minute but each run takes 90 seconds. How do you handle overlapping runs, and what policy do you expose (skip, queue, allow-concurrent)?
  • A bug causes thousands of jobs to fail and retry simultaneously, creating a retry storm. How do you prevent it from overwhelming the workers and the downstream services they call?
  • How do you guarantee high-priority jobs are favored without permanently starving low-priority jobs?
  • How would you add support for job dependencies / DAGs (job B runs only after job A succeeds) on top of this design?

Solution

Show

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Robinhood•More Software Engineer•Robinhood Software Engineer•Robinhood System Design•Software Engineer System Design
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.