PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches
|Home/System Design/Robinhood

Design a distributed job scheduler service

Last updated: May 3, 2026

Quick Overview

This question evaluates understanding of distributed systems and microservice architecture, scheduling and timeout semantics, fault tolerance, scalability, and data modeling for job metadata, run history, and logs.

  • easy
  • Robinhood
  • System Design
  • Software Engineer

Design a distributed job scheduler service

Company: Robinhood

Role: Software Engineer

Category: System Design

Difficulty: easy

Interview Round: Technical Screen

Design a **distributed job scheduling system** (microservice-based, running in the cloud) with the following requirements: ### Functional Requirements 1. **Create scheduled jobs** - A client can create a job via an API. - For each job, the client specifies: - Job identifier (optional; otherwise generated by the system). - A **schedule** (e.g., cron-style, or "run at time T", or "every X minutes"). - The **task** to run (e.g., a script name, container image, or some executable description). - Resource requirements (e.g., CPU/memory tier or machine type) at a high level. - A **timeout** value. - A **timeout handler** (what to do if the job exceeds its timeout: kill, mark failed, retry, etc.). 2. **Run jobs reliably at the designated time** - Each job should run as close as possible to its scheduled time. - The system is **distributed**: jobs can run on many worker machines / microservices. - The system must tolerate machine failures and restarts. 3. **Handle jobs that don't finish on time** - If a job exceeds its declared timeout, the system must: - Detect this condition. - Execute the configured timeout behavior (e.g., kill the job, mark as failed, trigger a handler job, or reschedule, depending on your design). 4. **Query past job status** - Provide an API to query: - The status of a specific job run (e.g., PENDING, RUNNING, SUCCESS, FAILED, TIMED_OUT). - The history of runs for a job (e.g., last N runs, with timestamps and outcomes). 5. **Query job logs** - Provide an API to fetch **logs** for a past job run (e.g., stdout/stderr or structured logs). 6. **Distributed microservice model** - Assume the interviewer wants a **microservice-based**, horizontally scalable architecture. - Components should be cloud-hosted services (e.g., on containers/VMs) and able to scale independently. ### Non-Functional Requirements (assume reasonable scale) - Support on the order of **hundreds of thousands** of active scheduled jobs. - Jobs may be scheduled with **minute-level granularity** (or better, if you choose). - System should be **fault tolerant**: if a scheduler instance or worker crashes, jobs should still eventually run. - Favor **reliability over perfect exact timing** (e.g., running a job a few seconds late is acceptable, but skipping it completely is not). ### Expected Discussion Describe and justify: 1. **High-level architecture** - What microservices you would define (e.g., API service, job metadata service, scheduler, worker/runner service, log service, etc.). - How these services communicate (HTTP/REST, message queues, etc.). 2. **Data model and storage** - How you store job definitions and schedules. - How you store job run history and statuses. - How and where you store logs (e.g., blob/object storage vs database vs log store). - Choice of databases (relational vs NoSQL) and why. 3. **Scheduling logic** - How the system determines **which jobs to run at what time**. - How you implement a **distributed, fault-tolerant scheduler**: - How many scheduler instances are running. - How they share work or elect a leader. - How you avoid double-scheduling or missing jobs. 4. **Job execution** - How jobs are dispatched to worker machines. - How you track job start/end, collect exit codes, and enforce **timeouts**. - How you communicate logs back to the system. 5. **Reliability and failure handling** - What happens if a scheduler instance crashes. - What happens if a worker dies mid-job. - How you avoid running the **same job twice** vs. accepting at-least-once execution with idempotent jobs. - How you recover after a restart (e.g., replay job state from durable storage). 6. **APIs for querying status and logs** - Example REST endpoints for: - Creating/updating/deleting jobs. - Fetching job configuration. - Fetching job run history. - Fetching logs for a specific run. 7. **Scalability considerations** - How you scale horizontally when the number of jobs or workers increases. - Any sharding or partitioning of job data or schedules. Explain your design step by step, call out trade-offs, and explicitly mention any assumptions you make (e.g., exact vs approximate scheduling accuracy, at-least-once vs exactly-once execution semantics, log retention policies, etc.).

Quick Answer: This question evaluates understanding of distributed systems and microservice architecture, scheduling and timeout semantics, fault tolerance, scalability, and data modeling for job metadata, run history, and logs.

Related Interview Questions

  • Design a Photo Album App - Robinhood (medium)
  • Design a distributed job scheduler - Robinhood
  • Design a Photo Management Service - Robinhood (medium)
  • Design a job scheduler with SLA and logs - Robinhood (medium)
  • Design authorization and audit logging systems - Robinhood (medium)
Robinhood logo
Robinhood
Dec 8, 2025, 7:45 PM
Software Engineer
Technical Screen
System Design
21
0

Design a distributed job scheduling system (microservice-based, running in the cloud) with the following requirements:

Functional Requirements

  1. Create scheduled jobs
    • A client can create a job via an API.
    • For each job, the client specifies:
      • Job identifier (optional; otherwise generated by the system).
      • A schedule (e.g., cron-style, or "run at time T", or "every X minutes").
      • The task to run (e.g., a script name, container image, or some executable description).
      • Resource requirements (e.g., CPU/memory tier or machine type) at a high level.
      • A timeout value.
      • A timeout handler (what to do if the job exceeds its timeout: kill, mark failed, retry, etc.).
  2. Run jobs reliably at the designated time
    • Each job should run as close as possible to its scheduled time.
    • The system is distributed : jobs can run on many worker machines / microservices.
    • The system must tolerate machine failures and restarts.
  3. Handle jobs that don't finish on time
    • If a job exceeds its declared timeout, the system must:
      • Detect this condition.
      • Execute the configured timeout behavior (e.g., kill the job, mark as failed, trigger a handler job, or reschedule, depending on your design).
  4. Query past job status
    • Provide an API to query:
      • The status of a specific job run (e.g., PENDING, RUNNING, SUCCESS, FAILED, TIMED_OUT).
      • The history of runs for a job (e.g., last N runs, with timestamps and outcomes).
  5. Query job logs
    • Provide an API to fetch logs for a past job run (e.g., stdout/stderr or structured logs).
  6. Distributed microservice model
    • Assume the interviewer wants a microservice-based , horizontally scalable architecture.
    • Components should be cloud-hosted services (e.g., on containers/VMs) and able to scale independently.

Non-Functional Requirements (assume reasonable scale)

  • Support on the order of hundreds of thousands of active scheduled jobs.
  • Jobs may be scheduled with minute-level granularity (or better, if you choose).
  • System should be fault tolerant : if a scheduler instance or worker crashes, jobs should still eventually run.
  • Favor reliability over perfect exact timing (e.g., running a job a few seconds late is acceptable, but skipping it completely is not).

Expected Discussion

Describe and justify:

  1. High-level architecture
    • What microservices you would define (e.g., API service, job metadata service, scheduler, worker/runner service, log service, etc.).
    • How these services communicate (HTTP/REST, message queues, etc.).
  2. Data model and storage
    • How you store job definitions and schedules.
    • How you store job run history and statuses.
    • How and where you store logs (e.g., blob/object storage vs database vs log store).
    • Choice of databases (relational vs NoSQL) and why.
  3. Scheduling logic
    • How the system determines which jobs to run at what time .
    • How you implement a distributed, fault-tolerant scheduler :
      • How many scheduler instances are running.
      • How they share work or elect a leader.
      • How you avoid double-scheduling or missing jobs.
  4. Job execution
    • How jobs are dispatched to worker machines.
    • How you track job start/end, collect exit codes, and enforce timeouts .
    • How you communicate logs back to the system.
  5. Reliability and failure handling
    • What happens if a scheduler instance crashes.
    • What happens if a worker dies mid-job.
    • How you avoid running the same job twice vs. accepting at-least-once execution with idempotent jobs.
    • How you recover after a restart (e.g., replay job state from durable storage).
  6. APIs for querying status and logs
    • Example REST endpoints for:
      • Creating/updating/deleting jobs.
      • Fetching job configuration.
      • Fetching job run history.
      • Fetching logs for a specific run.
  7. Scalability considerations
    • How you scale horizontally when the number of jobs or workers increases.
    • Any sharding or partitioning of job data or schedules.

Explain your design step by step, call out trade-offs, and explicitly mention any assumptions you make (e.g., exact vs approximate scheduling accuracy, at-least-once vs exactly-once execution semantics, log retention policies, etc.).

Solution

Show

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More System Design•More Robinhood•More Software Engineer•Robinhood Software Engineer•Robinhood System Design•Software Engineer System Design
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.