Design scalable job scheduler and query dashboard

Q: Design scalable job scheduler and query dashboard

This question evaluates a candidate's ability to design a scalable, fault-tolerant job scheduling system and to support efficient, near-real-time time-window queries over scheduled jobs. It focuses on distributed scheduling, storage and indexing strategies, partitioning, and reliability guarantees. It is commonly used to probe trade-offs among scalability, latency, availability, and consistency. As a System Design question, it tests both high-level architectural thinking and practical concerns such as data modeling, query and index design, hot partitions, and horizontal scaling.

Question

Design a scalable, fault-tolerant job scheduling system.

The system should allow clients to schedule background jobs (for example, sending emails or running batch computations) to be executed at specific future times, and possibly on a recurring basis.
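To make the one-time vs. recurring distinction concrete, here is a minimal Python sketch of how a scheduler might compute a job's next execution time. The `Job` fields and the `next_run` helper are hypothetical. The sketch assumes a fixed-interval recurrence; cron-style schedules would need a proper cron parser instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Job:
    # Hypothetical fields for illustration; a real schema would add payload, owner, status, etc.
    job_id: str
    run_at: datetime                        # next scheduled execution time (assumed UTC)
    interval: Optional[timedelta] = None    # None means a one-time job


def next_run(job: Job, now: datetime) -> Optional[datetime]:
    """Return the next execution time strictly after `now`, or None if the job never runs again."""
    if job.interval is None:
        # One-time job: it runs once at run_at and never afterwards.
        return job.run_at if job.run_at > now else None
    if job.run_at > now:
        return job.run_at
    # Recurring job: skip whole missed intervals instead of replaying each one.
    missed = (now - job.run_at) // job.interval + 1
    return job.run_at + missed * job.interval
```

Skipping missed intervals, rather than replaying each one, is only one possible misfire policy. Whichever policy is chosen should be stated explicitly in the functional requirements.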

Then, as a follow-up, design an efficient way to query the jobs scheduled in the next N hours, to power a near-real-time dashboard that shows upcoming jobs.
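The core of the follow-up is to keep jobs ordered by their next run time, so that "next N hours" becomes a range scan instead of a full table scan. The sketch below uses a sorted in-memory list as a stand-in for a database index on `(run_at, job_id)`. The class and method names are hypothetical, and times are assumed to be Unix epoch seconds.

```python
import bisect
from typing import List, Tuple


class UpcomingIndex:
    """Hypothetical in-memory stand-in for an index ordered by (run_at, job_id)."""

    def __init__(self) -> None:
        # Entries are (run_at_epoch_seconds, job_id), kept sorted so lookups are O(log n).
        self._entries: List[Tuple[int, str]] = []

    def add(self, run_at: int, job_id: str) -> None:
        bisect.insort(self._entries, (run_at, job_id))

    def remove(self, run_at: int, job_id: str) -> None:
        # Used when a job is cancelled or rescheduled.
        i = bisect.bisect_left(self._entries, (run_at, job_id))
        if i < len(self._entries) and self._entries[i] == (run_at, job_id):
            del self._entries[i]

    def next_n_hours(self, now: int, n_hours: int, limit: int = 100) -> List[Tuple[int, str]]:
        """Return up to `limit` jobs with now <= run_at < now + n_hours, in run_at order."""
        end = now + n_hours * 3600
        # A 1-tuple (t,) sorts before every (t, job_id), so these give half-open bounds.
        lo = bisect.bisect_left(self._entries, (now,))
        hi = bisect.bisect_left(self._entries, (end,))
        return self._entries[lo:min(hi, lo + limit)]
```

In a relational store, the equivalent would be a B-tree index on `run_at` queried with `run_at >= now AND run_at < now + N hours ORDER BY run_at LIMIT k`. The cost then grows with the number of matching rows, not with the table size. Because N is only a parameter of the range, varying it per request needs no extra index.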

Describe:

- Functional requirements (e.g., create/update/cancel jobs, one-time vs. recurring jobs, job execution guarantees).
- Non-functional requirements (e.g., scale, latency, reliability, availability, consistency expectations).
- A high-level architecture for the job scheduler:
  - Main components (API layer, scheduler, workers, storage, queues, etc.).
  - How jobs are stored, assigned to workers, and executed at (approximately) the right time.
  - How you ensure reliability (no lost jobs, minimal duplicates) and fault tolerance.
- A data model for jobs (which fields you store and how you index them).
- An API and storage/query design to efficiently fetch all jobs scheduled between now and now + N hours for a dashboard. Assume:
  - There can be a very large number of jobs.
  - The dashboard needs low-latency, high-QPS reads.
  - N may vary per request (e.g., 1, 6, or 24 hours).
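One common way to get "no job lost, minimal duplicates" is lease-based claiming with at-least-once delivery. A worker takes a time-limited lease on a due job; if it crashes before acknowledging, the lease expires and another worker picks the job up. The toy `JobStore` below is a hypothetical, single-process stand-in for a durable job table. In practice the claim step is often an atomic database update (for example PostgreSQL's `SELECT ... FOR UPDATE SKIP LOCKED`) rather than an in-memory heap.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class JobStore:
    """Hypothetical single-process stand-in for a durable job table."""

    # Min-heap of (run_at, job_id) for jobs waiting to be claimed.
    due: List[Tuple[float, str]] = field(default_factory=list)
    # job_id -> (worker_id, lease_expiry) for jobs currently being executed.
    leases: Dict[str, Tuple[str, float]] = field(default_factory=dict)
    # job_ids acknowledged by their current lease holder.
    done: Set[str] = field(default_factory=set)

    def schedule(self, job_id: str, run_at: float) -> None:
        heapq.heappush(self.due, (run_at, job_id))

    def claim(self, worker: str, now: float, lease_s: float) -> Optional[str]:
        """Lease the earliest due job to `worker`, or return None if nothing is due."""
        # Return jobs with expired leases to the queue; their worker presumably crashed.
        for job_id, (_, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[job_id]
                heapq.heappush(self.due, (expiry, job_id))
        if self.due and self.due[0][0] <= now:
            _, job_id = heapq.heappop(self.due)
            self.leases[job_id] = (worker, now + lease_s)
            return job_id
        return None

    def complete(self, job_id: str, worker: str) -> bool:
        """Acknowledge a job; only the current lease holder's ack is accepted."""
        holder = self.leases.get(job_id)
        if holder is None or holder[0] != worker:
            return False  # stale or unknown worker: the job may already be re-leased
        del self.leases[job_id]
        self.done.add(job_id)
        return True
```

A stale worker may already have performed the side effect before its lease expired, so execution is at-least-once. Handlers should therefore deduplicate on `job_id` (or a separate idempotency key) to keep duplicates harmless. For recurring jobs, `complete` would also insert the next occurrence.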

Explain your trade-offs, including:

- How you partition or index data to support both scheduling and time-window queries.
- How you would avoid full table scans when querying the next N hours.
- How you would handle hot partitions (e.g., many jobs around the same time) and horizontal scaling.
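To combine time-range queries with protection against hot partitions, one option is a composite partition key of (time bucket, hash shard). The bucket keeps a "next N hours" query confined to a few partitions, while the shard spreads a burst of jobs in the same hour across several. The sketch below assumes one-hour buckets and 8 shards per bucket. `partition_key` and `BucketedStore` are hypothetical names, and the in-memory dict stands in for a partitioned store.

```python
import heapq
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

BUCKET_S = 3600  # assumed bucket width: one hour
SHARDS = 8       # assumed number of sub-shards per bucket, to spread hot hours


def partition_key(job_id: str, run_at: int) -> Tuple[int, int]:
    """(time bucket, shard): the bucket supports range queries, the shard spreads load."""
    bucket = run_at // BUCKET_S
    # Stable hash (Python's built-in hash() is salted per process, so it is avoided here).
    shard = zlib.crc32(job_id.encode()) % SHARDS
    return bucket, shard


class BucketedStore:
    def __init__(self) -> None:
        # Each partition keeps its rows sorted by run_at, like a clustering key.
        self.partitions: Dict[Tuple[int, int], List[Tuple[int, str]]] = defaultdict(list)

    def put(self, job_id: str, run_at: int) -> None:
        rows = self.partitions[partition_key(job_id, run_at)]
        rows.append((run_at, job_id))
        rows.sort()  # simple for a sketch; a real store keeps rows ordered on write

    def upcoming(self, now: int, n_hours: int) -> List[Tuple[int, str]]:
        """Jobs with now <= run_at < now + n_hours, merged across partitions in time order."""
        end = now + n_hours * 3600
        first, last = now // BUCKET_S, (end - 1) // BUCKET_S
        # Read only the covered buckets x SHARDS partitions, never the whole table.
        runs = [self.partitions.get((b, s), [])
                for b in range(first, last + 1) for s in range(SHARDS)]
        return [(t, j) for t, j in heapq.merge(*runs) if now <= t < end]
```

The trade-off is read fan-out: a query for N hours touches at most (N + 1) × SHARDS partitions, which can be read in parallel and merged. The shard count could be raised only for buckets known to be hot. Because partitions are small, independent units, they can be spread across more nodes as the system scales horizontally.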
