What difficulty level is this interview question?

This is a medium-difficulty System Design question, commonly asked during onsite rounds at Mithril.

What role is this question designed for?

This question is commonly asked of Software Engineer candidates at Mithril during technical interviews.

Design Distributed Cloud Job Scheduling | Mithril Interview Question

Q: Design Distributed Cloud Job Scheduling

This question evaluates proficiency in distributed systems design, resource scheduling and placement, stateful job orchestration, API and data-model design, fault-tolerance, and reconciliation for robust cloud job execution.

Design a distributed cloud job scheduling system.

Users submit jobs that require resources such as CPU count, memory, GPU type, GPU count, region, and optional placement constraints. The system must assign each job to a suitable machine, run it, and expose job status to users.
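
One way to picture the matching problem is as a feasibility check followed by a placement heuristic. The sketch below is illustrative only: the class and field names (JobSpec, Machine, fits, pick_machine) are assumptions, optional placement constraints are left out, and the best-fit rule is just one of several reasonable policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    # Hypothetical field names mirroring the resources a user requests.
    cpus: int
    memory_gb: int
    gpu_count: int = 0
    gpu_type: Optional[str] = None
    region: Optional[str] = None


@dataclass
class Machine:
    # Hypothetical view of a machine's free capacity and health.
    machine_id: str
    region: str
    gpu_type: Optional[str]
    free_cpus: int
    free_memory_gb: int
    free_gpus: int
    healthy: bool = True


def fits(job: JobSpec, m: Machine) -> bool:
    """Return True if the machine can host the job right now."""
    if not m.healthy:
        return False
    if job.region is not None and job.region != m.region:
        return False
    if job.gpu_count > 0:
        if job.gpu_type is not None and job.gpu_type != m.gpu_type:
            return False
        if job.gpu_count > m.free_gpus:
            return False
    return job.cpus <= m.free_cpus and job.memory_gb <= m.free_memory_gb


def pick_machine(job: JobSpec, machines: List[Machine]) -> Optional[Machine]:
    """Best-fit heuristic (an assumption): among feasible machines, prefer
    the one left with the fewest spare GPUs, then CPUs, then memory."""
    candidates = [m for m in machines if fits(job, m)]
    if not candidates:
        return None  # caller keeps the job queued until capacity appears
    return min(
        candidates,
        key=lambda m: (
            m.free_gpus - job.gpu_count,
            m.free_cpus - job.cpus,
            m.free_memory_gb - job.memory_gb,
        ),
    )
```

Best-fit tends to keep large GPU machines free for large jobs, at the cost of packing small machines tightly; a real scheduler would also need to reserve the chosen capacity atomically so two schedulers cannot place onto the same free resources.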

Functional requirements:

Submit a job with resource requirements and execution metadata.
Query job status.
Cancel a job.
Match jobs to machines based on available CPU, memory, GPU resources, GPU type, region, health, and current capacity.
Track job states such as submitted, queued, scheduled, running, succeeded, failed, and canceled.
Retry failed jobs when appropriate.
Detect worker or machine failures through heartbeats.
Recover from inconsistent states caused by scheduler crashes, worker crashes, queue delays, duplicate messages, or machine loss.
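
The state-tracking and retry requirements above can be sketched as an explicit state machine. This is a minimal illustration, not a prescribed design: the transition table, the max_attempts default, and the function names are all assumptions. The key property is that illegal or repeated transitions are rejected, so duplicate messages do not corrupt job state.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(str, Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


# Allowed transitions (an assumed policy). Terminal states have no entry,
# so nothing can leave succeeded, failed, or canceled.
ALLOWED = {
    JobState.SUBMITTED: {JobState.QUEUED, JobState.CANCELED},
    JobState.QUEUED: {JobState.SCHEDULED, JobState.CANCELED},
    JobState.SCHEDULED: {JobState.RUNNING, JobState.QUEUED,
                         JobState.FAILED, JobState.CANCELED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED,
                       JobState.QUEUED, JobState.CANCELED},
}


@dataclass
class Job:
    job_id: str
    state: JobState = JobState.SUBMITTED
    attempt: int = 0
    max_attempts: int = 3  # assumed retry budget


def transition(job: Job, new_state: JobState) -> bool:
    """Apply a transition if it is legal; return False for duplicate,
    late, or illegal events so that replays are harmless."""
    if new_state not in ALLOWED.get(job.state, set()):
        return False
    job.state = new_state
    return True


def on_attempt_failed(job: Job, retryable: bool) -> JobState:
    """Requeue a retryable failure while attempts remain, else mark failed."""
    if job.state not in (JobState.SCHEDULED, JobState.RUNNING):
        return job.state  # stale or duplicate failure report; ignore it
    job.attempt += 1
    if retryable and job.attempt < job.max_attempts:
        transition(job, JobState.QUEUED)
    else:
        transition(job, JobState.FAILED)
    return job.state
```

In an interview answer, "when appropriate" usually means distinguishing infrastructure failures (machine loss, preemption), which are retryable, from user errors (bad image, non-zero exit from user code), which typically are not.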

Discuss APIs, data models, architecture, scheduling logic, failure handling, retry behavior, heartbeats, and reconciliation. Explain how to ensure that the job database, rather than the message queue, is the source of truth.
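
To make the source-of-truth idea concrete, the sketch below treats queue messages as hints and lets only a versioned compare-and-set on the job row change state. Everything here is hypothetical: JobDB is an in-memory stand-in for a real database, the 30-second heartbeat timeout is an arbitrary value, and the function names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HEARTBEAT_TIMEOUT_S = 30.0  # assumed value; tune to the heartbeat interval


@dataclass
class JobRow:
    job_id: str
    state: str  # "queued", "scheduled", "running", ...
    machine_id: Optional[str] = None
    version: int = 0  # bumped on every write; used for compare-and-set


class JobDB:
    """In-memory stand-in for the job database (hypothetical interface)."""

    def __init__(self) -> None:
        self.rows: Dict[str, JobRow] = {}

    def compare_and_set(self, job_id: str, expected_version: int,
                        state: str, machine_id: Optional[str]) -> bool:
        row = self.rows.get(job_id)
        if row is None or row.version != expected_version:
            return False  # another writer changed the row first
        row.state, row.machine_id = state, machine_id
        row.version += 1
        return True


def handle_queue_message(db: JobDB, job_id: str, machine_id: str) -> bool:
    """Treat a queue message as a hint: claim the job only if the
    database still says it is queued."""
    row = db.rows.get(job_id)
    if row is None or row.state != "queued":
        return False  # duplicate, stale, or canceled; drop the message
    return db.compare_and_set(job_id, row.version, "scheduled", machine_id)


def reconcile(db: JobDB, last_heartbeat: Dict[str, float],
              now: float) -> List[str]:
    """Requeue jobs whose machine has missed heartbeats. Returns the job
    IDs so the caller can publish them to the queue again."""
    requeued = []
    for row in list(db.rows.values()):
        if row.state not in ("scheduled", "running"):
            continue
        seen = last_heartbeat.get(row.machine_id)
        if seen is None or now - seen > HEARTBEAT_TIMEOUT_S:
            if db.compare_and_set(row.job_id, row.version, "queued", None):
                requeued.append(row.job_id)
    return requeued
```

Because every state change goes through the database row, a lost, delayed, or duplicated queue message can at worst delay a job; a periodic reconcile pass can always rebuild the queue contents from rows that are still marked queued.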
