PracHub

Design Distributed Cloud Job Scheduling

Last updated: Jun 5, 2026

Quick Overview

This question evaluates proficiency in distributed systems design, resource scheduling and placement, stateful job orchestration, API and data-model design, fault-tolerance, and reconciliation for robust cloud job execution.

  • medium
  • Mithril
  • System Design
  • Software Engineer

Company: Mithril

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

May 18, 2026, 12:00 AM

Design a distributed cloud job scheduling system.

Users submit jobs that require resources such as CPU count, memory, GPU type, GPU count, region, and optional placement constraints. The system must assign each job to a suitable machine, run it, and expose job status to users.
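One way to make the resource requirements concrete is a small request type plus a filter-then-rank placement function. The sketch below is illustrative only: the names `ResourceRequest`, `Machine`, `fits`, and `pick_machine` are hypothetical, and the best-fit ranking (fewest leftover GPUs, then fewest leftover CPUs) is an assumed policy, not something the question prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceRequest:
    """Resources a job asks for (hypothetical field names)."""
    cpus: int
    memory_gb: int
    region: str
    gpu_type: Optional[str] = None
    gpu_count: int = 0


@dataclass
class Machine:
    """A snapshot of one machine's free capacity (hypothetical shape)."""
    machine_id: str
    region: str
    free_cpus: int
    free_memory_gb: int
    gpu_type: Optional[str]
    free_gpus: int
    healthy: bool = True


def fits(req: ResourceRequest, m: Machine) -> bool:
    """Hard constraints: health, region, CPU, memory, and GPU type/count."""
    if not m.healthy or m.region != req.region:
        return False
    if m.free_cpus < req.cpus or m.free_memory_gb < req.memory_gb:
        return False
    if req.gpu_count > 0:
        if m.gpu_type != req.gpu_type or m.free_gpus < req.gpu_count:
            return False
    return True


def pick_machine(req: ResourceRequest, machines: List[Machine]) -> Optional[Machine]:
    """Filter by hard constraints, then rank by an assumed best-fit policy."""
    candidates = [m for m in machines if fits(req, m)]
    if not candidates:
        return None  # caller leaves the job queued
    # Prefer the machine with the least leftover capacity to limit fragmentation;
    # machine_id breaks ties so the choice is deterministic.
    return min(
        candidates,
        key=lambda m: (m.free_gpus - req.gpu_count, m.free_cpus - req.cpus, m.machine_id),
    )
```

Ranking by leftover GPUs first also steers CPU-only jobs toward machines with few free GPUs, which keeps GPU capacity available for jobs that need it. Optional placement constraints could be added as extra predicates inside `fits`.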

Functional requirements:

  • Submit a job with resource requirements and execution metadata.
  • Query job status.
  • Cancel a job.
  • Match jobs to machines based on available CPU, memory, GPU resources, GPU type, region, health, and current capacity.
  • Track job states such as submitted, queued, scheduled, running, succeeded, failed, and canceled.
  • Retry failed jobs when appropriate.
  • Detect worker or machine failures through heartbeats.
  • Recover from inconsistent states caused by scheduler crashes, worker crashes, queue delays, duplicate messages, or machine loss.
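The job states, retry rule, and heartbeat check listed above can be sketched as an explicit transition table plus two small helpers. This is a minimal sketch under stated assumptions: the allowed transitions, the `should_retry` rule, and the timeout-based `stale_workers` check are all hypothetical choices, and a real design would also distinguish retryable from non-retryable failure causes in more detail.

```python
from enum import Enum
from typing import Dict, List, Set


class JobState(str, Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumed transition table. SCHEDULED -> QUEUED covers a machine that never
# acknowledged the assignment; FAILED -> QUEUED is the retry path.
TRANSITIONS: Dict[JobState, Set[JobState]] = {
    JobState.SUBMITTED: {JobState.QUEUED, JobState.CANCELED},
    JobState.QUEUED: {JobState.SCHEDULED, JobState.CANCELED},
    JobState.SCHEDULED: {JobState.RUNNING, JobState.QUEUED, JobState.FAILED, JobState.CANCELED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELED},
    JobState.FAILED: {JobState.QUEUED},
    JobState.SUCCEEDED: set(),
    JobState.CANCELED: set(),
}


def can_transition(current: JobState, target: JobState) -> bool:
    """Reject any state change the table does not list (e.g. stale messages)."""
    return target in TRANSITIONS[current]


def should_retry(attempt: int, max_attempts: int, retryable: bool) -> bool:
    """Retry only retryable failures that still have attempts left."""
    return retryable and attempt < max_attempts


def stale_workers(last_heartbeat: Dict[str, float], now: float, timeout_s: float) -> List[str]:
    """Workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout_s)
```

Jobs in `scheduled` or `running` on a worker returned by `stale_workers` would then be moved to `failed` (and possibly back to `queued`) through `can_transition`, so a late message from the lost worker cannot resurrect a terminal job.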

Discuss APIs, data models, architecture, scheduling logic, failure handling, retry behavior, heartbeats, reconciliation, and how to ensure the job database is the source of truth rather than the message queue.
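A common way to keep the job database authoritative is to treat queue messages as hints and make every state change a conditional update against the stored state. The sketch below uses Python's standard `sqlite3` module purely for illustration; the `jobs` table layout and the function names `open_db`, `submit`, `try_schedule`, and `stuck_jobs` are assumptions, and a production system would use its own database and schema.

```python
import sqlite3
from typing import List


def open_db() -> sqlite3.Connection:
    """In-memory database with an assumed minimal jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " machine_id TEXT,"
        " updated_at REAL NOT NULL)"
    )
    return conn


def submit(conn: sqlite3.Connection, job_id: str, now: float) -> None:
    """Record the job durably before any message is published."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO jobs (job_id, state, machine_id, updated_at) VALUES (?, 'queued', NULL, ?)",
            (job_id, now),
        )


def try_schedule(conn: sqlite3.Connection, job_id: str, machine_id: str, now: float) -> bool:
    """Compare-and-set: only a job still in 'queued' can become 'scheduled'.

    A duplicate or delayed queue message hits zero rows and is ignored.
    """
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET state = 'scheduled', machine_id = ?, updated_at = ?"
            " WHERE job_id = ? AND state = 'queued'",
            (machine_id, now, job_id),
        )
    return cur.rowcount == 1


def stuck_jobs(conn: sqlite3.Connection, state: str, now: float, max_age_s: float) -> List[str]:
    """Reconciliation query: jobs that have sat in one state for too long."""
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE state = ? AND updated_at < ? ORDER BY job_id",
        (state, now - max_age_s),
    ).fetchall()
    return [r[0] for r in rows]
```

With this shape, a periodic reconciler can call `stuck_jobs` and re-publish a message for each result. Losing or duplicating a message is harmless because the conditional update decides what actually happens.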
