Design Distributed Cloud Job Scheduling
Company: Mithril
Role: Software Engineer
Category: System Design
Difficulty: Medium
Interview Round: Onsite
Design a distributed cloud job scheduling system.
Users submit jobs that require resources such as CPU count, memory, GPU type, GPU count, region, and optional placement constraints. The system must assign each job to a suitable machine, run it, and expose job status to users.
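One way to make the resource fields above concrete is a small data model plus a feasibility check. The following is a minimal sketch, not a prescribed answer: the names `JobSpec`, `Machine`, `fits`, and `pick_machine` are hypothetical, and the best-fit tie-breaking (prefer the machine with the fewest leftover GPUs, then CPUs, then memory) is an assumed heuristic chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    cpus: int
    memory_gb: int
    gpu_type: Optional[str] = None  # None means the job needs no GPU
    gpu_count: int = 0
    region: Optional[str] = None    # None means any region is acceptable


@dataclass
class Machine:
    machine_id: str
    region: str
    gpu_type: Optional[str]
    free_cpus: int
    free_memory_gb: int
    free_gpus: int
    healthy: bool = True


def fits(job: JobSpec, m: Machine) -> bool:
    """Return True if the machine can host the job with its current free capacity."""
    if not m.healthy:
        return False
    if job.region is not None and job.region != m.region:
        return False
    if job.gpu_count > 0:
        if job.gpu_type != m.gpu_type or job.gpu_count > m.free_gpus:
            return False
    return job.cpus <= m.free_cpus and job.memory_gb <= m.free_memory_gb


def pick_machine(job: JobSpec, machines: List[Machine]) -> Optional[Machine]:
    """Assumed best-fit heuristic: the feasible machine with the least leftover capacity."""
    candidates = [m for m in machines if fits(job, m)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda m: (
            m.free_gpus - job.gpu_count,
            m.free_cpus - job.cpus,
            m.free_memory_gb - job.memory_gb,
        ),
    )
```

Best fit tends to keep large machines free for large jobs, at the cost of packing small machines tightly; a spread or priority-based policy would be an equally valid choice to discuss.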
Functional requirements:
- Submit a job with resource requirements and execution metadata.
- Query job status.
- Cancel a job.
- Match jobs to machines based on available CPU, memory, GPU resources, GPU type, region, health, and current capacity.
- Track job states such as submitted, queued, scheduled, running, succeeded, failed, and canceled.
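- Retry failed jobs when appropriate, for example when the failure is transient (such as machine loss) and the retry limit has not been reached.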
- Detect worker or machine failures through heartbeats.
- Recover from inconsistent states caused by scheduler crashes, worker crashes, queue delays, duplicate messages, or machine loss.
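The state tracking, retry, and heartbeat requirements above can be sketched as an explicit transition table plus two small helpers. This is an illustrative sketch under stated assumptions: the allowed transitions, the `max_attempts` retry limit, the `retryable` flag, and the heartbeat timeout are all hypothetical design choices, not part of the original prompt.

```python
from enum import Enum
from typing import Dict, List


class State(str, Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumed transition table. SCHEDULED -> QUEUED covers a machine lost before start;
# FAILED -> QUEUED covers a retry of a failed attempt.
ALLOWED = {
    State.SUBMITTED: {State.QUEUED, State.CANCELED},
    State.QUEUED: {State.SCHEDULED, State.CANCELED},
    State.SCHEDULED: {State.RUNNING, State.QUEUED, State.FAILED, State.CANCELED},
    State.RUNNING: {State.SUCCEEDED, State.FAILED, State.CANCELED},
    State.FAILED: {State.QUEUED},
    State.SUCCEEDED: set(),
    State.CANCELED: set(),
}


def can_transition(current: State, new: State) -> bool:
    """Reject any state change that is not listed in the transition table."""
    return new in ALLOWED[current]


def state_after_failure(attempt: int, max_attempts: int, retryable: bool) -> State:
    """Requeue a retryable failure until the attempt budget is used up."""
    if retryable and attempt < max_attempts:
        return State.QUEUED
    return State.FAILED


def dead_workers(last_heartbeat: Dict[str, float], now: float, timeout_s: float) -> List[str]:
    """Return workers whose most recent heartbeat is older than the timeout."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout_s)
```

Treating succeeded and canceled as terminal states with no outgoing transitions makes late or duplicate status updates easy to reject.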
Discuss APIs, data models, architecture, scheduling logic, failure handling, retry behavior, heartbeats, reconciliation, and how to ensure the job database is the source of truth rather than the message queue.
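To illustrate what "the database is the source of truth" can mean in practice, the sketch below treats a queue message as a hint only: a worker must win a conditional update on the job row before it runs anything, so duplicate or stale messages are dropped. It uses an in-memory SQLite database purely for demonstration; the `jobs` schema, the `attempt` column, and the function names are assumptions for this example, and a production system would likely use a different store.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """Create an in-memory job table (hypothetical schema for illustration)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " attempt INTEGER NOT NULL,"
        " updated_at REAL NOT NULL)"
    )
    return db


def claim_job(db: sqlite3.Connection, job_id: str, attempt: int, now: float) -> bool:
    """Compare-and-set: move scheduled -> running only if the row still matches.

    A duplicate or outdated queue message fails the WHERE clause, updates
    zero rows, and is therefore ignored by the worker.
    """
    cur = db.execute(
        "UPDATE jobs SET state = 'running', updated_at = ?"
        " WHERE id = ? AND state = 'scheduled' AND attempt = ?",
        (now, job_id, attempt),
    )
    db.commit()
    return cur.rowcount == 1


def stale_queued_jobs(db: sqlite3.Connection, now: float, max_age_s: float) -> List[str]:
    """Reconciliation scan: queued jobs untouched for too long may have lost their
    queue message and are candidates for re-enqueueing."""
    rows = db.execute(
        "SELECT id FROM jobs WHERE state = 'queued' AND updated_at < ? ORDER BY id",
        (now - max_age_s,),
    )
    return [r[0] for r in rows]
```

Because every decision is checked against the job row, the queue can deliver late, twice, or not at all without corrupting job state; a periodic reconciler built on a scan like `stale_queued_jobs` repairs whatever the queue dropped.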
Quick Answer: This question evaluates proficiency in distributed systems design, resource scheduling and placement, stateful job orchestration, API and data-model design, fault tolerance, and state reconciliation for robust cloud job execution.