PracHub

Design a hybrid-cloud GPU allocation service

Last updated: Mar 29, 2026

Quick Overview

This question evaluates system-design competency across distributed resource management, hybrid-cloud scheduling, GPU allocation, multi-tenant fairness, reliability, security, API design, and observability.

Company: Mithril

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Onsite

Feb 12, 2026, 12:00 AM

System Design: Hybrid-Cloud GPU Resource Allocation & Job Management

Design a service that allocates and manages GPU resources for ML training jobs in a hybrid cloud environment (some GPUs on-prem, some in a public cloud).

Core requirements

  1. Hybrid cloud scheduling
    • Select where to run jobs (on-prem vs cloud) based on GPU availability and policy.
  2. User job submission
    • Users can upload or reference:
      • training program (code/container)
      • training data
      • resource needs (GPU type/count, CPU/RAM) and optional constraints (region, cost, priority).
  3. Long-running job monitoring
    • Track status for long-running training jobs: queued/running/completed/failed/canceled.
    • Provide logs/metrics and basic retry semantics.
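
One way to make the first requirement concrete is a placement function that filters GPU pools by the job's constraints and then ranks what is left. The sketch below is only an illustration under stated assumptions: the `Pool` and `PlacementRequest` types, the "prefer on-prem, then cheapest cloud" policy, the pool names, and the prices are all hypothetical and not part of the question. It also assumes gang placement, meaning all GPUs for a job come from one pool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pool:
    """A group of interchangeable GPUs at one site (hypothetical model)."""
    name: str
    site: str                 # "on_prem" or "cloud"
    region: str
    gpu_type: str
    free_gpus: int
    cost_per_gpu_hour: float  # illustrative numbers only


@dataclass
class PlacementRequest:
    gpu_type: str
    gpu_count: int
    region: Optional[str] = None                 # optional constraint
    max_cost_per_gpu_hour: Optional[float] = None  # optional constraint


def place(req: PlacementRequest, pools: List[Pool]) -> Optional[Pool]:
    """Return the pool to run on, or None if the job must stay queued."""
    candidates = [
        p for p in pools
        if p.gpu_type == req.gpu_type
        and p.free_gpus >= req.gpu_count  # whole job must fit in one pool
        and (req.region is None or p.region == req.region)
        and (req.max_cost_per_gpu_hour is None
             or p.cost_per_gpu_hour <= req.max_cost_per_gpu_hour)
    ]
    if not candidates:
        return None
    # Assumed policy: on-prem first (False sorts before True), then lowest cost.
    return min(candidates, key=lambda p: (p.site != "on_prem", p.cost_per_gpu_hour))


pools = [
    Pool("dc1", "on_prem", "us-west", "A100", 4, 0.0),
    Pool("cloud-a", "cloud", "us-west", "A100", 64, 3.0),
    Pool("cloud-b", "cloud", "us-east", "A100", 64, 2.5),
]
small = place(PlacementRequest("A100", 2), pools)
burst = place(PlacementRequest("A100", 8), pools)
```

Here the 2-GPU request fits on-prem, while the 8-GPU request exceeds on-prem capacity and bursts to the cheaper cloud pool. In an interview it is worth stating the ranking policy explicitly, since data locality or egress cost could reasonably change the order.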

Non-functional requirements (discuss and make reasonable assumptions)

  • Multi-tenant fairness and quota.
  • Reliability and failure handling (node failure, cloud API failure).
  • Security for code/data (isolation, IAM).
  • Scalability: many concurrent users and jobs.
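
For the fairness and quota bullet, one reasonable assumption is a hard per-tenant GPU cap combined with a simple fair-share ordering: among jobs that fit within quota, admit the one whose tenant is currently using the smallest fraction of its quota. The names below (`Tenant`, `PendingJob`, `pick_next`) are hypothetical, and this is a minimal sketch rather than a full scheduler; it ignores priorities, preemption, and starvation of large jobs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tenant:
    name: str
    gpu_quota: int        # assumed hard cap on concurrently held GPUs
    gpus_in_use: int = 0


@dataclass
class PendingJob:
    job_id: str
    tenant: str
    gpus: int             # assumed to be at least 1


def pick_next(pending: List[PendingJob],
              tenants: Dict[str, Tenant]) -> Optional[PendingJob]:
    """Pick the next job to admit, or None if every job would exceed quota."""
    admissible = [
        j for j in pending
        if tenants[j.tenant].gpus_in_use + j.gpus <= tenants[j.tenant].gpu_quota
    ]
    if not admissible:
        return None

    def usage_fraction(job: PendingJob) -> float:
        t = tenants[job.tenant]
        # Safe: an admissible job with gpus >= 1 implies gpu_quota >= 1.
        return t.gpus_in_use / t.gpu_quota

    # min() returns the first minimum, so ties fall back to queue (FIFO) order.
    return min(admissible, key=usage_fraction)


tenants = {"a": Tenant("a", 8, 6), "b": Tenant("b", 8, 2)}
pending = [
    PendingJob("j1", "a", 2),
    PendingJob("j2", "b", 4),
    PendingJob("j3", "a", 4),
]
chosen = pick_next(pending, tenants)
```

With these illustrative numbers, `j3` is rejected by quota and `j2` is chosen ahead of `j1` because tenant `b` is further below its share.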

Deliverables

  • High-level architecture and main components.
  • APIs (submit job, get status, list jobs, cancel).
  • Scheduling strategy and data model.
  • How data/code is stored and made available to compute.
  • Observability and operational concerns.
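
The API and data-model deliverables can be sketched as an in-memory service that exposes the four listed operations and tracks the five job states from the requirements. Everything beyond those operation and state names is an assumption made for illustration: the field names, the `mark_running` and `mark_failed` hooks (imagined as calls from the scheduler or node agents), and the retry rule of requeueing a failed attempt until `max_retries` is exhausted. A real service would persist jobs in a database and sit behind an authenticated HTTP or gRPC layer.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


TERMINAL = {JobState.COMPLETED, JobState.FAILED, JobState.CANCELED}


@dataclass
class Job:
    job_id: str
    tenant: str
    image: str       # reference to the training container
    data_uri: str    # reference to training data in object storage
    gpu_type: str
    gpu_count: int
    max_retries: int = 2
    attempts: int = 0
    state: JobState = JobState.QUEUED


class JobService:
    """In-memory stand-in for the job API (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def submit(self, tenant: str, image: str, data_uri: str,
               gpu_type: str, gpu_count: int, max_retries: int = 2) -> str:
        if gpu_count < 1:
            raise ValueError("gpu_count must be at least 1")
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job(job_id, tenant, image, data_uri,
                                 gpu_type, gpu_count, max_retries)
        return job_id

    def get_status(self, job_id: str) -> JobState:
        return self._jobs[job_id].state

    def list_jobs(self, tenant: str) -> List[Job]:
        return [j for j in self._jobs.values() if j.tenant == tenant]

    def cancel(self, job_id: str) -> bool:
        job = self._jobs[job_id]
        if job.state in TERMINAL:
            return False  # nothing to cancel
        job.state = JobState.CANCELED
        return True

    # Internal hooks, assumed to be called by the scheduler or node agents.
    def mark_running(self, job_id: str) -> None:
        job = self._jobs[job_id]
        if job.state is not JobState.QUEUED:
            raise ValueError(f"cannot start job in state {job.state.value}")
        job.attempts += 1
        job.state = JobState.RUNNING

    def mark_failed(self, job_id: str) -> None:
        job = self._jobs[job_id]
        if job.state is not JobState.RUNNING:
            raise ValueError(f"cannot fail job in state {job.state.value}")
        # Requeue while retry budget remains; otherwise fail terminally.
        retry = job.attempts <= job.max_retries
        job.state = JobState.QUEUED if retry else JobState.FAILED


svc = JobService()
jid = svc.submit("team-a", "registry.example/train:1", "s3://bucket/data",
                 "A100", 4, max_retries=1)
```

With `max_retries=1`, the job gets two attempts in total: the first failure requeues it and the second moves it to `failed`. Keeping the retry decision in one place makes it easier to later distinguish infrastructure failures (node loss, cloud API errors) from user-code failures, which usually should not be retried.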

