System Design: Hybrid-Cloud GPU Resource Allocation & Job Management
Design a service that allocates and manages GPU resources for ML training jobs in a hybrid cloud environment (some GPUs on-prem, some in a public cloud).
Core requirements
- Hybrid cloud scheduling
  - Select where to run jobs (on-prem vs. cloud) based on GPU availability and policy.
- User job submission
  - Users can upload or reference:
    - training program (code/container)
    - training data
    - resource needs (GPU type/count, CPU/RAM) and optional constraints (region, cost, priority).
- Long-running job monitoring
  - Track status for long-running training jobs: queued/running/completed/failed/canceled.
  - Provide logs/metrics and basic retry semantics.
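To make the submission requirements concrete, here is a minimal sketch of what a job specification and lifecycle states might look like. All field and class names (`JobSpec`, `gpu_type`, `data_uri`, etc.) are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class JobState(Enum):
    # Lifecycle states from the monitoring requirement.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

@dataclass
class JobSpec:
    """What a user submits; all names here are illustrative."""
    image: str                            # container image or code reference
    data_uri: str                         # object-store path to training data
    gpu_type: str                         # e.g. "A100"
    gpu_count: int
    cpu_cores: int
    ram_gb: int
    region: Optional[str] = None          # optional placement constraint
    max_cost_per_hour: Optional[float] = None
    priority: int = 0

spec = JobSpec(image="registry/train:v1", data_uri="s3://bucket/data",
               gpu_type="A100", gpu_count=8, cpu_cores=32, ram_gb=256)
```

A schema like this doubles as the request body for a submit-job API and the basis of the jobs table in the data model.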
Non-functional requirements (discuss and make reasonable assumptions)
- Multi-tenant fairness and quotas.
- Reliability and failure handling (node failure, cloud API failure).
- Security for code/data (isolation, IAM).
- Scalability: many concurrent users and jobs.
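One way to ground the fairness/quota requirement is a per-tenant GPU reservation check that the scheduler consults before placing a job. This is a sketch under assumed names (`QuotaTracker`, `try_reserve`), not a definitive design.

```python
from collections import defaultdict

class QuotaTracker:
    """Illustrative per-tenant GPU quota accounting."""
    def __init__(self, limits):
        self.limits = limits              # tenant -> max concurrent GPUs
        self.in_use = defaultdict(int)    # tenant -> GPUs currently allocated

    def try_reserve(self, tenant, gpus):
        # Reject (or queue) a job that would push the tenant over its limit.
        if self.in_use[tenant] + gpus > self.limits.get(tenant, 0):
            return False
        self.in_use[tenant] += gpus
        return True

    def release(self, tenant, gpus):
        # Called when a job completes, fails, or is canceled.
        self.in_use[tenant] = max(0, self.in_use[tenant] - gpus)

q = QuotaTracker({"team-a": 16})
ok = q.try_reserve("team-a", 8)        # within quota
blocked = q.try_reserve("team-a", 16)  # 8 + 16 would exceed the limit of 16
```

In a real system this state would live in a transactional store, not in memory, so that concurrent schedulers agree on usage.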
Deliverables
- High-level architecture and main components.
- APIs (submit job, get status, list jobs, cancel).
- Scheduling strategy and data model.
- How code and data are stored and made available to compute.
- Observability and operational concerns.
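As a starting point for the scheduling-strategy deliverable, one common policy is "prefer on-prem, spill to cloud": place a job on-prem when free capacity of the requested GPU type suffices, otherwise burst to the public cloud if policy allows, otherwise queue. The sketch below assumes hypothetical names (`choose_placement`, `allow_cloud`); it is one candidate policy, not the required answer.

```python
def choose_placement(spec, onprem_free, cloud_available=True):
    """Pick a target pool for a job.

    spec:        dict with gpu_type, gpu_count, and an optional
                 allow_cloud policy flag (illustrative fields).
    onprem_free: dict mapping gpu_type -> free GPU count on-prem.
    """
    # Prefer on-prem when enough GPUs of the requested type are free.
    if onprem_free.get(spec["gpu_type"], 0) >= spec["gpu_count"]:
        return "on-prem"
    # Otherwise burst to cloud, if the cloud is reachable and policy permits.
    if cloud_available and spec.get("allow_cloud", True):
        return "cloud"
    # No capacity anywhere acceptable: leave the job queued.
    return "queue"

job = {"gpu_type": "A100", "gpu_count": 4}
print(choose_placement(job, {"A100": 8}))  # → on-prem
```

Richer strategies layer cost, data locality, and priority preemption on top of this basic availability check.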