GPU-Aware Pod Scheduler and Cluster Manager (OO Design)
Context
You are designing a simplified, object-oriented cluster manager with a GPU-aware pod scheduler. Nodes provide a fixed number of GPUs. Pods request a fixed number of GPUs and must be placed on a single node with enough free GPUs.
Each Node has the shape: { name: string, total_gpu: int, running_pods: Pod[] }
Each Pod has the shape: { name: string, gpu_required: int }
Assume pods cannot be split across nodes and GPUs are fungible (no topology/NUMA awareness).
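A minimal Python sketch of these two shapes, assuming dataclasses; the free_gpu helper is an addition for convenience, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    gpu_required: int

@dataclass
class Node:
    name: str
    total_gpu: int
    running_pods: List[Pod] = field(default_factory=list)

    @property
    def free_gpu(self) -> int:
        # GPUs not yet claimed by any running pod on this node.
        return self.total_gpu - sum(p.gpu_required for p in self.running_pods)
```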
Requirements
Implement APIs (a minimal class skeleton follows the list):
- add_node(name, total_gpu)
- remove_node(name)
- add_pod(name, gpu_required)
- schedule_pod(pod_name): assigns the pod to a node with enough free GPUs
- remove_pod(pod_name)
- get_node_utilization(name)
- list_nodes() / list_pods()
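One way this API surface could be held behind a single manager class, building on the Node and Pod sketch above. Dict-backed name lookups are the obvious choice; the error and idempotency behavior shown is an assumption, since the spec leaves it open:

```python
from typing import Dict, List

class ClusterManager:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}       # node name -> Node
        self.pods: Dict[str, Pod] = {}         # pod name  -> Pod
        self.assignment: Dict[str, str] = {}   # pod name  -> node name

    def add_node(self, name: str, total_gpu: int) -> None:
        # Idempotent: re-adding the same node with the same capacity is a
        # no-op; a conflicting re-add is rejected.
        existing = self.nodes.get(name)
        if existing is not None:
            if existing.total_gpu != total_gpu:
                raise ValueError(f"node {name!r} exists with different capacity")
            return
        self.nodes[name] = Node(name, total_gpu)

    def add_pod(self, name: str, gpu_required: int) -> None:
        # Idempotent: an existing pod of the same name is left untouched.
        if name not in self.pods:
            self.pods[name] = Pod(name, gpu_required)

    def get_node_utilization(self, name: str) -> float:
        # Fraction of the node's GPUs currently claimed by running pods.
        node = self.nodes[name]
        used = node.total_gpu - node.free_gpu
        return used / node.total_gpu if node.total_gpu else 0.0

    def list_nodes(self) -> List[Node]:
        return list(self.nodes.values())

    def list_pods(self) -> List[Pod]:
        return list(self.pods.values())

    # schedule_pod, remove_pod, and remove_node are sketched after the
    # requirements list below, since they touch the free-GPU index.
```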
Also provide:
- Data structures to support efficient lookups of nodes by available GPUs and pods by name (see the index sketch after this list).
- A placement strategy (e.g., best-fit or first-fit) and its justification (see the schedule_pod sketch below).
- How to update the indexes on every add/remove/schedule/evict operation.
- Concurrency control for simultaneous adds/schedules, idempotency, and failure handling, e.g., removing a node that still has running pods, or rescheduling pods when their node is removed (see the locking sketch below).
- Time and space complexity for each API.
- Pseudocode for schedule_pod using your chosen strategy.
- Edge cases, including a pod whose gpu_required exceeds total_gpu on every node (permanently unschedulable), and fragmentation when many small pods occupy a large node so that aggregate free capacity exists but no single node can fit a big pod.
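For the lookup structures, one plausible design: a dict from pod name to Pod and from node name to Node (both O(1)), plus a sorted sequence of (free_gpu, node_name) pairs so a best-fit query is a binary search. The sketch below keeps that sequence with the standard-library bisect module, where insertion costs O(n) due to list shifting; a balanced tree or sortedcontainers.SortedList would bring that to O(log n). All names here are illustrative:

```python
import bisect
from typing import List, Optional, Tuple

class FreeGpuIndex:
    """Nodes keyed by current free GPUs, kept sorted for best-fit queries."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, str]] = []   # sorted (free_gpu, node_name)

    def add(self, free_gpu: int, node_name: str) -> None:
        bisect.insort(self._entries, (free_gpu, node_name))

    def remove(self, free_gpu: int, node_name: str) -> None:
        i = bisect.bisect_left(self._entries, (free_gpu, node_name))
        if i < len(self._entries) and self._entries[i] == (free_gpu, node_name):
            self._entries.pop(i)

    def best_fit(self, gpu_required: int) -> Optional[str]:
        # First entry with free_gpu >= gpu_required is the tightest fit,
        # i.e. the node that leaves the least stranded capacity.
        i = bisect.bisect_left(self._entries, (gpu_required, ""))
        return self._entries[i][1] if i < len(self._entries) else None
```

The invariant that keeps this correct: every mutation of a node's pods removes its old (free_gpu, name) key and re-inserts the new one, so the index never goes stale.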
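With that index, a best-fit schedule_pod might look like the following. Best-fit is defensible here because it preserves large contiguous free blocks for big pods, directly reducing the fragmentation edge case, at the cost of packing some nodes tightly; first-fit is cheaper per decision but strands capacity sooner. This sketch assumes ClusterManager also carries a self.index = FreeGpuIndex() kept in sync by add_node/remove_node; the error types are illustrative:

```python
    def schedule_pod(self, pod_name: str) -> str:
        pod = self.pods.get(pod_name)
        if pod is None:
            raise KeyError(f"unknown pod {pod_name!r}")
        if pod_name in self.assignment:
            return self.assignment[pod_name]      # idempotent: already placed

        node_name = self.index.best_fit(pod.gpu_required)
        if node_name is None:
            # May be transient (cluster full) or permanent
            # (gpu_required > total_gpu of every node).
            raise RuntimeError(f"no node has {pod.gpu_required} free GPUs")

        node = self.nodes[node_name]
        self.index.remove(node.free_gpu, node.name)   # drop stale key ...
        node.running_pods.append(pod)
        self.index.add(node.free_gpu, node.name)      # ... re-insert fresh key
        self.assignment[pod_name] = node.name
        return node.name
```

remove_pod mirrors the same remove-key/mutate/re-add-key dance in reverse. In this bisect-based sketch, schedule_pod costs O(log n) to find the node plus O(n) for the list re-insertion; a tree-backed index makes the whole operation O(log n).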
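For concurrency, a single coarse lock around every mutating API is the simplest correct answer for an in-memory manager; per-node locks buy parallelism but invite lock-ordering bugs during rescheduling. A sketch building on the classes above, where the eviction policy (best-effort rescheduling, with unfittable pods left pending) is an assumption the spec leaves open:

```python
import threading

class ThreadSafeClusterManager(ClusterManager):
    def __init__(self) -> None:
        super().__init__()
        # Reentrant, so remove_node can call schedule_pod under the same lock.
        self._lock = threading.RLock()

    def schedule_pod(self, pod_name: str) -> str:
        with self._lock:
            return super().schedule_pod(pod_name)

    def remove_node(self, name: str) -> None:
        with self._lock:
            node = self.nodes.pop(name, None)
            if node is None:
                return                               # idempotent: already gone
            self.index.remove(node.free_gpu, node.name)
            # Evict and best-effort reschedule; pods that fit nowhere stay
            # pending (visible via list_pods, with no assignment entry).
            for pod in list(node.running_pods):
                self.assignment.pop(pod.name, None)
                try:
                    self.schedule_pod(pod.name)
                except RuntimeError:
                    pass
```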