GPU-Aware Pod Scheduler and Cluster Manager (OO Design)
Context
You are designing a simplified, object-oriented cluster manager with a GPU-aware pod scheduler. Nodes provide a fixed number of GPUs. Pods request a fixed number of GPUs and must be placed on a single node with enough free GPUs.
Each Node has the shape: { name: string, total_gpu: int, running_pods: Pod[] }
Each Pod has the shape: { name: string, gpu_required: int }
Assume pods cannot be split across nodes and GPUs are fungible (no topology/NUMA awareness).
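A minimal Python sketch of these two shapes, assuming dataclasses; the free_gpu helper is an addition for convenience, not part of the spec:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    gpu_required: int

@dataclass
class Node:
    name: str
    total_gpu: int
    running_pods: List[Pod] = field(default_factory=list)

    @property
    def free_gpu(self) -> int:
        # GPUs not yet claimed by any running pod on this node.
        return self.total_gpu - sum(p.gpu_required for p in self.running_pods)
```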
Requirements
Implement APIs (a minimal class skeleton follows the list):
- add_node(name, total_gpu)
- remove_node(name)
- add_pod(name, gpu_required)
- schedule_pod(pod_name): assigns the pod to a node with enough free GPUs
- remove_pod(pod_name)
- get_node_utilization(name)
- list_nodes() / list_pods()
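One way this API surface could be held behind a single manager class, building on the Node and Pod sketch above. Dict-backed name lookups are the obvious choice; the error and idempotency behavior shown is an assumption, since the spec leaves it open:

```python
from typing import Dict, List

class ClusterManager:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}       # node name -> Node
        self.pods: Dict[str, Pod] = {}         # pod name  -> Pod
        self.assignment: Dict[str, str] = {}   # pod name  -> node name

    def add_node(self, name: str, total_gpu: int) -> None:
        # Idempotent: re-adding the same node with the same capacity is a
        # no-op; a conflicting re-add is rejected.
        existing = self.nodes.get(name)
        if existing is not None:
            if existing.total_gpu != total_gpu:
                raise ValueError(f"node {name!r} exists with different capacity")
            return
        self.nodes[name] = Node(name, total_gpu)

    def add_pod(self, name: str, gpu_required: int) -> None:
        # Idempotent: an existing pod of the same name is left untouched.
        if name not in self.pods:
            self.pods[name] = Pod(name, gpu_required)

    def get_node_utilization(self, name: str) -> float:
        # Fraction of the node's GPUs currently claimed by running pods.
        node = self.nodes[name]
        used = node.total_gpu - node.free_gpu
        return used / node.total_gpu if node.total_gpu else 0.0

    def list_nodes(self) -> List[Node]:
        return list(self.nodes.values())

    def list_pods(self) -> List[Pod]:
        return list(self.pods.values())

    # schedule_pod, remove_pod, and remove_node are sketched after the
    # requirements list below, since they touch the free-GPU index.
```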
Also provide:
- Data structures to support efficient lookups of nodes by available GPUs and pods by name (see the index sketch after this list).
- A placement strategy (e.g., best-fit or first-fit) and its justification (see the schedule_pod sketch below).
- How to update the indexes on every add/remove/schedule/evict operation.
- Concurrency control for simultaneous adds/schedules, idempotency, and failure handling, e.g., removing a node that still has running pods, or rescheduling pods when their node is removed (see the locking sketch below).
- Time and space complexity for each API.
- Pseudocode for schedule_pod using your chosen strategy.
- Edge cases, including a pod whose gpu_required exceeds total_gpu on every node (permanently unschedulable), and fragmentation when many small pods occupy a large node so that aggregate free capacity exists but no single node can fit a big pod.
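For the lookup structures, one plausible design: a dict from pod name to Pod and from node name to Node (both O(1)), plus a sorted sequence of (free_gpu, node_name) pairs so a best-fit query is a binary search. The sketch below keeps that sequence with the standard-library bisect module, where insertion costs O(n) due to list shifting; a balanced tree or sortedcontainers.SortedList would bring that to O(log n). All names here are illustrative:

```python
import bisect
from typing import List, Optional, Tuple

class FreeGpuIndex:
    """Nodes keyed by current free GPUs, kept sorted for best-fit queries."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, str]] = []   # sorted (free_gpu, node_name)

    def add(self, free_gpu: int, node_name: str) -> None:
        bisect.insort(self._entries, (free_gpu, node_name))

    def remove(self, free_gpu: int, node_name: str) -> None:
        i = bisect.bisect_left(self._entries, (free_gpu, node_name))
        if i < len(self._entries) and self._entries[i] == (free_gpu, node_name):
            self._entries.pop(i)

    def best_fit(self, gpu_required: int) -> Optional[str]:
        # First entry with free_gpu >= gpu_required is the tightest fit,
        # i.e. the node that leaves the least stranded capacity.
        i = bisect.bisect_left(self._entries, (gpu_required, ""))
        return self._entries[i][1] if i < len(self._entries) else None
```

The invariant that keeps this correct: every mutation of a node's pods removes its old (free_gpu, name) key and re-inserts the new one, so the index never goes stale.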
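With that index, a best-fit schedule_pod might look like the following. Best-fit is defensible here because it preserves large contiguous free blocks for big pods, directly reducing the fragmentation edge case, at the cost of packing some nodes tightly; first-fit is cheaper per decision but strands capacity sooner. This sketch assumes ClusterManager also carries a self.index = FreeGpuIndex() kept in sync by add_node/remove_node; the error types are illustrative:

```python
    def schedule_pod(self, pod_name: str) -> str:
        pod = self.pods.get(pod_name)
        if pod is None:
            raise KeyError(f"unknown pod {pod_name!r}")
        if pod_name in self.assignment:
            return self.assignment[pod_name]      # idempotent: already placed

        node_name = self.index.best_fit(pod.gpu_required)
        if node_name is None:
            # May be transient (cluster full) or permanent
            # (gpu_required > total_gpu of every node).
            raise RuntimeError(f"no node has {pod.gpu_required} free GPUs")

        node = self.nodes[node_name]
        self.index.remove(node.free_gpu, node.name)   # drop stale key ...
        node.running_pods.append(pod)
        self.index.add(node.free_gpu, node.name)      # ... re-insert fresh key
        self.assignment[pod_name] = node.name
        return node.name
```

remove_pod mirrors the same remove-key/mutate/re-add-key dance in reverse. In this bisect-based sketch, schedule_pod costs O(log n) to find the node plus O(n) for the list re-insertion; a tree-backed index makes the whole operation O(log n).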
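For concurrency, a single coarse lock around every mutating API is the simplest correct answer for an in-memory manager; per-node locks buy parallelism but invite lock-ordering bugs during rescheduling. A sketch building on the classes above, where the eviction policy (best-effort rescheduling, with unfittable pods left pending) is an assumption the spec leaves open:

```python
import threading

class ThreadSafeClusterManager(ClusterManager):
    def __init__(self) -> None:
        super().__init__()
        # Reentrant, so remove_node can call schedule_pod under the same lock.
        self._lock = threading.RLock()

    def schedule_pod(self, pod_name: str) -> str:
        with self._lock:
            return super().schedule_pod(pod_name)

    def remove_node(self, name: str) -> None:
        with self._lock:
            node = self.nodes.pop(name, None)
            if node is None:
                return                               # idempotent: already gone
            self.index.remove(node.free_gpu, node.name)
            # Evict and best-effort reschedule; pods that fit nowhere stay
            # pending (visible via list_pods, with no assignment entry).
            for pod in list(node.running_pods):
                self.assignment.pop(pod.name, None)
                try:
                    self.schedule_pod(pod.name)
                except RuntimeError:
                    pass
```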