Design a GPU-aware pod scheduler
Company: Together AI
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design an object-oriented, GPU-aware pod scheduler and cluster manager. Each Node has the shape {name: string, total_gpu: int, running_pods: Pod[]}. Each Pod has the shape {name: string, gpu_required: int}.

Implement these APIs: add_node(name, total_gpu), remove_node(name), add_pod(name, gpu_required), schedule_pod(pod_name), remove_pod(pod_name), get_node_utilization(name), and list_nodes()/list_pods(). schedule_pod assigns the pod to a node with enough free GPUs.

Specify data structures that support efficient lookup of nodes by available GPUs and of pods by name. Describe and justify a placement strategy (e.g., best-fit or first-fit) and explain how you would update the indexes on every add, remove, schedule, and evict operation.

Discuss concurrency control (simultaneous adds and schedules), idempotency, and failure handling (e.g., removing a node that still has running pods, and rescheduling pods when a node is removed).

Provide the time and space complexity of each API and write pseudocode for schedule_pod using your chosen strategy. Include edge cases such as a pod whose gpu_required exceeds the total_gpu of every node, and fragmentation when multiple small pods occupy a large node.
Quick Answer: This question evaluates object-oriented system design, resource-aware scheduling algorithms, data structures and indexes for efficient lookups, concurrency control and failure handling, plus time/space complexity reasoning for placing GPU-requesting pods on nodes.
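One possible shape for an answer is sketched below in Python. It is a minimal illustration under stated assumptions, not a reference solution: it picks best-fit placement, guards all state with a single coarse reentrant lock, and keeps a sorted list of (free_gpu, node name) pairs as the index of nodes by available GPUs. The free_gpu field on Node, the node field on Pod, the _by_free index, and the _set_free helper are hypothetical names introduced for the sketch and do not appear in the problem statement. list_nodes() and list_pods() are omitted because they only return the values of the two dictionaries.

```python
import bisect
import threading
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Pod:
    name: str
    gpu_required: int
    node: Optional[str] = None  # assumed field: None while the pod is pending


@dataclass
class Node:
    name: str
    total_gpu: int
    free_gpu: int  # assumed field: cached so utilization is O(1)
    running_pods: Dict[str, Pod] = field(default_factory=dict)


class Scheduler:
    def __init__(self) -> None:
        self._lock = threading.RLock()  # one coarse lock; an assumption
        self.nodes: Dict[str, Node] = {}
        self.pods: Dict[str, Pod] = {}
        self._by_free: List[Tuple[int, str]] = []  # sorted (free_gpu, node name)

    def _set_free(self, node: Node, new_free: int) -> None:
        # Keep the sorted index in step with the node's free-GPU count.
        i = bisect.bisect_left(self._by_free, (node.free_gpu, node.name))
        del self._by_free[i]
        node.free_gpu = new_free
        bisect.insort(self._by_free, (new_free, node.name))

    def add_node(self, name: str, total_gpu: int) -> None:
        with self._lock:
            if name in self.nodes:  # idempotent: a repeated add is a no-op
                return
            self.nodes[name] = Node(name, total_gpu, total_gpu)
            bisect.insort(self._by_free, (total_gpu, name))

    def add_pod(self, name: str, gpu_required: int) -> None:
        with self._lock:
            self.pods.setdefault(name, Pod(name, gpu_required))

    def schedule_pod(self, pod_name: str) -> Optional[str]:
        with self._lock:
            pod = self.pods[pod_name]
            if pod.node is not None:  # already placed: return the same answer
                return pod.node
            # Best fit: first index entry with free_gpu >= gpu_required.
            i = bisect.bisect_left(self._by_free, (pod.gpu_required, ""))
            if i == len(self._by_free):
                return None  # no node fits; the pod stays pending
            node = self.nodes[self._by_free[i][1]]
            self._set_free(node, node.free_gpu - pod.gpu_required)
            node.running_pods[pod.name] = pod
            pod.node = node.name
            return node.name

    def remove_pod(self, pod_name: str) -> None:
        with self._lock:
            pod = self.pods.pop(pod_name, None)
            if pod is None or pod.node is None:
                return
            node = self.nodes[pod.node]
            del node.running_pods[pod.name]
            self._set_free(node, node.free_gpu + pod.gpu_required)

    def remove_node(self, name: str) -> List[str]:
        # Evict the node's pods, try to place each elsewhere, and
        # return the names of pods that could not be rescheduled.
        with self._lock:
            node = self.nodes.pop(name, None)
            if node is None:
                return []
            self._by_free.remove((node.free_gpu, name))
            evicted = list(node.running_pods.values())
            for pod in evicted:
                pod.node = None
            return [p.name for p in evicted if self.schedule_pod(p.name) is None]

    def get_node_utilization(self, name: str) -> float:
        with self._lock:
            node = self.nodes[name]
            used = node.total_gpu - node.free_gpu
            return used / node.total_gpu if node.total_gpu else 0.0
```

In this sketch, lookups by pod or node name are O(1) on average through the dictionaries. Finding the best-fit node is an O(log N) binary search over N nodes, but inserting into or deleting from a Python list is O(N), so each index update is O(N) overall; swapping the list for a balanced tree or skip list would bring schedule_pod, remove_pod, and add_node down to O(log N). Space is O(N + P) for N nodes and P pods. A pod that needs more GPUs than any node has free simply stays pending (schedule_pod returns None), and best-fit limits fragmentation by filling the tightest node first so that large nodes keep contiguous capacity for large pods.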