Design the GPU Job Scheduler for a Text-to-Video Generation Service
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
# Design the GPU Job Scheduler for a Text-to-Video Generation Service
You are designing the backend that runs generation jobs for a large text-to-video model service. A user submits a prompt, and the system runs an expensive GPU inference job that produces a short video. Each job takes from tens of seconds to several minutes and may occupy one or more high-end GPUs for its entire duration. The fleet of GPUs is finite and expensive, and demand routinely exceeds capacity.
Users belong to different **tiers** (for example free, paid, and enterprise/API), each with different latency expectations, quotas, and priority. Your job is to design the job-scheduling and GPU-resource-control layer: how requests are admitted and queued, how the scheduler allocates scarce GPUs across tiers fairly, how higher-priority work **preempts** lower-priority work, and how the system stays efficient and reliable under sustained overload.
This is a scheduling and resource-management problem. The video model itself is a black box that you call; do not design the model.
### Constraints & Assumptions
- Jobs are long-running (≈20 s to ≈5 min) and GPU-bound; a job may need 1 to 8 GPUs and, unless preempted, runs to completion on the GPUs it is assigned.
- Fleet is on the order of thousands of GPUs across regions; capacity is the binding constraint and demand is bursty.
- Tiers: enterprise (tight SLA, highest priority), paid (best-effort with a target latency), free (lowest priority, may be heavily delayed or shed).
- Submission is asynchronous: the client gets a `job_id` immediately and then polls or receives a push when the video is ready.
- A job can be checkpointed at coarse boundaries (e.g. diffusion steps) at some cost, or restarted from scratch if preempted.
- Optimize for high GPU utilization and tier-appropriate latency while never starving a tier indefinitely.
### Clarifying Questions to Ask
- What are the concrete per-tier SLOs (e.g. enterprise p95 start-time, free best-effort) and per-tier quotas / rate limits?
- Is preemption acceptable for paid jobs, or only for free jobs? Must a preempted job resume from a checkpoint, or is restart acceptable?
- Is a single job confined to one node/region, or can it span nodes? How homogeneous is the GPU fleet (one SKU vs. mixed)?
- What is the desired behavior under sustained overload — queue with a visible wait, shed free traffic, or apply backpressure to clients?
- Are results cacheable/deduplicable (identical prompt + params), and is there a cost/credit budget per tier to enforce?
- How important is fairness *within* a tier (for example, preventing one enterprise customer from monopolizing capacity)?
### Part 1: Job intake, lifecycle, and high-level architecture
Design the request path and the lifecycle of a long-running generation job. Cover how a submission is accepted and acknowledged asynchronously, where jobs wait, how they are dispatched to GPU workers, and how results and status are returned. Define the job state machine and the major components.
```hint Where to start
Decouple submission from execution with a durable queue: accept fast, return a `job_id`, and let a scheduler match queued jobs to free GPUs. Persist job state so a scheduler or worker crash never loses a job.
```
```hint Long jobs
Because jobs run for minutes, design for status streaming (poll or push) and for worker heartbeats/leases so a dead worker's job can be detected and rescheduled.
```
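As a rough illustration of the two hints above, the sketch below models a job state machine with worker leases. The state names, the `LEASE_SECONDS` value, and the helper functions are all hypothetical and chosen for the example; a real design would persist this state in a durable store rather than in memory.

```python
import enum
from dataclasses import dataclass
from typing import List, Optional


class JobState(enum.Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PREEMPTED = "preempted"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical transition table; a real system may add states such as CANCELLED.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED,
                       JobState.PREEMPTED, JobState.QUEUED},
    JobState.PREEMPTED: {JobState.QUEUED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}

LEASE_SECONDS = 30.0  # assumed lease length, renewed by each heartbeat


@dataclass
class Job:
    job_id: str
    state: JobState = JobState.QUEUED
    worker_id: Optional[str] = None
    lease_expires_at: float = 0.0

    def transition(self, new_state: JobState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


def dispatch(job: Job, worker_id: str, now: float) -> None:
    """Hand a queued job to a worker and start its lease."""
    job.transition(JobState.RUNNING)
    job.worker_id = worker_id
    job.lease_expires_at = now + LEASE_SECONDS


def heartbeat(job: Job, worker_id: str, now: float) -> bool:
    """Renew the lease; reject heartbeats from a worker that no longer owns the job."""
    if job.state is not JobState.RUNNING or job.worker_id != worker_id:
        return False
    job.lease_expires_at = now + LEASE_SECONDS
    return True


def reap_expired(jobs: List[Job], now: float) -> List[str]:
    """Requeue running jobs whose worker stopped heartbeating."""
    requeued = []
    for job in jobs:
        if job.state is JobState.RUNNING and now > job.lease_expires_at:
            job.transition(JobState.QUEUED)
            job.worker_id = None
            requeued.append(job.job_id)
    return requeued
```

Passing `now` explicitly keeps the sketch deterministic; a production reaper would read a clock and run periodically.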
### Part 2: Multi-tier scheduling and fairness
Design the scheduling policy that decides which queued job runs next when a GPU frees up. The policy must respect tier priority and per-tier SLOs while preventing any tier — or any single customer within a tier — from being starved or from monopolizing the fleet. Define the queue structure and the selection algorithm.
```hint Policy choice
Strict priority alone starves the free tier under load. Reach for weighted fair queuing / a deficit or virtual-time scheme, or priority with reserved capacity floors per tier, plus aging so long-waiting low-tier jobs eventually rise.
```
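One possible shape for the weighted-fair option in the hint is a per-tier virtual-time scheduler, sketched below. The class name, the tier weights, and the use of estimated GPU-seconds as the job cost are assumptions made for illustration; aging and capacity floors are not modeled here.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class WeightedFairScheduler:
    """Pick the backlogged tier with the smallest virtual time.

    Serving a job advances its tier's virtual time by cost / weight, so a
    tier with a higher weight is served proportionally more often, while a
    low-weight tier is still served eventually.
    """

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights = dict(weights)
        self.queues: Dict[str, Deque[Tuple[str, float]]] = {
            t: deque() for t in weights
        }
        self.vtime: Dict[str, float] = {t: 0.0 for t in weights}

    def submit(self, tier: str, job_id: str, gpu_seconds: float) -> None:
        queue = self.queues[tier]
        if not queue:
            # A tier returning from idle must not spend credit it
            # accumulated while it had nothing to run.
            active = [self.vtime[t] for t, q in self.queues.items() if q]
            if active:
                self.vtime[tier] = max(self.vtime[tier], min(active))
        queue.append((job_id, gpu_seconds))

    def next_job(self) -> Optional[Tuple[str, str]]:
        candidates = [t for t, q in self.queues.items() if q]
        if not candidates:
            return None
        tier = min(candidates, key=lambda t: self.vtime[t])
        job_id, cost = self.queues[tier].popleft()
        self.vtime[tier] += cost / self.weights[tier]
        return tier, job_id
```

With hypothetical weights of 6, 3, and 1 for enterprise, paid, and free, and equal-cost jobs backlogged in every tier, the first ten dispatches split 6 / 3 / 1, so the free tier keeps making progress instead of starving.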
```hint Admission control
Enforce per-tier and per-customer quotas/rate limits at admission so the scheduler is never asked to be fair across a flood that should have been shed or throttled upstream.
```
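Per-customer rate limiting at admission is often done with a token bucket. The sketch below is one minimal way to do it; the `AdmissionController` name and the `(rate, burst)` limits per tier are hypothetical values picked for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate: float        # tokens added per second
    capacity: float    # maximum burst size
    tokens: float
    updated_at: float

    def try_take(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class AdmissionController:
    """One bucket per (tier, customer); limits come from the tier."""

    def __init__(self, tier_limits: Dict[str, Tuple[float, float]]) -> None:
        self.tier_limits = tier_limits  # tier -> (rate per second, burst)
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def admit(self, tier: str, customer_id: str, now: float) -> bool:
        key = (tier, customer_id)
        if key not in self.buckets:
            rate, burst = self.tier_limits[tier]
            self.buckets[key] = TokenBucket(rate, burst, burst, now)
        return self.buckets[key].try_take(now)
```

A rejected submission would typically map to a throttling response with a retry hint, so the flood is shed before it reaches the scheduler's queues.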
#### Clarifying Questions for this Part
- Should each tier have a guaranteed capacity floor and a burst ceiling, or pure relative priority weights?
- Is the fairness unit the tier, the customer, or the individual job, and over what time window is fairness measured?
### Part 3: Preemption and GPU resource control
Design preemption and low-level GPU allocation. When a high-priority job arrives and the fleet is full, the scheduler must free GPUs by preempting lower-priority running jobs. Specify which jobs are eligible, how a job is preempted (checkpoint-and-resume vs. kill-and-requeue), how the freed GPUs are reclaimed and reassigned, and how you keep utilization high (packing multi-GPU jobs, avoiding fragmentation) while bounding wasted work.
```hint Preemption mechanics
Pick victims by lowest priority first, then by least lost work (e.g. fewest completed steps, or fewest steps since the last checkpoint). Checkpoint at coarse boundaries so a preempted job resumes instead of restarting, and cap how often a job can be preempted to avoid livelock.
```
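The victim-selection rule in the hint can be sketched as a sort followed by a greedy pick. Everything here is illustrative: the `RunningJob` fields, the numeric priorities (higher number means higher priority), and `MAX_PREEMPTIONS` are assumptions, and the sketch ignores GPU locality (whether the freed GPUs sit on the same node).

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_PREEMPTIONS = 2  # assumed cap so a job cannot be preempted forever


@dataclass
class RunningJob:
    job_id: str
    priority: int                    # higher number = higher priority
    gpus: int
    seconds_since_checkpoint: float  # work lost if preempted now
    preemptions: int = 0


def pick_victims(running: List[RunningJob],
                 incoming_priority: int,
                 gpus_needed: int) -> Optional[List[str]]:
    """Return job ids to preempt, or None if not enough GPUs can be freed."""
    eligible = [
        j for j in running
        if j.priority < incoming_priority and j.preemptions < MAX_PREEMPTIONS
    ]
    # Lowest priority first; within a priority, least lost GPU-seconds first.
    eligible.sort(key=lambda j: (j.priority, j.seconds_since_checkpoint * j.gpus))

    victims: List[str] = []
    freed = 0
    for job in eligible:
        if freed >= gpus_needed:
            break
        victims.append(job.job_id)
        freed += job.gpus
    return victims if freed >= gpus_needed else None
```

Returning `None` instead of a partial list keeps the decision all-or-nothing: preempting some jobs without freeing enough GPUs would waste work for no benefit.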
```hint Packing and fragmentation
Treat allocation as bin-packing jobs onto GPU nodes; allocate all the GPUs of a multi-GPU job together (gang scheduling) so a job never holds some GPUs while waiting for the rest, which would waste capacity and risk deadlock.
```
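A minimal sketch of all-or-nothing (gang) placement with a best-fit heuristic follows. It assumes, purely for illustration, that a job fits on a single node and that the scheduler tracks free GPUs per node in a dictionary; `place_gang` is a hypothetical helper name.

```python
from typing import Dict, Optional


def place_gang(free_gpus: Dict[str, int], gpus_needed: int) -> Optional[str]:
    """Reserve all GPUs for a job on one node, or reserve nothing.

    Best fit: choose the node with the fewest free GPUs that still fits,
    which leaves larger holes intact for future multi-GPU jobs.
    """
    fitting = [(free, node) for node, free in free_gpus.items()
               if free >= gpus_needed]
    if not fitting:
        return None  # no partial hold: the job keeps waiting with zero GPUs
    free, node = min(fitting)
    free_gpus[node] = free - gpus_needed
    return node
```

For example, with 8, 3, and 4 free GPUs on three nodes, a 3-GPU job lands on the node with exactly 3 free, which keeps the fully free node available for an 8-GPU job.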
```hint Don't thrash
Add hysteresis: only preempt when the priority gap and expected wait justify the lost work, and prefer draining/queuing when a job will finish soon.
```
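The hysteresis idea can be expressed as a small decision function. The thresholds below (`min_gap`, `drain_threshold_s`, `wait_to_loss_ratio`) are hypothetical tuning knobs invented for this sketch, not values from any real system.

```python
def should_preempt(priority_gap: int,
                   expected_wait_s: float,
                   victim_remaining_s: float,
                   lost_work_s: float,
                   min_gap: int = 1,
                   drain_threshold_s: float = 15.0,
                   wait_to_loss_ratio: float = 2.0) -> bool:
    """Decide whether preempting a victim is worth the wasted work.

    priority_gap:       incoming priority minus victim priority
    expected_wait_s:    how long the incoming job would wait with no preemption
    victim_remaining_s: estimated time until the victim finishes on its own
    lost_work_s:        victim work discarded (time since its last checkpoint)
    """
    if priority_gap < min_gap:
        return False  # not enough priority difference to justify disruption
    if victim_remaining_s <= drain_threshold_s:
        return False  # victim finishes soon: let it drain instead
    # Only preempt when the wait avoided clearly exceeds the work thrown away.
    return expected_wait_s >= wait_to_loss_ratio * lost_work_s
```

The ratio test is what provides the hysteresis: a small expected wait never triggers preemption of a job that has accumulated a lot of uncheckpointed work.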
### Follow-up Questions
- A single enterprise customer suddenly submits 10,000 jobs. How do you protect both other enterprise customers and the lower tiers, and what does the offending customer experience?
- How would you add autoscaling of the GPU fleet, and how do scale-up latency (minutes to provision a GPU node) and cost change your queuing and preemption decisions?
- How do you handle a heterogeneous fleet (multiple GPU SKUs) and jobs that only fit on certain hardware?
- Identical prompts and parameters recur. How would you add result caching/dedup safely, and what are the correctness and privacy considerations?
- How do you keep the scheduler itself highly available and consistent — what happens when the scheduler crashes mid-decision, and how do you avoid double-dispatching a job to two workers?
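For the double-dispatch question, one common building block is a conditional (compare-and-set) claim on the job row. The sketch below uses an in-memory `JobStore` with a lock as a stand-in for a database row with a version column; the class and method names are hypothetical.

```python
import threading
from typing import Dict, Optional, Tuple

Row = Tuple[str, Optional[str], int]  # (state, owner, version)


class JobStore:
    """In-memory stand-in for a durable table with optimistic versioning."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rows: Dict[str, Row] = {}

    def put_queued(self, job_id: str) -> None:
        with self._lock:
            self._rows[job_id] = ("queued", None, 0)

    def read(self, job_id: str) -> Row:
        with self._lock:
            return self._rows[job_id]

    def claim(self, job_id: str, worker_id: str, expected_version: int) -> bool:
        """Atomically move queued -> running only if nobody changed the row."""
        with self._lock:
            state, _owner, version = self._rows[job_id]
            if state != "queued" or version != expected_version:
                return False  # another scheduler or worker won the race
            self._rows[job_id] = ("running", worker_id, version + 1)
            return True
```

Two schedulers that read the same version can both attempt the claim, but only one succeeds; the loser simply moves on to the next queued job. A scheduler that crashes before calling `claim` leaves the job queued, so nothing is lost.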
Quick Answer: This system design question evaluates a candidate's ability to design a resource-scheduling layer for scarce, expensive compute under sustained overload. It tests queueing theory, priority scheduling with fairness guarantees, and preemption mechanics, and is commonly asked to assess practical distributed-systems and resource-management thinking beyond textbook architecture patterns.