Design the GPU Job Scheduler for a Text-to-Video Generation Service

Last updated: Jul 1, 2026

Quick Overview

This system design question evaluates a candidate's ability to design a resource-scheduling layer for scarce, expensive compute under sustained overload. It tests queueing theory, priority scheduling with fairness guarantees, and preemption mechanics, and is commonly asked to assess practical distributed-systems and resource-management thinking beyond textbook architecture patterns.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Hints

Part 1: Job intake, lifecycle, and high-level architecture

```hint Where to start
Decouple submission from execution with a durable queue: accept fast, return a `job_id`, and let a scheduler match queued jobs to free GPUs. Persist job state so a scheduler or worker crash never loses a job.
```

```hint Long jobs
Because jobs run for minutes, design for status streaming (poll or push) and for worker heartbeats/leases so a dead worker's job can be detected and rescheduled.
```

Part 2: Multi-tier scheduling and fairness

```hint Policy choice
Strict priority alone starves the free tier under load. Reach for weighted fair queuing / a deficit or virtual-time scheme, or priority with reserved capacity floors per tier, plus aging so long-waiting low-tier jobs eventually rise.
```

```hint Admission control
Enforce per-tier and per-customer quotas/rate limits at admission so the scheduler is never asked to be fair across a flood that should have been shed or throttled upstream.
```

Part 3: Preemption and GPU resource control

```hint Preemption mechanics
Pick victims by lowest priority and least lost work (e.g. fewest completed steps / nearest to a checkpoint). Checkpoint at coarse boundaries so a preempted job resumes instead of restarting, and cap how often a job can be preempted to avoid livelock.
```

```hint Packing and fragmentation
Treat allocation as bin-packing GPUs to jobs; co-schedule the GPUs of a multi-GPU job together (gang scheduling) so a job never holds GPUs while waiting for the rest, which would waste capacity and risk deadlock.
```

```hint Don't thrash
Add hysteresis: only preempt when the priority gap and expected wait justify the lost work, and prefer draining/queuing when a job will finish soon.
```

Jun 26, 2026, 12:00 AM
Design the GPU Job Scheduler for a Text-to-Video Generation Service

You are designing the backend that runs generation jobs for a large text-to-video model service. A user submits a prompt, and the system runs an expensive GPU inference job that produces a short video. Each job takes from tens of seconds to several minutes and may occupy one or more high-end GPUs for its entire duration. The fleet of GPUs is finite and expensive, and demand routinely exceeds capacity.

Users belong to different tiers (for example free, paid, and enterprise/API), each with different latency expectations, quotas, and priority. Your job is to design the job-scheduling and GPU-resource-control layer: how requests are admitted and queued, how the scheduler allocates scarce GPUs across tiers fairly, how higher-priority work preempts lower-priority work, and how the system stays efficient and reliable under sustained overload.

This is a scheduling and resource-management problem. The video model itself is a black box that you call; do not design the model.

Constraints & Assumptions

  • Jobs are long-running (≈20 s to ≈5 min) and GPU-bound; a job may need 1-8 GPUs and runs to completion on the GPUs it is assigned.
  • Fleet is on the order of thousands of GPUs across regions; capacity is the binding constraint and demand is bursty.
  • Tiers: enterprise (tight SLA, highest priority), paid (best-effort with a target latency), free (lowest priority, may be heavily delayed or shed).
  • Submission is asynchronous: the client gets a job_id immediately and then polls or receives a push when the video is ready.
  • A job can be checkpointed at coarse boundaries (e.g. diffusion steps) at some cost, or restarted from scratch if preempted.
  • Optimize for high GPU utilization and tier-appropriate latency while never starving a tier indefinitely.
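
To make "capacity is the binding constraint" concrete, a back-of-envelope bound can help. The symbols below are illustrative assumptions, not values from the question: N is the number of GPUs, \bar{g} the mean GPUs per job, \bar{s} the mean job runtime in seconds, and \lambda the job arrival rate.

```latex
\lambda_{\max} = \frac{N}{\bar{g}\,\bar{s}}, \qquad
\rho = \frac{\lambda\,\bar{g}\,\bar{s}}{N}
% Illustrative numbers (assumed): N = 2000, \bar{g} = 2, \bar{s} = 120\,\text{s}
% \lambda_{\max} = \frac{2000}{2 \times 120} \approx 8.3 \text{ jobs/s}
```

When the offered load pushes \rho above 1 for a sustained period, queues grow without bound, so the design needs shedding, throttling, or backpressure rather than queueing alone.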

Clarifying Questions to Ask

  • What are the concrete per-tier SLOs (e.g. enterprise p95 start-time, free best-effort) and per-tier quotas / rate limits?
  • Is preemption acceptable for paid jobs, or only for free jobs? Must a preempted job resume from a checkpoint, or is restart acceptable?
  • Is a single job confined to one node/region, or can it span nodes? How homogeneous is the GPU fleet (one SKU vs. mixed)?
  • What is the desired behavior under sustained overload — queue with a visible wait, shed free traffic, or apply backpressure to clients?
  • Are results cacheable/deduplicable (identical prompt + params), and is there a cost/credit budget per tier to enforce?
  • How important is fairness within a tier (one enterprise customer must not monopolize capacity)?

Part 1: Job intake, lifecycle, and high-level architecture

Design the request path and the lifecycle of a long-running generation job. Cover how a submission is accepted and acknowledged asynchronously, where jobs wait, how they are dispatched to GPU workers, and how results and status are returned. Define the job state machine and the major components.
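
One way to pin down the job state machine is an explicit transition table. The sketch below is a minimal illustration under assumptions: the state names and the allowed moves (for example `RUNNING` back to `QUEUED` when a worker's lease expires) are hypothetical choices, not something the question prescribes.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PREEMPTED = "preempted"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed transitions; anything not listed here is rejected.
# RUNNING -> QUEUED models a requeue after a lost worker (assumed behavior).
TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELLED},
    JobState.RUNNING: {
        JobState.SUCCEEDED,
        JobState.FAILED,
        JobState.PREEMPTED,
        JobState.QUEUED,
        JobState.CANCELLED,
    },
    JobState.PREEMPTED: {JobState.QUEUED, JobState.CANCELLED},
    JobState.SUCCEEDED: set(),  # terminal
    JobState.FAILED: set(),     # terminal
    JobState.CANCELLED: set(),  # terminal
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting unlisted transitions in one place makes duplicate or late status updates (for example a completion report from a worker whose job was already requeued) fail loudly instead of corrupting job state.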

Part 2: Multi-tier scheduling and fairness

Design the scheduling policy that decides which queued job runs next when a GPU frees up. The policy must respect tier priority and per-tier SLOs while preventing any tier — or any single customer within a tier — from being starved or from monopolizing the fleet. Define the queue structure and the selection algorithm.

Clarifying Questions for this Part

  • Should each tier have a guaranteed capacity floor and a burst ceiling, or pure relative priority weights?
  • Is the fairness unit the tier, the customer, or the individual job, and over what time window is fairness measured?
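
One candidate selection algorithm for this part is weighted fair queuing across tiers. The sketch below uses a self-clocked variant, where virtual time follows the finish tag of the last dispatched job. The class name, the tier weights, and the use of estimated GPU-seconds as `cost` are all assumptions for illustration; per-customer fairness inside a tier could be layered on by nesting the same structure.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Self-clocked weighted fair queue keyed by tier (illustrative sketch)."""

    def __init__(self, weights):
        self.weights = dict(weights)   # tier -> relative share (assumed values)
        self.virtual_time = 0.0        # finish tag of the last dispatched job
        self.last_finish = {tier: 0.0 for tier in self.weights}
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO among equal tags

    def enqueue(self, tier, job_id, cost):
        # A job starts after the tier's previous job, but never in the past.
        start = max(self.virtual_time, self.last_finish[tier])
        finish = start + cost / self.weights[tier]
        self.last_finish[tier] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tier, job_id))

    def dequeue(self):
        """Pop the job with the smallest finish tag, or None if empty."""
        if not self._heap:
            return None
        finish, _, tier, job_id = heapq.heappop(self._heap)
        self.virtual_time = finish
        return tier, job_id
```

With weights of 6, 3, and 1 for enterprise, paid, and free, and equal-cost jobs backlogged in every tier, dispatches are shared roughly in a 6:3:1 ratio, so the free tier keeps making progress instead of starving. Because an idle tier's next job starts at the current virtual time, a tier cannot bank unused share and later burst past the others.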

Part 3: Preemption and GPU resource control

Design preemption and low-level GPU allocation. When a high-priority job arrives and the fleet is full, the scheduler must free GPUs by preempting lower-priority running jobs. Specify which jobs are eligible, how a job is preempted (checkpoint-and-resume vs. kill-and-requeue), how the freed GPUs are reclaimed and reassigned, and how you keep utilization high (packing multi-GPU jobs, avoiding fragmentation) while bounding wasted work.
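
Victim selection on a single node can be sketched as a greedy, all-or-nothing choice. This is a heuristic under stated assumptions, not a definitive policy: the `RunningJob` fields, the numeric priority scale, and the `max_preemptions` cap are hypothetical, and the function assumes the incoming job's GPUs must all come from the same node.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningJob:
    job_id: str
    priority: int        # higher number = more important tier (assumed scale)
    gpus: int
    lost_work_s: float   # seconds of work since the last checkpoint
    preempt_count: int = 0


def pick_victims(
    running: List[RunningJob],
    free_gpus: int,
    needed_gpus: int,
    incoming_priority: int,
    max_preemptions: int = 2,
) -> Optional[List[RunningJob]]:
    """Return jobs to preempt on this node, [] if none are needed,
    or None if even preempting every eligible job would not free enough GPUs."""
    if free_gpus >= needed_gpus:
        return []
    # Only strictly lower-priority jobs that have not hit the preemption cap.
    eligible = [
        j for j in running
        if j.priority < incoming_priority and j.preempt_count < max_preemptions
    ]
    # Lowest priority first, then the job that loses the least work.
    eligible.sort(key=lambda j: (j.priority, j.lost_work_s))
    victims, freed = [], free_gpus
    for job in eligible:
        victims.append(job)
        freed += job.gpus
        if freed >= needed_gpus:
            return victims
    return None  # all-or-nothing: never preempt if the gang still cannot fit
```

Returning `None` instead of a partial list keeps the scheduler from destroying work without being able to start the incoming job. The greedy order can free more GPUs than strictly needed; a fuller design might search for a tighter fit and add a hysteresis check on the expected wait before preempting at all.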

Follow-up Questions

  • A single enterprise customer suddenly submits 10,000 jobs. How do you protect both other enterprise customers and the lower tiers, and what does the offending customer experience?
  • How would you add autoscaling of the GPU fleet, and how do scale-up latency (minutes to provision a GPU node) and cost change your queuing and preemption decisions?
  • How do you handle a heterogeneous fleet (multiple GPU SKUs) and jobs that only fit on certain hardware?
  • Identical prompts and parameters recur. How would you add result caching/dedup safely, and what are the correctness and privacy considerations?
  • How do you keep the scheduler itself highly available and consistent — what happens when the scheduler crashes mid-decision, and how do you avoid double-dispatching a job to two workers?
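
For the double-dispatch follow-up, one common approach is a time-bounded lease combined with a fencing token that increases on every claim. The sketch below is an in-memory stand-in only: `JobStore`, its method names, and the 30-second lease are hypothetical, and it assumes the real store can perform each method as a single atomic conditional write.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class JobRecord:
    state: str = "queued"
    owner: Optional[str] = None
    lease_expires_at: float = 0.0
    fence: int = 0  # bumped on every successful claim


class JobStore:
    """In-memory stand-in for a durable store with atomic conditional writes."""

    def __init__(self) -> None:
        self.jobs: Dict[str, JobRecord] = {}

    def submit(self, job_id: str) -> None:
        self.jobs[job_id] = JobRecord()

    def try_claim(self, job_id: str, worker: str, now: float,
                  lease_s: float = 30.0) -> Optional[int]:
        """Claim a queued job or one whose lease expired; return a fencing token."""
        job = self.jobs[job_id]
        lease_live = job.state == "running" and now < job.lease_expires_at
        if job.state not in ("queued", "running") or lease_live:
            return None
        job.state, job.owner = "running", worker
        job.lease_expires_at = now + lease_s
        job.fence += 1
        return job.fence

    def heartbeat(self, job_id: str, fence: int, now: float,
                  lease_s: float = 30.0) -> bool:
        """Extend the lease only for the current holder's token."""
        job = self.jobs[job_id]
        if job.state != "running" or job.fence != fence:
            return False
        job.lease_expires_at = now + lease_s
        return True

    def complete(self, job_id: str, fence: int) -> bool:
        """Accept a result only from the worker holding the latest token."""
        job = self.jobs[job_id]
        if job.state != "running" or job.fence != fence:
            return False
        job.state = "succeeded"
        return True
```

If a scheduler crashes after claiming but before the worker starts, the lease simply expires and another claim succeeds with a higher token. A slow original worker that later tries to heartbeat or complete with the old token is rejected, so at most one result is ever accepted per job.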
