What difficulty level is this interview question?

This is a hard-difficulty System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design the GPU Job Scheduler for a Text-to-Video Generation Service

This system design question evaluates a candidate's ability to design a resource-scheduling layer for scarce, expensive compute under sustained overload. It tests queueing theory, priority scheduling with fairness guarantees, and preemption mechanics, and it is used to assess practical distributed-systems and resource-management thinking beyond textbook architecture patterns.

You are designing the backend that runs generation jobs for a large text-to-video model service. A user submits a prompt, and the system runs an expensive GPU inference job that produces a short video. Each job takes from tens of seconds to several minutes and may occupy one or more high-end GPUs for its entire duration. The fleet of GPUs is finite and expensive, and demand routinely exceeds capacity.

Users belong to different tiers (for example free, paid, and enterprise/API), each with different latency expectations, quotas, and priority. Your job is to design the job-scheduling and GPU-resource-control layer: how requests are admitted and queued, how the scheduler allocates scarce GPUs across tiers fairly, how higher-priority work preempts lower-priority work, and how the system stays efficient and reliable under sustained overload.

This is a scheduling and resource-management problem. The video model itself is a black box that you call; do not design the model.

Constraints & Assumptions

Jobs are long-running (≈20 s to ≈5 min) and GPU-bound; a job may need 1 to 8 GPUs and, unless preempted, runs to completion on the GPUs it is assigned.
Fleet is on the order of thousands of GPUs across regions; capacity is the binding constraint and demand is bursty.
Tiers: enterprise (tight SLA, highest priority), paid (best-effort with a target latency), free (lowest priority, may be heavily delayed or shed).
Submission is asynchronous: the client gets a job_id immediately and then polls or receives a push when the video is ready.
A job can be checkpointed at coarse boundaries (e.g. diffusion steps) at some cost, or restarted from scratch if preempted.
Optimize for high GPU utilization and tier-appropriate latency while never starving a tier indefinitely.
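
As a rough sanity check on these constraints, the sketch below estimates how many jobs per second a fleet can sustain and how quickly a backlog grows when arrivals exceed that rate. It is a back-of-envelope calculation based on Little's law, not a model of the real service; the function names and every number passed to them are hypothetical assumptions, since the prompt gives only orders of magnitude.

```python
def fleet_throughput_jobs_per_sec(total_gpus, avg_gpus_per_job, avg_job_seconds,
                                  utilization=1.0):
    """Upper bound on sustained job completions per second.

    Little's law: concurrency = throughput * duration, so
    throughput = (usable GPUs / GPUs per job) / job duration.
    """
    concurrent_jobs = total_gpus * utilization / avg_gpus_per_job
    return concurrent_jobs / avg_job_seconds


def backlog_growth_per_min(arrival_per_sec, service_per_sec):
    """Jobs added to the queue per minute while arrivals exceed service rate."""
    return max(0.0, arrival_per_sec - service_per_sec) * 60


if __name__ == "__main__":
    # Hypothetical inputs: 2000 GPUs, 2 GPUs per job, 100 s per job, 90% utilization.
    service = fleet_throughput_jobs_per_sec(2000, 2, 100, utilization=0.9)
    print("sustainable jobs/s:", service)
    print("backlog growth per minute at 12 jobs/s:", backlog_growth_per_min(12, service))
```

The point of the exercise is that under sustained overload the queue grows without bound unless the design sheds load, rate-limits, or adds capacity, which is why admission control and tier policy matter as much as the dispatch loop.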

Clarifying Questions to Ask

What are the concrete per-tier SLOs (e.g. enterprise p95 start-time, free best-effort) and per-tier quotas / rate limits?
Is preemption acceptable for paid jobs, or only for free jobs? Must a preempted job resume from a checkpoint, or is restart acceptable?
Is a single job confined to one node/region, or can it span nodes? How homogeneous is the GPU fleet (one SKU vs. mixed)?
What is the desired behavior under sustained overload — queue with a visible wait, shed free traffic, or apply backpressure to clients?
Are results cacheable/deduplicable (identical prompt + params), and is there a cost/credit budget per tier to enforce?
How important is fairness within a tier (for example, must one enterprise customer be prevented from monopolizing capacity)?
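
The questions about quotas, rate limits, and overload behavior can be made concrete with a small admission-control sketch. It assumes one possible answer: a per-customer token bucket for rate limiting plus a per-tier queue-depth cap for shedding. The class `TokenBucket`, the function `admit`, and the returned status strings are all hypothetical names chosen for illustration; time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Per-customer rate limiter: `rate_per_s` sustained, `burst` maximum."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admit(tier, bucket, queue_depth, max_depth, now):
    """Return 'accepted', 'rate_limited', or 'shed' for one submission.

    max_depth maps tier name -> maximum queued jobs before shedding (assumed policy).
    """
    if queue_depth >= max_depth[tier]:
        return "shed"            # queue full for this tier: reject early
    if not bucket.allow(now):
        return "rate_limited"    # customer exceeded its submission rate
    return "accepted"
```

Checking queue depth before spending a token means a shed request does not also count against the customer's rate limit; whether that is the desired behavior is itself a policy choice worth raising with the interviewer.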

Part 1: Job intake, lifecycle, and high-level architecture

Design the request path and the lifecycle of a long-running generation job. Cover how a submission is accepted and acknowledged asynchronously, where jobs wait, how they are dispatched to GPU workers, and how results and status are returned. Define the job state machine and the major components.
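
One way to start on the job state machine is to write the states and the legal transitions down explicitly. The sketch below is an assumed design, not a reference answer: the state names, the `ALLOWED` table, and the `transition` helper are hypothetical, and a real design might add states such as a checkpointing or uploading phase.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    DISPATCHED = "dispatched"   # assigned to a worker, not yet confirmed running
    RUNNING = "running"
    PREEMPTED = "preempted"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = JobState
ALLOWED = {
    S.QUEUED: {S.DISPATCHED, S.CANCELLED, S.FAILED},
    # Back to QUEUED covers a worker that dies before the job starts.
    S.DISPATCHED: {S.RUNNING, S.QUEUED, S.FAILED, S.CANCELLED},
    S.RUNNING: {S.SUCCEEDED, S.FAILED, S.PREEMPTED, S.CANCELLED},
    # A preempted job is requeued (from a checkpoint or from scratch).
    S.PREEMPTED: {S.QUEUED, S.CANCELLED},
    S.SUCCEEDED: set(),         # terminal
    S.FAILED: set(),            # terminal
    S.CANCELLED: set(),         # terminal
}


def transition(current, new):
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Keeping the table in one place makes it easy to check that every non-terminal state has a path to a terminal one, and that status returned to a polling client is always one of a small, documented set.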

Part 2: Multi-tier scheduling and fairness

Design the scheduling policy that decides which queued job runs next when a GPU frees up. The policy must respect tier priority and per-tier SLOs while preventing any tier — or any single customer within a tier — from being starved or from monopolizing the fleet. Define the queue structure and the selection algorithm.

Clarifying Questions for this Part

Should each tier have a guaranteed capacity floor and a burst ceiling, or pure relative priority weights?
Is the fairness unit the tier, the customer, or the individual job, and over what time window is fairness measured?
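
To make the floor-versus-weights question concrete, here is a minimal sketch of one possible selection policy. It assumes each tier has a guaranteed GPU floor and a weight for sharing spare capacity, and that fairness within a tier means picking the customer currently holding the fewest GPUs. The `Tier` class, its fields, and `pick_next` are hypothetical; the sketch ignores SLO deadlines, aging, and backfill, all of which a fuller answer would address.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    floor_gpus: int          # guaranteed capacity (assumed policy knob)
    weight: float            # relative share of capacity above the floors
    running_gpus: int = 0
    # customer -> FIFO of (job_id, gpus_needed)
    queues: dict = field(default_factory=lambda: defaultdict(deque))
    # customer -> GPUs currently held
    customer_gpus: dict = field(default_factory=lambda: defaultdict(int))

    def has_work(self):
        return any(self.queues.values())


def pick_next(tiers, free_gpus):
    """Pick the next job to start. `tiers` is ordered highest priority first.

    Returns (tier_name, customer, job_id, gpus) or None.
    """
    candidates = [t for t in tiers if t.has_work()]
    if not candidates:
        return None
    # 1. Serve tiers that are below their guaranteed floor, in priority order.
    below_floor = [t for t in candidates if t.running_gpus < t.floor_gpus]
    if below_floor:
        tier = below_floor[0]
    else:
        # 2. Otherwise share spare capacity in proportion to weight.
        tier = min(candidates, key=lambda t: t.running_gpus / t.weight)
    # 3. Within the tier, favor the customer holding the fewest GPUs.
    customer = min((c for c, q in tier.queues.items() if q),
                   key=lambda c: tier.customer_gpus[c])
    job_id, gpus = tier.queues[customer][0]
    if gpus > free_gpus:
        return None  # head-of-line blocking; a real scheduler would backfill
    tier.queues[customer].popleft()
    tier.running_gpus += gpus
    tier.customer_gpus[customer] += gpus
    return tier.name, customer, job_id, gpus
```

Because the floor check runs before the weighted comparison, a low-priority tier with a nonzero floor keeps making progress even when higher tiers have deep queues, which is one way to meet the requirement that no tier is starved indefinitely. Completion handling, which would decrement `running_gpus` and `customer_gpus`, is omitted for brevity.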

Part 3: Preemption and GPU resource control

Design preemption and low-level GPU allocation. When a high-priority job arrives and the fleet is full, the scheduler must free GPUs by preempting lower-priority running jobs. Specify which jobs are eligible, how a job is preempted (checkpoint-and-resume vs. kill-and-requeue), how the freed GPUs are reclaimed and reassigned, and how you keep utilization high (packing multi-GPU jobs, avoiding fragmentation) while bounding wasted work.
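
The eligibility and wasted-work questions can be illustrated with a small victim-selection sketch. It assumes a single node, that a lower `priority` number means higher priority, and that the cost of preempting a job is the work lost since its last checkpoint (or all elapsed work if it has none). `RunningJob` and `choose_victims` are hypothetical names, and the greedy ordering is a heuristic, not an optimal packing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunningJob:
    job_id: str
    priority: int                        # lower number = higher priority (assumption)
    gpus: int
    elapsed_s: float
    since_checkpoint_s: Optional[float]  # None = no checkpoint, restart from scratch

    @property
    def lost_seconds(self):
        """Work that must be redone if this job is preempted now."""
        if self.since_checkpoint_s is None:
            return self.elapsed_s
        return self.since_checkpoint_s

    @property
    def wasted_gpu_seconds(self):
        return self.lost_seconds * self.gpus


def choose_victims(running, incoming_priority, gpus_needed):
    """Greedily pick strictly lower-priority jobs to free `gpus_needed` GPUs.

    Jobs are taken in order of wasted work per GPU freed. Returns [] when
    the eligible jobs cannot free enough GPUs, so nothing is preempted in vain.
    """
    eligible = sorted((j for j in running if j.priority > incoming_priority),
                      key=lambda j: j.lost_seconds)
    victims, freed = [], 0
    for job in eligible:
        if freed >= gpus_needed:
            break
        victims.append(job)
        freed += job.gpus
    return victims if freed >= gpus_needed else []
```

Returning an empty list when the request cannot be satisfied avoids the worst outcome, where low-priority work is destroyed and the high-priority job still cannot start. A fuller design would also cap how often any one job can be preempted to bound total wasted work.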

Follow-up Questions

A single enterprise customer suddenly submits 10,000 jobs. How do you protect both other enterprise customers and the lower tiers, and what does the offending customer experience?
How would you add autoscaling of the GPU fleet, and how do scale-up latency (minutes to provision a GPU node) and cost change your queuing and preemption decisions?
How do you handle a heterogeneous fleet (multiple GPU SKUs) and jobs that only fit on certain hardware?
Identical prompts and parameters recur. How would you add result caching/dedup safely, and what are the correctness and privacy considerations?
How do you keep the scheduler itself highly available and consistent — what happens when the scheduler crashes mid-decision, and how do you avoid double-dispatching a job to two workers?
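
For the last follow-up, one common approach to avoiding double dispatch is to make claiming a job an atomic conditional update with a time-limited lease. The sketch below is an in-memory stand-in under stated assumptions: `JobStore`, `try_claim`, and `heartbeat` are hypothetical names, the `threading.Lock` stands in for a transactional or compare-and-set write in a real datastore, and time is passed in explicitly to keep the example deterministic.

```python
import threading


class JobStore:
    """In-memory stand-in for a store that supports atomic conditional updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> {"state", "owner", "lease_expires"}

    def add(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"state": "queued", "owner": None,
                                  "lease_expires": 0.0}

    def try_claim(self, job_id, worker_id, now, lease_s=30.0):
        """Claim a job only if it is queued or its previous lease has expired."""
        with self._lock:
            job = self._jobs[job_id]
            claimable = job["state"] == "queued" or (
                job["state"] == "dispatched" and job["lease_expires"] <= now)
            if not claimable:
                return False
            job.update(state="dispatched", owner=worker_id,
                       lease_expires=now + lease_s)
            return True

    def heartbeat(self, job_id, worker_id, now, lease_s=30.0):
        """Extend the lease; fails if the caller no longer owns a live lease."""
        with self._lock:
            job = self._jobs[job_id]
            if job["owner"] != worker_id or job["lease_expires"] <= now:
                return False
            job["lease_expires"] = now + lease_s
            return True
```

With this shape, a scheduler that crashes mid-decision leaves at worst a lease that expires on its own, after which another instance can reclaim the job. A worker whose heartbeat fails should stop and discard its output, since another worker may already own the job.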
