How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts, plus practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design a Text-to-Video Generation System | OpenAI Interview Question

Q: Design a Text-to-Video Generation System

This question evaluates a candidate's competency in designing scalable ML inference platforms, including GPU resource management and scheduling, asynchronous job orchestration and state-machine-driven job lifecycles, failure classification and recovery, artifact storage, rate limiting, and monitoring for quality and latency.

Design a Sora-like text-to-video generation platform.

Users submit a text prompt, optional generation settings (duration, resolution, fps, seed, model variant), and optionally conditioning media such as an init image or reference clip. The system generates a short video and returns a downloadable result once the job is complete. Because a single clip takes seconds to minutes of GPU time, the system is inherently asynchronous: a submit call returns a job handle immediately, and the user polls or receives a webhook when the result is ready.
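
The submit-then-poll contract described above can be sketched as follows. This is a minimal in-memory illustration, not a prescribed API: `JobStore`, `submit`, `get`, the `idempotency_key` parameter, and the status strings are all hypothetical names chosen for the example, and a real system would back them with a database and a queue.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobStore:
    """In-memory stand-in for the job metadata store (hypothetical)."""
    jobs: dict = field(default_factory=dict)         # job_id -> job record
    by_idem_key: dict = field(default_factory=dict)  # (user_id, key) -> job_id

    def submit(self, user_id: str, idempotency_key: str, prompt: str, **settings) -> str:
        """Return a job handle immediately; a retried submit maps to the same job."""
        dedup = (user_id, idempotency_key)
        if dedup in self.by_idem_key:
            return self.by_idem_key[dedup]
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {
            "status": "QUEUED",
            "prompt": prompt,
            "settings": settings,
            "result_url": None,
        }
        self.by_idem_key[dedup] = job_id
        return job_id

    def get(self, job_id: str) -> dict:
        """Poll endpoint: clients call this until the status is terminal."""
        job = self.jobs[job_id]
        return {"job_id": job_id, "status": job["status"], "result_url": job["result_url"]}


store = JobStore()
first = store.submit("user-1", "key-abc", "a cat surfing", duration_s=5)
retry = store.submit("user-1", "key-abc", "a cat surfing", duration_s=5)
print(first == retry, store.get(first)["status"])
```

Scoping the idempotency key per user is one possible choice; it lets a client safely retry a submit after a timeout without creating (or being billed for) a second job.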

Your design should cover the user-facing API and job lifecycle, the high-level service architecture, how GPU inference workers are scheduled, how the system handles unstable workers / crashes / retries / partial failures, how intermediate and final artifacts are stored, how safety / rate limits / quotas are enforced, and how quality / latency / reliability are monitored. The hardest, most-probed areas are (a) the job lifecycle / state machine and (b) failure handling when workers are unstable — go deep on both.
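
One way to make the job lifecycle concrete is an explicit transition table. The sketch below is an assumption-laden example rather than the expected answer: the state names, the `ALLOWED` table, and the `enforce_deadline` helper are hypothetical, and a candidate may reasonably choose different states.

```python
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    POSTPROCESSING = "postprocessing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"
    EXPIRED = "expired"  # per-job deadline exceeded


TERMINAL = {State.SUCCEEDED, State.FAILED, State.CANCELLED, State.EXPIRED}

# Allowed transitions; RUNNING -> QUEUED models a retry after a retryable failure.
ALLOWED = {
    State.QUEUED: {State.RUNNING, State.CANCELLED, State.EXPIRED},
    State.RUNNING: {State.POSTPROCESSING, State.QUEUED, State.FAILED,
                    State.CANCELLED, State.EXPIRED},
    State.POSTPROCESSING: {State.SUCCEEDED, State.FAILED,
                           State.CANCELLED, State.EXPIRED},
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is illegal (terminal states have no exits)."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def enforce_deadline(current: State, submitted_at: float, now: float, deadline_s: float) -> State:
    """Guard against silent hangs: any non-terminal job past its deadline becomes EXPIRED."""
    if current not in TERMINAL and now - submitted_at > deadline_s:
        return State.EXPIRED
    return current


print(transition(State.QUEUED, State.RUNNING).name)
print(enforce_deadline(State.RUNNING, submitted_at=0.0, now=601.0, deadline_s=600.0).name)
```

Because terminal states have no outgoing edges, a late or duplicate update against a finished job is rejected by construction, and the deadline sweep ensures every job eventually lands in one of them.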

Constraints & Assumptions

State your own numbers in the interview — the figures below are illustrative, not benchmarks.

A single job is a short clip (e.g. 5–10 s at up to 720p, 24 fps), taking on the order of tens of seconds to a few minutes of GPU wall-time on one accelerator, roughly 2–4 orders of magnitude slower than a typical web request that completes in tens of milliseconds.
The scarce, expensive resource is the GPU accelerator. API, queue, and database compute are cheap by comparison; the design optimizes for GPU utilization and fairness, not request throughput.
Jobs are long-running and stateful mid-flight. Worker crashes, OOMs, and spot-instance preemptions are routine, not exceptional; fault tolerance is a first-class requirement.
Output is regulated content: both the prompt and the generated frames must pass safety/moderation, and outputs typically require watermarking / provenance.
Assume a target sustained throughput (e.g. ~1 job/s) and multiple user tiers (free / paid / interactive preview) with different priority.
The system is fully async: every job must reach a terminal state and never silently hang.
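
The constraints above imply a back-of-envelope accelerator count: by Little's law, the number of GPUs busy at steady state is arrival rate times per-job GPU time, and dividing by a target utilization leaves headroom. The sketch below assumes one job occupies one accelerator; the function name and the 75% utilization target are illustrative assumptions, not figures from the question.

```python
import math


def accelerators_needed(jobs_per_s: float, gpu_seconds_per_job: float,
                        target_utilization: float) -> int:
    """Steady-state GPU count: (arrival rate x service time) / utilization, rounded up."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(jobs_per_s * gpu_seconds_per_job / target_utilization)


# Sweep the per-job time across the "tens of seconds to a few minutes" range
# at ~1 job/s with an assumed 75% utilization target.
for seconds in (30.0, 60.0, 180.0):
    print(seconds, accelerators_needed(1.0, seconds, 0.75))
```

At 1 job/s and an assumed 60 s per job this gives 80 accelerators, which is the scale that makes GPU utilization, rather than API throughput, the dominant cost driver.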

Clarifying Questions to Ask

What clip lengths, resolutions, and frame rates must we support, and which of those are user-tunable vs fixed per tier?
What is the target throughput and the latency SLA (queue wait vs total time), and how does it differ across free / paid / preview tiers?
What conditioning inputs are in scope (text only, init image, mask, reference clip, audio)?
What are the safety/policy requirements (banned categories, real-person likeness, watermarking, provenance, audit retention)?
What is the retention policy for generated videos and intermediates, and are there data-residency or erasure (GDPR) obligations?
Is billing per-GPU-second, per-job, or subscription — and must we guarantee no double-billing on retries?

What a Strong Answer Covers

Async-first framing: justifies the queue-backed, GPU-scheduled architecture from the inference profile rather than defaulting to a synchronous request/response design.
Idempotent API + explicit job state machine with well-defined terminal states and a guard against silent hangs (per-job deadline / timeout).
Concrete failure-detection and ownership scheme: a real mechanism for noticing a dead worker and resolving who owns a job, with a defense against a returning/partitioned worker corrupting state, not a hand-wave at "we retry."
Failure taxonomy: distinguishes retryable vs permanent failures and reasons about the right action per class, including resource-exhaustion (OOM) failures and a cap on runaway / repeatedly-failing jobs.
Single-winner result publish: only one attempt's output can ever become the result, so partial or duplicate artifacts never leak to the user.
Scheduling for utilization and fairness: priority queues, weight affinity, batching tradeoffs, admission control / backpressure, and spot+on-demand autoscaling driven by wait-time.
Storage split: transactional metadata in a DB vs large artifacts in object storage, with lifecycle expiry and signed-URL delivery.
Two-stage safety (pre-prompt and post-frame) plus watermarking/provenance, and layered quotas (rate, concurrency, budget) enforced atomically.
Observability that separates queue wait from compute time and tracks model quality per version.
Sensible cost/latency reasoning: a defensible estimate of steady-state accelerator count from throughput and per-job time, and an argument for why duration/resolution scale super-linearly and are therefore gated parameters.
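
The ownership and single-winner points above are commonly addressed with a lease plus a fencing token. The sketch below is one possible mechanism under stated assumptions, not the reference solution: `JobRecord`, `claim`, `heartbeat`, and `publish` are hypothetical names, the record lives in memory, and in a real system each function would be a single conditional update against the metadata DB.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    attempt: int = 0                  # fencing token, bumped on every (re)assignment
    owner: Optional[str] = None
    lease_expires_at: float = 0.0
    result_key: Optional[str] = None  # set at most once
    billed: bool = False


def claim(job: JobRecord, worker_id: str, now: float, lease_s: float) -> Optional[int]:
    """Take the job if it is unowned or its lease has expired; return the new token."""
    if job.result_key is not None:
        return None
    if job.owner is not None and now < job.lease_expires_at:
        return None
    job.attempt += 1
    job.owner = worker_id
    job.lease_expires_at = now + lease_s
    return job.attempt


def heartbeat(job: JobRecord, worker_id: str, token: int, now: float, lease_s: float) -> bool:
    """Extend the lease only for the current owner holding the current token."""
    if job.owner != worker_id or job.attempt != token:
        return False
    job.lease_expires_at = now + lease_s
    return True


def publish(job: JobRecord, token: int, result_key: str) -> bool:
    """Compare-and-set: only the current attempt may publish, only once; billing rides along."""
    if job.result_key is not None or token != job.attempt:
        return False
    job.result_key = result_key
    job.billed = True
    return True


job = JobRecord()
t1 = claim(job, "worker-a", now=0.0, lease_s=30.0)   # worker-a is then partitioned
t2 = claim(job, "worker-b", now=31.0, lease_s=30.0)  # lease expired, job reassigned
print(publish(job, t2, "videos/attempt-2.mp4"))      # current attempt wins
print(publish(job, t1, "videos/attempt-1.mp4"))      # stale token is rejected
```

Tying the billing flag to the same compare-and-set as the result pointer is the design choice that rules out a double bill: a stale attempt cannot change either field, and its orphaned artifact can be garbage-collected later.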

Follow-up Questions

A worker is network-partitioned, declared dead, its job retried and completed elsewhere — then the original worker reconnects and tries to publish its result. Walk through exactly how your design prevents a double publish or double bill.
80% of a 90 s job completes and then the spot instance is preempted. How do you avoid redoing the expensive work, and what are the limits of resuming a non-deterministic GPU computation?
Inference finished successfully but the post-processing (encode / upscale / watermark) stage crashed. What do you re-run, what must you not re-run, and why?
A single prompt deterministically crashes whatever worker picks it up (a "poison pill"). How does the system keep this from taking down the fleet, and how do you bill it?
Your GPU utilization drops to 60% while queues are full and wait times climb. What are the likely causes, and how does your monitoring let you localize the bug to scheduling/placement vs the model vs hardware?
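
Several of the follow-ups above (preemption, OOM, the poison-pill prompt) come down to classifying a failure and choosing an action. The sketch below shows one hypothetical policy; the failure classes, `MAX_ATTEMPTS`, and the "crashed on two distinct workers" poison-pill threshold are assumptions made for illustration, not values given in the question.

```python
from enum import Enum


class Failure(Enum):
    PREEMPTED = "preempted"
    WORKER_CRASH = "worker_crash"
    OOM = "oom"
    SAFETY_REJECTED = "safety_rejected"
    INVALID_INPUT = "invalid_input"


class Action(Enum):
    RETRY = "retry"
    RETRY_LARGER_GPU = "retry_on_larger_gpu"
    FAIL = "fail_permanently"


MAX_ATTEMPTS = 3          # assumed cap on total attempts per job
POISON_PILL_WORKERS = 2   # assumed: crashes on this many distinct workers => give up

PERMANENT = {Failure.SAFETY_REJECTED, Failure.INVALID_INPUT}


def decide(failure: Failure, attempts_so_far: int, distinct_workers_crashed: int) -> Action:
    """Map a failure class plus retry history to the next action."""
    if failure in PERMANENT:
        return Action.FAIL                 # retrying cannot change the outcome
    if attempts_so_far >= MAX_ATTEMPTS:
        return Action.FAIL                 # cap runaway jobs
    if failure is Failure.WORKER_CRASH and distinct_workers_crashed >= POISON_PILL_WORKERS:
        return Action.FAIL                 # likely a poison pill, stop feeding it to the fleet
    if failure is Failure.OOM:
        return Action.RETRY_LARGER_GPU     # same placement would hit the same memory limit
    return Action.RETRY


print(decide(Failure.PREEMPTED, 1, 0).value)
print(decide(Failure.WORKER_CRASH, 2, 2).value)
```

Counting distinct crashed workers, rather than raw attempts, is what separates a bad prompt from a bad host: one flaky machine failing the same job twice should not be treated as evidence against the job.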
