Design a Text-to-Video Generation Platform (Sora-style)
Company: OpenAI
Role: Software Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Design a Text-to-Video Generation Platform (Sora-style)
You are asked to design the backend platform that powers a **text-to-video generation** product. A user submits a natural-language prompt (optionally with a reference image, a target duration, an aspect ratio, and a style), and the system returns a short generated video produced by a large generative video model.
Treat the model itself as a given black box: it is a GPU-hungry diffusion/transformer model whose inference takes anywhere from tens of seconds to several minutes per clip. Your job is to design **everything around the model** — request submission, queuing, GPU-backed generation, input/output safety, storage, and delivery — so that it works correctly and economically at scale.
### Constraints & Assumptions
- ~10M registered users, ~1M daily active users.
- An active user generates ~5 clips/day → ~5M generation jobs/day, with peaks 3–5x the average.
- Each generation consumes 30s–5min of GPU time depending on requested duration and resolution.
- Output clips are up to ~20s, 720p–1080p, roughly 10–100 MB each.
- The GPU fleet is the scarce, expensive resource; the design must keep utilization high.
- Generation is asynchronous: users tolerate seconds-to-minutes latency but expect progress feedback.
- Safety is non-negotiable: disallowed prompts (e.g., sexual content involving minors, non-consensual real-person likenesses, graphic violence) must be blocked at **both** input and output.
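
The constraints above imply rough sizing numbers that are worth computing before drawing any boxes. The following is a back-of-envelope sketch only: it treats the stated "GPU time" as GPU-seconds per job, uses decimal units, and the helper names (`gpus_needed`, `storage_tb_per_day`) are illustrative rather than part of the question.

```python
# Back-of-envelope sizing derived from the stated constraints.
SECONDS_PER_DAY = 24 * 60 * 60

dau = 1_000_000
clips_per_user_per_day = 5
jobs_per_day = dau * clips_per_user_per_day          # 5,000,000 jobs/day

avg_jobs_per_sec = jobs_per_day / SECONDS_PER_DAY    # about 58 jobs/s
peak_jobs_per_sec = (3 * avg_jobs_per_sec, 5 * avg_jobs_per_sec)


def gpus_needed(gpu_seconds_per_job: float, utilization: float = 1.0) -> float:
    """Average GPUs required to keep up with the daily load.

    `utilization` is the fraction of wall-clock time a GPU spends on
    useful work; 1.0 is an idealized lower bound on fleet size.
    """
    return jobs_per_day * gpu_seconds_per_job / (SECONDS_PER_DAY * utilization)


def storage_tb_per_day(mb_per_clip: float) -> float:
    """New output bytes written per day, in decimal terabytes."""
    return jobs_per_day * mb_per_clip / 1_000_000


if __name__ == "__main__":
    print(f"avg submit rate: {avg_jobs_per_sec:.1f} jobs/s")
    print(f"GPUs at 30 s/job:  {gpus_needed(30):,.0f}")
    print(f"GPUs at 300 s/job: {gpus_needed(300):,.0f}")
    print(f"storage at 10 MB/clip:  {storage_tb_per_day(10):,.0f} TB/day")
    print(f"storage at 100 MB/clip: {storage_tb_per_day(100):,.0f} TB/day")
```

Even at perfect utilization the average load sits in the thousands of GPUs, and new output is on the order of tens to hundreds of terabytes per day, which is why GPU scheduling and retention policy dominate the rest of the design.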
### Clarifying Questions to Ask
- Is there a synchronous low-res preview, or is everything fully asynchronous? (Assume async with progress.)
- Are there free vs. paid tiers with different quotas and queue priorities?
- Do we need iterative editing (extend a clip, remix, regenerate a region), or only one-shot generation?
- What are the retention policies for generated videos and for the prompts themselves?
- Are there regional / data-residency or age-gating compliance requirements?
- Do we own model serving, or do we call an internal inference service that abstracts the GPUs?
### Part 1 — Public API and Job Lifecycle
Define the client-facing API and the full lifecycle of a generation job, from submission through to the delivered video.
```hint Where to start
Model generation as a long-running async job: the submit call returns a `job_id` immediately, and the client polls or subscribes for status. Don't block an HTTP request for minutes.
```
```hint Durability
The GPU is the scarce resource, so decouple *submission* from *execution* with a durable queue. The API tier should be cheap and stateless; the expensive work happens behind the queue.
```
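
One way to make the two hints concrete is an in-memory sketch of the submit path and job state machine. Everything here is an assumption for illustration: the state names, the `JobService` class, the per-user idempotency key, and the Python list standing in for a durable queue are not specified by the question.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    MODERATING = "moderating"   # output safety check before publishing
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    BLOCKED = "blocked"         # rejected by input or output safety


# Legal transitions; RUNNING -> QUEUED models a retry after a worker loss.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.BLOCKED, JobState.FAILED},
    JobState.RUNNING: {JobState.MODERATING, JobState.QUEUED, JobState.FAILED},
    JobState.MODERATING: {JobState.SUCCEEDED, JobState.BLOCKED, JobState.FAILED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.BLOCKED: set(),
}


@dataclass
class Job:
    job_id: str
    user_id: str
    prompt: str
    state: JobState = JobState.QUEUED


class JobService:
    """Stateless-API stand-in: dicts play the metadata DB, a list the queue."""

    def __init__(self) -> None:
        self._jobs = {}      # job_id -> Job
        self._by_key = {}    # (user_id, idempotency_key) -> job_id
        self.queue = []      # stand-in for a durable queue

    def submit(self, user_id: str, prompt: str, idempotency_key: str) -> Job:
        """Return immediately with a job; a retried request reuses the same job."""
        key = (user_id, idempotency_key)
        if key in self._by_key:
            return self._jobs[self._by_key[key]]
        job = Job(job_id=str(uuid.uuid4()), user_id=user_id, prompt=prompt)
        self._jobs[job.job_id] = job
        self._by_key[key] = job.job_id
        self.queue.append(job.job_id)
        return job

    def get(self, job_id: str) -> Job:
        """Polling endpoint: the client reads state until it is terminal."""
        return self._jobs[job_id]

    def transition(self, job_id: str, new_state: JobState) -> Job:
        job = self._jobs[job_id]
        if new_state not in ALLOWED[job.state]:
            raise ValueError(
                f"illegal transition {job.state.value} -> {new_state.value}"
            )
        job.state = new_state
        return job
```

The idempotency key matters because a client that times out and retries a submit should not be charged for, or consume GPU time on, a second identical job.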
### Part 2 — Generation Pipeline and GPU Scheduling
Design how a queued job becomes a finished video: the worker pipeline, how work is dispatched to GPUs, and how you keep the expensive fleet busy.
```hint Separation
Split the control plane (orchestration, queue, metadata) from the data plane (GPU workers that pull jobs and run the model). The control plane is cheap and elastic; the data plane is expensive and capacity-bounded.
```
```hint Utilization
Keep GPUs busy: priority queues per tier, autoscale workers on queue depth / wait time, batch compatible jobs where the model allows, and consider preemption so paid jobs aren't stuck behind a long free-tier backlog.
```
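
As one possible reading of "priority queues per tier" that still bounds free-tier waiting, the sketch below orders jobs by a virtual time: the enqueue time plus a per-tier delay. The 60-second delay, the tier names, and the `AgingQueue` class are hypothetical choices, and a real system would likely combine this with per-user fairness and quotas.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical policy: a free job is scheduled as if it arrived 60 s later.
TIER_DELAY_SECONDS = {"paid": 0.0, "free": 60.0}


@dataclass(order=True)
class QueuedJob:
    virtual_time: float                 # the only field used for ordering
    job_id: str = field(compare=False)
    tier: str = field(compare=False)


class AgingQueue:
    """Min-heap on virtual time.

    A paid job jumps ahead of a free job only if it was enqueued less than
    the free-tier delay after it, so a free job cannot be starved by an
    endless stream of later paid arrivals.
    """

    def __init__(self) -> None:
        self._heap = []

    def push(self, job_id: str, tier: str, enqueued_at: float) -> None:
        virtual_time = enqueued_at + TIER_DELAY_SECONDS[tier]
        heapq.heappush(self._heap, QueuedJob(virtual_time, job_id, tier))

    def pop(self) -> str:
        return heapq.heappop(self._heap).job_id

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    q = AgingQueue()
    q.push("free-1", "free", enqueued_at=0.0)
    q.push("paid-1", "paid", enqueued_at=30.0)
    q.push("paid-2", "paid", enqueued_at=90.0)
    print([q.pop() for _ in range(len(q))])
```

Because the ordering key is fixed at enqueue time, the queue stays a plain heap; strict priority with a dynamic aging term would need periodic re-scoring instead.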
### Part 3 — Safety, Storage, Delivery, and Observability
Design input/output safety, where the generated videos live and how they reach users, and what you monitor.
```hint Two-sided safety
Safety has two stages, not one: classify the **prompt** before generation (rejecting or sanitizing it as needed), and moderate the **output** (sampled frames + audio) after generation, before the video is ever made viewable.
```
```hint Heavy bytes
Keep the large video bytes out of your database — store them in object storage fronted by a CDN, and keep only metadata + a storage key in the DB.
```
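
The two hints can be combined into a single gate, sketched here with injected callables. The classifier functions, the `generate` call, the `put_object` writer, and the `videos/<job_id>.mp4` key layout are all placeholders, not real APIs; a production pipeline would more likely write the output to a quarantined location first and moderate it from there.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VideoRecord:
    """Metadata row kept in the database; the bytes live in object storage."""
    job_id: str
    status: str                      # "published" or "blocked"
    storage_key: Optional[str]       # None when nothing was published
    blocked_stage: Optional[str]     # "input", "output", or None


def run_job(
    job_id: str,
    prompt: str,
    prompt_ok: Callable[[str], bool],          # input-side classifier (assumed)
    generate: Callable[[str], bytes],          # black-box model call (assumed)
    output_ok: Callable[[bytes], bool],        # output-side moderation (assumed)
    put_object: Callable[[str, bytes], None],  # object-store writer (assumed)
) -> VideoRecord:
    # Stage 1: reject disallowed prompts before spending any GPU time.
    if not prompt_ok(prompt):
        return VideoRecord(job_id, "blocked", None, "input")

    video = generate(prompt)

    # Stage 2: moderate the result before it becomes viewable anywhere.
    if not output_ok(video):
        return VideoRecord(job_id, "blocked", None, "output")

    # Only a storage key goes in the record; the CDN serves the object itself.
    key = f"videos/{job_id}.mp4"
    put_object(key, video)
    return VideoRecord(job_id, "published", key, None)
```

Checking the prompt first is both a safety and a cost measure: a blocked prompt never reaches the GPU fleet.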
### Follow-up Questions
- How would you support priority tiers (free vs. paid) without indefinitely starving free users?
- A generation takes 4 minutes and the worker crashes at minute 3 — exactly what happens, and who notices?
- How would you add an "extend this clip" / iterative-editing feature on top of this design?
- How would you roll out and A/B test a new, more expensive model version safely without blowing the GPU budget?
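
For the worker-crash question, one common pattern is a lease with heartbeats: the worker must keep renewing its claim, and a control-plane reaper requeues any job whose lease has lapsed. The sketch below is an assumption-laden illustration, not a prescribed answer; the 30-second lease, the three-attempt cap, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 30.0   # hypothetical: how long a claim lasts without a heartbeat
MAX_ATTEMPTS = 3       # hypothetical: give up after this many claims


@dataclass
class LeasedJob:
    job_id: str
    state: str = "queued"              # "queued", "running", or "failed"
    worker_id: Optional[str] = None
    lease_expires_at: float = 0.0
    attempts: int = 0


def claim(job: LeasedJob, worker_id: str, now: float) -> bool:
    """A worker takes a queued job and starts its lease."""
    if job.state != "queued":
        return False
    job.state = "running"
    job.worker_id = worker_id
    job.lease_expires_at = now + LEASE_SECONDS
    job.attempts += 1
    return True


def heartbeat(job: LeasedJob, worker_id: str, now: float) -> bool:
    """Extend the lease; returns False if this worker no longer owns the job."""
    if job.state != "running" or job.worker_id != worker_id:
        return False
    if now > job.lease_expires_at:
        return False
    job.lease_expires_at = now + LEASE_SECONDS
    return True


def reap(job: LeasedJob, now: float) -> None:
    """Run periodically by the control plane: this is 'who notices' a crash."""
    if job.state == "running" and now > job.lease_expires_at:
        job.worker_id = None
        job.state = "queued" if job.attempts < MAX_ATTEMPTS else "failed"
```

Under these assumptions, a crash at minute 3 is detected within one lease period, the job is requeued and restarted from scratch, and a stale worker that comes back late has its heartbeat rejected so it cannot publish a duplicate result.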
Quick Answer: This ML system design question evaluates a candidate's ability to architect the infrastructure around a large generative video model, including asynchronous job orchestration, GPU fleet scheduling, and content safety pipelines. It tests practical application of distributed systems concepts such as durable queuing, idempotency, and utilization-aware autoscaling for expensive compute resources, a common theme in senior ML infrastructure interviews.