Design Video Generation Orchestration
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
Design a scalable system to **orchestrate AI video generation** (think Sora-style text-to-video).
Users submit text prompts to generate videos. Once jobs are submitted, a user must be able to:
- View the status of **all of their video generations** throughout the day.
- Receive a **notification when a video generation finishes** (success or failure).
Video generation is **long-running** (seconds to several minutes per job) and **expensive** (GPU-bound, capacity-limited). Job submission and status queries, by contrast, must stay fast.
You do **not** need to design video storage or video serving — assume each finished video has a unique address (URL) that your system can store and reference. Focus your design on:
- The main **APIs** and **data model**.
- How a job moves through the system from **submission to completion**.
- Handling **high request volume** and **long-running** generation tasks.
- How users **query job status** efficiently.
- How **completion notifications** are delivered reliably.
- How **failures, retries, rate limits, and observability** work.
```hint Where to start
This is an asynchronous job-orchestration problem, not a request/response one. The submit API should accept the prompt, persist a job record, enqueue work, and return a `job_id` immediately — never block on generation. Sketch the producer → durable queue → worker → backend pipeline first.
```
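As a minimal sketch of the "persist, enqueue, return immediately" flow described in the hint, the snippet below uses an in-memory dict and `queue.Queue` as stand-ins for a jobs table and a durable queue. The names `Orchestrator`, `JobRecord`, `submit`, and `list_jobs` are hypothetical and chosen only for illustration.

```python
import queue
import uuid
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str
    user_id: str
    prompt: str
    status: str = "QUEUED"


class Orchestrator:
    def __init__(self):
        self.jobs = {}                    # stand-in for the jobs table
        self.work_queue = queue.Queue()   # stand-in for a durable queue

    def submit(self, user_id, prompt):
        """Accept a prompt and return a job_id without waiting for generation."""
        job = JobRecord(job_id=str(uuid.uuid4()), user_id=user_id, prompt=prompt)
        self.jobs[job.job_id] = job       # 1. persist the job record
        self.work_queue.put(job.job_id)   # 2. enqueue work for a worker
        return job.job_id                 # 3. return immediately

    def list_jobs(self, user_id):
        """Status view: every job belonging to one user."""
        return [j for j in self.jobs.values() if j.user_id == user_id]
```

In a real system the persist and enqueue steps are separate writes, so a crash between them needs handling (for example, a sweeper that re-enqueues `QUEUED` jobs that never reached the queue); the sketch ignores that gap.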
```hint Data structure
Think about what transitions a job goes through from submission to completion. What does a "job" look like as a data structure in your DB, and how do you ensure that two workers (or a redelivered queue message) can't each advance the job independently?
```
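One way to make the question in this hint concrete is a job row plus a conditional update: a transition succeeds only if the row is still in the state the caller expects. The sketch below uses Python's built-in `sqlite3`; the table layout, the state names, and the `transition` helper are assumptions for illustration, not a prescribed schema.

```python
import sqlite3

# Assumed state machine: (from_state, to_state) pairs that are legal.
ALLOWED = {
    ("QUEUED", "RUNNING"),
    ("RUNNING", "SUCCEEDED"),
    ("RUNNING", "FAILED"),
    ("RUNNING", "QUEUED"),   # retry after a lost worker
}


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " prompt TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " attempt INTEGER NOT NULL DEFAULT 0,"
        " video_url TEXT)"
    )
    return db


def transition(db, job_id, expected, new):
    """Compare-and-set: move job_id from `expected` to `new`.

    Returns True if this caller won the transition, False if the row was
    no longer in `expected` (another worker or a redelivery got there first).
    """
    if (expected, new) not in ALLOWED:
        raise ValueError(f"illegal transition {expected} -> {new}")
    cur = db.execute(
        "UPDATE jobs SET status = ? WHERE job_id = ? AND status = ?",
        (new, job_id, expected),
    )
    db.commit()
    return cur.rowcount == 1
```

A worker that receives a redelivered message calls `transition(db, job_id, "QUEUED", "RUNNING")`, gets `False`, and drops the message instead of starting a second generation.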
```hint Reliability
Consider two independent failure points in the worker lifecycle: one where the worker finishes the generation but does not reach the notification step, and one where the worker never finishes at all. How does each failure mode get detected? How does each eventually resolve without manual intervention?
```
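The two failure modes in this hint can each be handled by a periodic background sweep. The sketch below assumes a lease (heartbeat deadline) on running jobs and a `notified` flag that acts as a simple outbox; `Job`, `reap_expired`, `relay_notifications`, and `MAX_ATTEMPTS` are hypothetical names, and time is passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed retry budget


@dataclass
class Job:
    job_id: str
    status: str
    attempt: int = 1
    lease_expires_at: float = 0.0
    notified: bool = False


def reap_expired(jobs, now):
    """Worker never finished: requeue (or fail) RUNNING jobs whose lease lapsed."""
    changed = []
    for job in jobs:
        if job.status == "RUNNING" and job.lease_expires_at < now:
            job.status = "QUEUED" if job.attempt < MAX_ATTEMPTS else "FAILED"
            changed.append(job.job_id)
    return changed


def relay_notifications(jobs, send):
    """Worker finished but never notified: send for terminal, un-notified jobs.

    The flag is set only after `send` returns, so a crash or error causes a
    resend on the next sweep (at-least-once delivery).
    """
    sent = []
    for job in jobs:
        if job.status in ("SUCCEEDED", "FAILED") and not job.notified:
            try:
                send(job.job_id, job.status)
            except Exception:
                continue  # leave notified=False; the next sweep retries
            job.notified = True
            sent.append(job.job_id)
    return sent
```

Because delivery is at-least-once, the receiving side still needs to suppress duplicates, for example by keying each notification on `job_id` plus terminal status.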
### Constraints & Assumptions
State your own numbers, but a reasonable working set:
- **Scale:** ~1M users and peak ~1,000 submissions/sec, with only a fraction of users generating at any one time. Status reads dominate writes, because users poll or watch their in-flight jobs.
- **Generation time:** p50 ~30s, p99 several minutes. Jobs are heterogeneous (duration, resolution).
- **Capacity:** the GPU backend is the bottleneck — far fewer concurrent generation slots than queued jobs. Backpressure and fair scheduling matter.
- **Delivery semantics:** at-least-once notification delivery is acceptable; visible duplicates must be suppressed.
- **Out of scope:** storing/serving the video bytes, CDN, video encoding. You only persist and reference the final URL.
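A quick back-of-the-envelope check shows why backpressure matters under these numbers. By Little's law, the average number of jobs in service equals arrival rate times mean service time. The sketch below treats the p50 of ~30s as a rough stand-in for the mean (an assumption; the true mean is likely higher given the long tail), and the 10,000-slot figure is purely hypothetical.

```python
def required_slots(arrival_rate_per_sec, mean_service_sec):
    """Little's law: slots needed to keep up with a sustained arrival rate."""
    return arrival_rate_per_sec * mean_service_sec


def backlog_growth_per_sec(arrival_rate_per_sec, mean_service_sec, slots):
    """How fast the queue grows when capacity is below what Little's law needs."""
    drain_rate = slots / mean_service_sec   # jobs completed per second
    return max(0.0, arrival_rate_per_sec - drain_rate)


# Sustained peak from the constraints: 1,000 submissions/sec at ~30s each.
print(required_slots(1000, 30))                   # 30000 concurrent slots
# With a hypothetical 10,000 slots, the backlog grows by roughly 667 jobs/sec.
print(backlog_growth_per_sec(1000, 30, 10_000))
```

If the peak is sustained and capacity falls short, the queue grows without bound, so admission control (quotas, rejection, or degraded tiers) has to bound it.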
### Clarifying Questions to Ask
- Is the generation backend **synchronous** (call blocks until done) or **asynchronous** (returns a backend job ID + webhook/poll)? This drives the entire worker model.
- What notification channels are required — in-app only, or also email / mobile push / websocket?
- Do we need **job cancellation**, and can the model backend actually abort a running generation?
- Are there **priority tiers** (free vs. paid vs. internal) that affect scheduling and quotas?
- What are the per-user **rate limits and quotas** (requests/min, concurrent jobs, daily cap)?
- What status freshness do clients expect: is a few seconds of staleness on the list view acceptable, or must reads be strongly consistent?
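Once the rate-limit answer is known, a per-user token bucket is one common way to enforce a requests-per-interval limit at the submit API. The sketch below is a single-process illustration with hypothetical names (`TokenBucket`, `allow`); a real deployment would keep the bucket state in a shared store, and concurrent-job and daily caps would need separate counters.

```python
class TokenBucket:
    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = float(rate_per_sec)    # tokens refilled per second
        self.capacity = float(capacity)    # maximum burst size
        self.tokens = float(capacity)      # start full
        self.updated = now

    def allow(self, now):
        """Return True and consume a token if the request is within the limit."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per user; rejected submissions would map to HTTP 429.
bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
# [True, True, False, True]
```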
### Follow-up Questions
- The generation backend is asynchronous and calls a **webhook** on completion, but webhooks can be lost or duplicated. How do you guarantee a job eventually reaches a terminal state without leaning solely on the webhook?
- GPU capacity is suddenly halved. How does your system shed/queue load fairly so paid users still get served and the queue doesn't grow unbounded?
- A user submits the same prompt three times within a second due to a flaky network and double-clicks. Walk through exactly how your design avoids three generations — and where the idempotency key is checked.
- You're seeing many jobs stuck in `RUNNING` for far longer than p99. How do you detect, attribute, and safely recover them without double-charging the user or double-notifying?
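For the duplicate-submission follow-up, a minimal sketch of an idempotent submit path is shown below. It assumes the client sends the same `idempotency_key` on every retry of one logical request; `SubmitService` and its fields are hypothetical, and the dict lookup stands in for a unique index on `(user_id, idempotency_key)` that a real database would enforce atomically.

```python
import uuid


class SubmitService:
    def __init__(self):
        self.jobs = {}       # job_id -> job record
        self.by_key = {}     # (user_id, idempotency_key) -> job_id
        self.enqueued = []   # stand-in for the durable queue

    def submit(self, user_id, prompt, idempotency_key):
        """Return (job_id, created). Retries with the same key reuse the job."""
        key = (user_id, idempotency_key)
        existing = self.by_key.get(key)
        if existing is not None:
            return existing, False          # duplicate: no new generation
        job_id = str(uuid.uuid4())
        self.by_key[key] = job_id           # checked before any work is enqueued
        self.jobs[job_id] = {"user_id": user_id, "prompt": prompt, "status": "QUEUED"}
        self.enqueued.append(job_id)
        return job_id, True
```

Three submissions with the same key produce one job and one queue entry. The check happens at the API layer, before the job is persisted or enqueued, so duplicates never reach the GPU backend.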
Quick Answer: This system design question tests a candidate's ability to architect asynchronous, distributed job-orchestration pipelines at scale. It evaluates practical knowledge of durable queues, state machines, reliability patterns like the outbox and idempotency, and trade-offs between throughput and latency in GPU-constrained systems.