How do I approach System Design interview questions?

System Design questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master system design interviews.

What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design Video Generation Orchestration | OpenAI Interview Question

Q: Design Video Generation Orchestration

This system design question tests a candidate's ability to architect asynchronous, distributed job-orchestration pipelines at scale. It evaluates practical knowledge of durable queues, state machines, reliability patterns like the outbox and idempotency, and trade-offs between throughput and latency in GPU-constrained systems.

Design a scalable system to orchestrate AI video generation (think Sora-style text-to-video).

Users submit text prompts to generate videos. For each submitted generation job, a user must be able to:

View the status of all of their video generations throughout the day.
Receive a notification when a video generation finishes (success or failure).

Video generation is long-running (seconds to several minutes per job) and expensive (GPU-bound, capacity-limited). Job submission and status queries, by contrast, must stay fast.

You do not need to design video storage or video serving — assume each finished video has a unique address (URL) that your system can store and reference. Focus your design on:

The main APIs and data model.
How a job moves through the system from submission to completion.
Handling high request volume and long-running generation tasks.
How users query job status efficiently.
How completion notifications are delivered reliably.
How failures, retries, rate limits, and observability work.
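
One way to make "how a job moves through the system" concrete is an explicit state machine. The sketch below is illustrative only: the question names RUNNING but no other states, so QUEUED, SUCCEEDED, FAILED, and CANCELED, the retry edge from RUNNING back to QUEUED, and the `transition` helper are all assumptions.

```python
from enum import Enum


class JobState(Enum):
    # Only RUNNING appears in the question; the other states are assumed.
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Terminal states have no outgoing edges, so a late or duplicate
# completion signal cannot move a finished job again.
TERMINAL_STATES = frozenset(
    {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELED}
)

ALLOWED_TRANSITIONS = {
    JobState.QUEUED: frozenset({JobState.RUNNING, JobState.CANCELED}),
    JobState.RUNNING: frozenset(
        {
            JobState.SUCCEEDED,
            JobState.FAILED,
            JobState.CANCELED,
            JobState.QUEUED,  # assumed retry path after a transient failure
        }
    ),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    allowed = ALLOWED_TRANSITIONS.get(current, frozenset())
    if target not in allowed:
        raise ValueError(
            f"illegal transition {current.value} -> {target.value}"
        )
    return target
```

Keeping the allowed edges in one table gives every writer (API, worker, webhook handler) the same rule to check before it updates a job row.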

Constraints & Assumptions

State your own numbers, but a reasonable working set:

Scale: ~1M users, peak ~1,000 submissions/sec, with only a fraction of submitted jobs generating at any moment. Status-page reads dominate writes (users poll/watch their in-flight jobs).
Generation time: p50 ~30s, p99 several minutes. Jobs are heterogeneous (duration, resolution).
Capacity: the GPU backend is the bottleneck — far fewer concurrent generation slots than queued jobs. Backpressure and fair scheduling matter.
Delivery semantics: at-least-once notification delivery is acceptable; visible duplicates must be suppressed.
Out of scope: storing/serving the video bytes, CDN, video encoding. You only persist and reference the final URL.
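
A quick back-of-envelope check shows why the GPU backend dominates the design. The sketch below applies Little's law (average jobs in service = arrival rate × mean service time) to the working numbers above. Two values are assumptions rather than part of the question: the p50 of ~30s stands in for the mean (the real mean is higher because of the long tail), and the GPU slot count is hypothetical.

```python
def required_slots(arrival_rate_per_sec: float, mean_duration_sec: float) -> float:
    """Little's law: slots needed to serve every arrival with no queueing."""
    return arrival_rate_per_sec * mean_duration_sec


def backlog_growth_per_sec(
    arrival_rate_per_sec: float, slots: int, mean_duration_sec: float
) -> float:
    """Jobs added to the queue per second while arrivals exceed capacity."""
    drain_rate = slots / mean_duration_sec  # jobs completed per second
    return max(0.0, arrival_rate_per_sec - drain_rate)


PEAK_SUBMISSIONS_PER_SEC = 1_000  # from the working set above
ASSUMED_MEAN_DURATION_SEC = 30    # assumption: p50 used as a stand-in for the mean
HYPOTHETICAL_GPU_SLOTS = 3_000    # hypothetical capacity, not given in the question

print(required_slots(PEAK_SUBMISSIONS_PER_SEC, ASSUMED_MEAN_DURATION_SEC))
# 30000
print(
    backlog_growth_per_sec(
        PEAK_SUBMISSIONS_PER_SEC, HYPOTHETICAL_GPU_SLOTS, ASSUMED_MEAN_DURATION_SEC
    )
)
# 900.0
```

Under these assumptions a sustained peak needs about 30,000 concurrent slots, and with 3,000 slots the queue grows by roughly 900 jobs every second. That gap is what admission control and fair scheduling have to manage.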

Clarifying Questions to Ask

Is the generation backend synchronous (call blocks until done) or asynchronous (returns a backend job ID + webhook/poll)? This drives the entire worker model.
What notification channels are required — in-app only, or also email / mobile push / websocket?
Do we need job cancellation, and can the model backend actually abort a running generation?
Are there priority tiers (free vs. paid vs. internal) that affect scheduling and quotas?
What are the per-user rate limits and quotas (requests/min, concurrent jobs, daily cap)?
What status-freshness do clients expect — is a few seconds of staleness on the list view acceptable, or must reads be strongly consistent?
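
The answer to the rate-limit question shapes the submission path. As a minimal sketch of one common choice, here is a per-user token bucket. The class name, its fields, and the numbers in the usage lines are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-user request limiter: allows bursts up to `capacity`,
    then a steady rate of `refill_per_sec` requests per second."""

    capacity: float
    refill_per_sec: float
    tokens: float
    last_refill: float  # timestamp in seconds, supplied by the caller

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limit: bursts of 2 submissions, refilling 1 per second.
bucket = TokenBucket(capacity=2, refill_per_sec=1, tokens=2, last_refill=0.0)
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)])
# [True, True, False, True]
```

A token bucket covers the requests/min style of limit only. A cap on concurrent jobs or a daily quota would need separate counters tied to job state.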

Follow-up Questions

The generation backend is asynchronous and calls a webhook on completion, but webhooks can be lost or duplicated. How do you guarantee a job eventually reaches a terminal state without leaning solely on the webhook?
GPU capacity is suddenly halved. How does your system shed/queue load fairly so paid users still get served and the queue doesn't grow unbounded?
A user submits the same prompt three times within a second due to a flaky network and double-clicks. Walk through exactly how your design avoids three generations — and where the idempotency key is checked.
You're seeing many jobs stuck in RUNNING for far longer than p99. How do you detect, attribute, and safely recover them without double-charging the user or double-notifying?
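
Two of the questions above share one idea: make the write idempotent at the point where state changes. The sketch below is an in-memory stand-in, not a definitive design. `JobStore`, `submit`, and `complete` are hypothetical names, and the dictionaries stand in for a database with a unique index on (user_id, idempotency_key) and a conditional update on status.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

TERMINAL = {"SUCCEEDED", "FAILED"}


@dataclass
class Job:
    job_id: str
    user_id: str
    prompt: str
    status: str = "QUEUED"
    video_url: Optional[str] = None


class JobStore:
    """In-memory stand-in for a jobs table (assumed schema)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}
        # Plays the role of a unique index on (user_id, idempotency_key).
        self.by_key: Dict[Tuple[str, str], str] = {}

    def submit(self, user_id: str, idempotency_key: str, prompt: str) -> Job:
        """Create a job once per key; retries get the original job back."""
        key = (user_id, idempotency_key)
        existing_id = self.by_key.get(key)
        if existing_id is not None:
            return self.jobs[existing_id]
        job = Job(job_id=str(uuid.uuid4()), user_id=user_id, prompt=prompt)
        self.jobs[job.job_id] = job
        self.by_key[key] = job.job_id
        return job

    def complete(self, job_id: str, status: str, video_url: Optional[str] = None) -> bool:
        """Apply a terminal status at most once.

        Returns True only for the call that performed the transition, so the
        caller knows whether it is the one that should emit the notification.
        """
        if status not in TERMINAL:
            raise ValueError(f"not a terminal status: {status}")
        job = self.jobs[job_id]
        if job.status in TERMINAL:
            return False  # duplicate webhook, late poll result, or sweeper race
        job.status = status
        job.video_url = video_url
        return True
```

With this shape, three retries carrying the same key return one job, and a webhook handler, a status poller, and a timeout sweeper can all call `complete` safely because only the first call returns True. The sketch assumes the client generates the key once per intended submission and reuses it on retry; double-clicks that produce distinct keys would need client-side debouncing or a server-side dedup window.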

