
Design Video Generation Orchestration

Last updated: Jun 19, 2026

Quick Overview

This system design question tests a candidate's ability to architect asynchronous, distributed job-orchestration pipelines at scale. It evaluates practical knowledge of durable queues, state machines, reliability patterns like the outbox and idempotency, and trade-offs between throughput and latency in GPU-constrained systems.


Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Hints

  • Where to start: This is an asynchronous job-orchestration problem, not a request/response one. The submit API should accept the prompt, persist a job record, enqueue work, and return a job_id immediately; it should never block on generation. Sketch the producer → durable queue → worker → backend pipeline first.
  • Data structure: Think about what transitions a job goes through from submission to completion. What does a "job" look like as a data structure in your DB, and how do you ensure that two workers (or a redelivered queue message) can't each advance the job independently?
  • Reliability: Consider two independent failure points in the worker lifecycle: one where the worker finishes the generation but does not reach the notification step, and one where the worker never finishes at all. How does each failure mode get detected? How does each eventually resolve without manual intervention?
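One way to keep two workers (or a redelivered queue message) from both advancing the same job is a guarded, versioned state transition. The sketch below is illustrative only: the status names, the `Job` fields, and the in-memory `JobStore` are assumptions made for this example, standing in for a database table with a conditional update.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set


class Status(str, Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Legal edges of the (assumed) state machine; terminal states have no outgoing edges.
ALLOWED: Dict[Status, Set[Status]] = {
    Status.QUEUED: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.SUCCEEDED, Status.FAILED, Status.QUEUED},  # QUEUED = retry
    Status.SUCCEEDED: set(),
    Status.FAILED: set(),
}


@dataclass
class Job:
    job_id: str
    user_id: str
    prompt: str
    status: Status = Status.QUEUED
    version: int = 0                  # bumped on every successful transition
    video_url: Optional[str] = None   # set once the job succeeds


class JobStore:
    """Hypothetical in-memory stand-in for a jobs table keyed by job_id."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def insert(self, job: Job) -> None:
        self._jobs[job.job_id] = job

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def transition(self, job_id: str, expected_version: int,
                   new_status: Status, video_url: Optional[str] = None) -> bool:
        """Compare-and-set: apply only if the version matches and the edge is legal."""
        job = self._jobs[job_id]
        if job.version != expected_version or new_status not in ALLOWED[job.status]:
            return False  # another writer got there first, or the move is illegal
        job.status = new_status
        job.version += 1
        if video_url is not None:
            job.video_url = video_url
        return True
```

If two workers both read version 0 and try to move the job to RUNNING, only the first call returns `True`; the second sees a stale version and backs off. In a relational store the same guard would typically be a single conditional `UPDATE` that checks the version in its `WHERE` clause.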


Jun 12, 2026, 12:00 AM

Design a scalable system to orchestrate AI video generation (think Sora-style text-to-video).

Users submit text prompts to generate videos. For each submitted generation job, a user must be able to:

  • View the status of all of their video generations throughout the day.
  • Receive a notification when a video generation finishes (success or failure).

Video generation is long-running (seconds to several minutes per job) and expensive (GPU-bound, capacity-limited). Job submission and status queries, by contrast, must stay fast.

You do not need to design video storage or video serving — assume each finished video has a unique address (URL) that your system can store and reference. Focus your design on:

  • The main APIs and data model.
  • How a job moves through the system from submission to completion.
  • Handling high request volume and long-running generation tasks.
  • How users query job status efficiently.
  • How completion notifications are delivered reliably.
  • How failures, retries, rate limits, and observability work.
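As a rough illustration of the first two focus points, a submit path can persist the job, enqueue its ID, and return immediately, while a per-user index keeps the status list cheap to read. Everything named here (`Orchestrator`, `JobRecord`, the `idempotency_key` parameter, the in-memory dicts and deque) is a hypothetical stand-in for a real database and durable queue, not a prescribed design.

```python
import uuid
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple


@dataclass
class JobRecord:
    job_id: str
    user_id: str
    prompt: str
    status: str = "QUEUED"


class Orchestrator:
    """Toy submit/list path; dicts and a deque stand in for a DB and a durable queue."""

    def __init__(self) -> None:
        self.jobs: Dict[str, JobRecord] = {}
        self.by_idempotency_key: Dict[Tuple[str, str], str] = {}
        self.by_user: Dict[str, List[str]] = {}   # index for "all my jobs" reads
        self.queue: Deque[str] = deque()

    def submit(self, user_id: str, prompt: str, idempotency_key: str) -> str:
        """Persist the job, enqueue it, and return its ID without waiting on generation."""
        key = (user_id, idempotency_key)
        existing = self.by_idempotency_key.get(key)
        if existing is not None:
            return existing                       # client retry: same job, no new work
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = JobRecord(job_id, user_id, prompt)
        self.by_idempotency_key[key] = job_id
        self.by_user.setdefault(user_id, []).append(job_id)
        self.queue.append(job_id)                 # workers pull from here
        return job_id

    def list_jobs(self, user_id: str) -> List[JobRecord]:
        """Newest-first list of a user's jobs, served from the per-user index."""
        return [self.jobs[j] for j in reversed(self.by_user.get(user_id, []))]
```

The sketch glosses over one thing a real system has to solve: the job write and the enqueue must not diverge if the process dies between them. The outbox pattern named in the overview is one common way to close that gap.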

Constraints & Assumptions

State your own numbers, but a reasonable working set:

  • Scale: ~1M users, peak ~1,000 submissions/sec, but only a fraction generating concurrently. Status-page reads dominate writes (users poll/watch their in-flight jobs).
  • Generation time: p50 ~30s, p99 several minutes. Jobs are heterogeneous (duration, resolution).
  • Capacity: the GPU backend is the bottleneck — far fewer concurrent generation slots than queued jobs. Backpressure and fair scheduling matter.
  • Delivery semantics: at-least-once notification delivery is acceptable; visible duplicates must be suppressed.
  • Out of scope: storing/serving the video bytes, CDN, video encoding. You only persist and reference the final URL.
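A quick back-of-the-envelope check shows why the GPU backend dominates the design. The sketch below applies Little's law (jobs in service = arrival rate × mean service time) and treats the p50 of 30 s as the mean, which is an assumption that likely understates the true mean given the long tail. The 3,000-slot figure is purely hypothetical.

```python
def required_slots(arrival_rate_per_s: float, mean_service_s: float) -> float:
    """Little's law: mean number of jobs in service at a sustained arrival rate."""
    return arrival_rate_per_s * mean_service_s


def backlog_after(arrival_rate_per_s: float, mean_service_s: float,
                  slots: int, burst_s: float) -> float:
    """Jobs left queued after a burst when arrivals outpace what the slots can drain."""
    drain_rate_per_s = slots / mean_service_s
    return max(0.0, (arrival_rate_per_s - drain_rate_per_s) * burst_s)


if __name__ == "__main__":
    # Peak load from the working set above: 1,000 submissions/sec, ~30 s per job.
    print(required_slots(1000, 30))
    # Hypothetical fleet of 3,000 slots facing a 60-second peak.
    print(backlog_after(1000, 30, slots=3000, burst_s=60))
```

Sustaining the peak with no queueing would need about 30,000 concurrent generation slots. With the hypothetical 3,000 slots, the fleet drains only 100 jobs/sec, so a one-minute peak leaves roughly 54,000 jobs queued, which is why admission control and fair scheduling matter more than raw submit throughput.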

Clarifying Questions to Ask

  • Is the generation backend synchronous (call blocks until done) or asynchronous (returns a backend job ID + webhook/poll)? This drives the entire worker model.
  • What notification channels are required — in-app only, or also email / mobile push / websocket?
  • Do we need job cancellation, and can the model backend actually abort a running generation?
  • Are there priority tiers (free vs. paid vs. internal) that affect scheduling and quotas?
  • What are the per-user rate limits and quotas (requests/min, concurrent jobs, daily cap)?
  • What status-freshness do clients expect — is a few seconds of staleness on the list view acceptable, or must reads be strongly consistent?
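The answer to the rate-limit question usually shapes the admission layer. As one possible illustration, a per-user token bucket allows short bursts while capping the sustained submission rate. The class name, its fields, and the explicit `now` argument (injected so the logic is testable without a real clock) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_s: float    # sustained allowed rate
    tokens: float          # tokens currently available
    updated_at: float      # timestamp of the last refill

    @classmethod
    def full(cls, capacity: float, refill_per_s: float, now: float) -> "TokenBucket":
        """Start with a full bucket so a new user can burst immediately."""
        return cls(capacity, refill_per_s, capacity, now)

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False       # caller would reject the submission, e.g. with HTTP 429
```

A paid tier could simply get a larger `capacity` and `refill_per_s`. A cap on concurrent jobs is a separate check, since a token bucket limits submission rate, not how many generations are in flight.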


Follow-up Questions

  • The generation backend is asynchronous and calls a webhook on completion, but webhooks can be lost or duplicated. How do you guarantee a job eventually reaches a terminal state without leaning solely on the webhook?
  • GPU capacity is suddenly halved. How does your system shed/queue load fairly so paid users still get served and the queue doesn't grow unbounded?
  • A user submits the same prompt three times within a second due to a flaky network and double-clicks. Walk through exactly how your design avoids three generations — and where the idempotency key is checked.
  • You're seeing many jobs stuck in RUNNING for far longer than p99. How do you detect, attribute, and safely recover them without double-charging the user or double-notifying?
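For the stuck-in-RUNNING follow-up, one common approach is a lease with heartbeats plus an attempt counter used as a fencing token. The sketch below assumes hypothetical names throughout (`LeaseTable`, `claim`, `heartbeat`, `complete`, `reap`) and uses an in-memory dict in place of the jobs table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LeasedJob:
    job_id: str
    status: str = "QUEUED"
    attempt: int = 0                          # fencing token, bumped on every claim
    lease_expires_at: Optional[float] = None


class LeaseTable:
    def __init__(self, lease_s: float, max_attempts: int) -> None:
        self.lease_s = lease_s
        self.max_attempts = max_attempts
        self.jobs: Dict[str, LeasedJob] = {}

    def add(self, job_id: str) -> None:
        self.jobs[job_id] = LeasedJob(job_id)

    def claim(self, job_id: str, now: float) -> Optional[int]:
        """A worker takes a queued job and receives the attempt number it must present."""
        job = self.jobs[job_id]
        if job.status != "QUEUED":
            return None
        job.status, job.attempt = "RUNNING", job.attempt + 1
        job.lease_expires_at = now + self.lease_s
        return job.attempt

    def heartbeat(self, job_id: str, attempt: int, now: float) -> bool:
        """Extend the lease, but only for the current attempt."""
        job = self.jobs[job_id]
        if job.status != "RUNNING" or job.attempt != attempt:
            return False
        job.lease_expires_at = now + self.lease_s
        return True

    def complete(self, job_id: str, attempt: int, succeeded: bool) -> bool:
        """Returns True exactly once per job; stale attempts and duplicates get False."""
        job = self.jobs[job_id]
        if job.status != "RUNNING" or job.attempt != attempt:
            return False
        job.status = "SUCCEEDED" if succeeded else "FAILED"
        job.lease_expires_at = None
        return True

    def reap(self, now: float) -> List[str]:
        """Periodic sweep: requeue jobs whose lease expired, or fail them when out of attempts."""
        recovered = []
        for job in self.jobs.values():
            expired = job.lease_expires_at is not None and job.lease_expires_at <= now
            if job.status == "RUNNING" and expired:
                job.status = "QUEUED" if job.attempt < self.max_attempts else "FAILED"
                job.lease_expires_at = None
                recovered.append(job.job_id)
        return recovered
```

Because `complete` succeeds at most once per job, charging the user and writing the notification record can both hang off that single successful call. A late result from a reaped attempt, or a duplicated webhook, presents a stale attempt number or finds the job already terminal, and is ignored.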
