How do I approach ML System Design interview questions?

ML System Design questions require both an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design a Text-to-Video Generation Platform (Sora-style)

This ML system design question evaluates a candidate's ability to architect the infrastructure around a large generative video model, including asynchronous job orchestration, GPU fleet scheduling, and content safety pipelines. It tests practical application of distributed systems concepts such as durable queuing, idempotency, and utilization-aware autoscaling for expensive compute resources, a common theme in senior ML infrastructure interviews.

You are asked to design the backend platform that powers a text-to-video generation product. A user submits a natural-language prompt (optionally with a reference image, a target duration, an aspect ratio, and a style), and the system returns a short generated video produced by a large generative video model.

Treat the model itself as a given black box: it is a GPU-hungry diffusion/transformer model whose inference takes anywhere from tens of seconds to several minutes per clip. Your job is to design everything around the model — request submission, queuing, GPU-backed generation, input/output safety, storage, and delivery — so that it works correctly and economically at scale.

Constraints & Assumptions

~10M registered users, ~1M daily active users.
An active user generates ~5 clips/day → ~5M generation jobs/day, with peaks 3–5x the average.
Each generation consumes 30s–5min of GPU time depending on requested duration and resolution.
Output clips are up to ~20s, 720p–1080p, roughly 10–100 MB each.
The GPU fleet is the scarce, expensive resource; the design must keep utilization high.
Generation is asynchronous: users tolerate seconds-to-minutes latency but expect progress feedback.
Safety is non-negotiable: disallowed prompts (e.g., sexual content involving minors, non-consensual real-person likenesses, graphic violence) must be blocked at both input and output.
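
The figures above can be turned into a rough back-of-envelope estimate. The sketch below is only an illustration: it assumes each job occupies exactly one GPU for its whole GPU time, uses decimal units (1 TB = 1,000,000 MB), and applies Little's law (busy GPUs = arrival rate x service time). The function name `estimate` and its parameters are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# 1M daily active users x ~5 clips/day, as stated in the constraints.
jobs_per_day = 1_000_000 * 5
avg_jobs_per_sec = jobs_per_day / SECONDS_PER_DAY


def estimate(gpu_seconds_per_job: float, peak_factor: float, mb_per_clip: float) -> dict:
    """Steady-state estimates for one scenario.

    Assumption: one job holds one GPU for gpu_seconds_per_job seconds,
    so the number of busy GPUs is arrival rate times service time.
    """
    peak_jobs_per_sec = avg_jobs_per_sec * peak_factor
    return {
        "avg_jobs_per_sec": avg_jobs_per_sec,
        "peak_jobs_per_sec": peak_jobs_per_sec,
        "avg_busy_gpus": avg_jobs_per_sec * gpu_seconds_per_job,
        "peak_busy_gpus": peak_jobs_per_sec * gpu_seconds_per_job,
        "storage_tb_per_day": jobs_per_day * mb_per_clip / 1_000_000,
    }


# Cheapest corner: 30 s of GPU time, 3x peak, 10 MB clips.
low = estimate(gpu_seconds_per_job=30, peak_factor=3, mb_per_clip=10)
# Most expensive corner: 5 min of GPU time, 5x peak, 100 MB clips.
high = estimate(gpu_seconds_per_job=300, peak_factor=5, mb_per_clip=100)

for name, result in (("low", low), ("high", high)):
    print(name, {key: round(value, 1) for key, value in result.items()})
```

Under these assumptions the average load is about 58 jobs/s, which keeps roughly 1,700 to 17,000 GPUs busy on average and produces 50 to 500 TB of new video per day. The spread between the two corners is large enough that the real mix of durations and resolutions is worth asking about early.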

Clarifying Questions to Ask

Is there a synchronous low-res preview, or is everything fully asynchronous? (Assume async with progress.)
Are there free vs. paid tiers with different quotas and queue priorities?
Do we need iterative editing (extend a clip, remix, regenerate a region), or only one-shot generation?
What are the retention policies for generated videos and for the prompts themselves?
Are there regional / data-residency or age-gating compliance requirements?
Do we own model serving, or do we call an internal inference service that abstracts the GPUs?

Part 1 — Public API and Job Lifecycle

Define the client-facing API and the full lifecycle of a generation job, from submission through to the delivered video.
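
One possible shape for the job lifecycle is a small state machine behind an idempotent submit call. The sketch below is an in-memory illustration, not a reference answer: the state names, the `JobStore` class, and the per-user idempotency key are all assumptions, and a real system would keep this in a durable database and run the input safety check before a job is created.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CHECKING = "checking"    # output safety review before release
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    BLOCKED = "blocked"      # rejected by the output safety check
    CANCELED = "canceled"


# Hypothetical transition table; states missing as keys are terminal.
ALLOWED = {
    State.QUEUED: {State.RUNNING, State.CANCELED},
    # RUNNING -> QUEUED models a retry after a worker failure.
    State.RUNNING: {State.CHECKING, State.FAILED, State.QUEUED, State.CANCELED},
    State.CHECKING: {State.SUCCEEDED, State.BLOCKED},
}


@dataclass
class Job:
    job_id: str
    prompt: str
    state: State = State.QUEUED
    progress: float = 0.0    # 0.0 to 1.0, reported to polling clients


class JobStore:
    """In-memory stand-in for a durable job table."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}
        self._by_key: dict[tuple[str, str], str] = {}

    def submit(self, user_id: str, idempotency_key: str, prompt: str) -> Job:
        # A retried request with the same key returns the original job
        # instead of starting a second, expensive generation.
        key = (user_id, idempotency_key)
        if key in self._by_key:
            return self._jobs[self._by_key[key]]
        job = Job(job_id=uuid.uuid4().hex, prompt=prompt)
        self._jobs[job.job_id] = job
        self._by_key[key] = job.job_id
        return job

    def transition(self, job_id: str, new_state: State) -> Job:
        job = self._jobs[job_id]
        if new_state not in ALLOWED.get(job.state, set()):
            raise ValueError(f"illegal transition {job.state.name} -> {new_state.name}")
        job.state = new_state
        return job
```

A client would call `submit` once per user action with a client-generated key, then poll (or subscribe to) the job's `state` and `progress` until it reaches a terminal state.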

Part 2 — Generation Pipeline and GPU Scheduling

Design how a queued job becomes a finished video: the worker pipeline, how work is dispatched to GPUs, and how you keep the expensive fleet busy.
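
A common way to dispatch long-running jobs to workers is a queue with time-limited leases that workers renew through heartbeats, so that a job held by a crashed worker becomes claimable again. The sketch below is a minimal single-process illustration of that idea, assuming hypothetical names (`LeaseQueue`, `claim`, `heartbeat`, `complete`) and an injectable clock for testing; a production system would rely on a durable queue or database rather than in-memory structures.

```python
import time
from collections import deque


class LeaseQueue:
    """FIFO queue where a claimed job must be heartbeated or it is requeued."""

    def __init__(self, lease_seconds=60.0, max_attempts=3, clock=time.monotonic):
        self._lease_seconds = lease_seconds
        self._max_attempts = max_attempts
        self._clock = clock
        self._ready = deque()
        self._leases = {}      # job_id -> (worker_id, expires_at)
        self._attempts = {}    # job_id -> number of times claimed
        self.dead = []         # jobs that exhausted their attempts

    def enqueue(self, job_id):
        self._attempts.setdefault(job_id, 0)
        self._ready.append(job_id)

    def claim(self, worker_id):
        """Hand the next ready job to a worker, or return None if idle."""
        self.reap_expired()
        if not self._ready:
            return None
        job_id = self._ready.popleft()
        self._attempts[job_id] += 1
        self._leases[job_id] = (worker_id, self._clock() + self._lease_seconds)
        return job_id

    def _holds_lease(self, job_id, worker_id):
        lease = self._leases.get(job_id)
        return lease is not None and lease[0] == worker_id

    def heartbeat(self, job_id, worker_id):
        """Extend the lease. False means the lease was lost and the worker should stop."""
        if not self._holds_lease(job_id, worker_id):
            return False
        self._leases[job_id] = (worker_id, self._clock() + self._lease_seconds)
        return True

    def complete(self, job_id, worker_id):
        """Accept a result only from the current lease holder."""
        if not self._holds_lease(job_id, worker_id):
            return False
        del self._leases[job_id]
        return True

    def reap_expired(self):
        """Requeue jobs whose lease expired; park them after max_attempts."""
        now = self._clock()
        for job_id, (_, expires_at) in list(self._leases.items()):
            if expires_at <= now:
                del self._leases[job_id]
                if self._attempts[job_id] >= self._max_attempts:
                    self.dead.append(job_id)
                else:
                    self._ready.append(job_id)
```

The lease length trades off recovery time against false positives: a short lease reclaims work from a dead worker quickly, but a worker that stalls briefly may lose a job it was about to finish. Rejecting `complete` from a stale holder keeps a slow, presumed-dead worker from overwriting the result of the retry.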

Part 3 — Safety, Storage, Delivery, and Observability

Design input/output safety, where the generated videos live and how they reach users, and what you monitor.

Follow-up Questions

How would you support priority tiers (free vs. paid) without indefinitely starving free users?
A generation takes 4 minutes and the worker crashes at minute 3 — exactly what happens, and who notices?
How would you add an "extend this clip" / iterative-editing feature on top of this design?
How would you roll out and A/B test a new, more expensive model version safely without blowing the GPU budget?
