What difficulty level is this interview question?

This is a hard-difficulty System Design question, commonly asked during Technical Screen rounds at Google.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at Google during technical interviews.

Design a Task Scheduler for Opaque Long-Running GPU Jobs ("Design Sora")

This system design question evaluates a candidate's ability to architect a distributed job scheduler and orchestrator for expensive, long-running, GPU-bound tasks. It tests core competencies in asynchronous processing, multi-tenant fairness, fault tolerance, and scalable resource management — skills commonly probed at the senior software engineer level.

Design a task scheduler that runs long-running, opaque video-generation jobs at scale (the "Design Sora" prompt).

Your platform lets users submit a text prompt and receive a generated video back. The actual generation is performed by a black-box binary: you hand it a prompt (plus parameters such as resolution, duration, and seed) and, some minutes later, it emits a video file. You do not control or modify that binary — treat it as an opaque, GPU-bound task that takes a highly variable amount of time (seconds to many minutes), consumes a whole GPU (or several) while it runs, and may crash, hang, or run out of memory.

Your job is to design the system around that binary: accept user submissions, queue them, schedule them onto a fleet of GPU workers, track each job's lifecycle, handle failures and retries, enforce fairness/priority across users, and deliver the finished video back to the user. In other words, the interesting problem here is a distributed job scheduler / orchestrator for expensive, long-running, unreliable tasks — the video model itself is intentionally a black box.
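
The lifecycle tracking and retry handling described above can be sketched as a small state machine. This is a minimal illustration, not a reference design: the state names, the `MAX_ATTEMPTS` retry budget, and the `Job` class are all hypothetical, and a real system would persist each transition in a durable store rather than in memory.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed transitions. Anything else is rejected, so every job always sits in
# exactly one known state and cannot be silently lost.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.QUEUED, JobState.FAILED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}

MAX_ATTEMPTS = 3  # hypothetical retry budget, not a number from the prompt


class Job:
    def __init__(self, job_id: str):
        self.job_id = job_id
        self.state = JobState.QUEUED
        self.attempts = 0

    def _move(self, new_state: JobState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def start(self) -> None:
        """A worker picked the job up; count the attempt."""
        self._move(JobState.RUNNING)
        self.attempts += 1

    def succeed(self) -> None:
        self._move(JobState.SUCCEEDED)

    def attempt_failed(self) -> None:
        """Crash, hang, or OOM: requeue while budget remains, else fail for good."""
        if self.attempts < MAX_ATTEMPTS:
            self._move(JobState.QUEUED)
        else:
            self._move(JobState.FAILED)
```

Keeping the terminal states (`SUCCEEDED`, `FAILED`) free of outgoing transitions means a late or duplicate worker report is rejected instead of overwriting a finished job.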

Constraints & Assumptions

State your own numbers; reasonable defaults to anchor on:

~1–5M submitted jobs/day (tens of jobs/sec average, with bursty peaks of several hundred/sec).
Each job occupies 1+ GPUs for a p50 of ~1–2 min and a p99 of ~10+ min; jobs are not preemptible mid-generation in the simple version (the binary is a black box).
A fleet of thousands of GPUs across multiple regions/zones; GPUs are the scarce, expensive resource — target high utilization.
Output videos are tens to hundreds of MB; store in blob storage and serve via signed URLs / CDN.
Multi-tenant: per-user/per-org quotas and priority tiers (e.g., free vs. paid) must be enforced; no single user may starve others.
Availability target for the control plane (submit/status) ~99.9%+; an individual job may fail and be retried, but a job must never be silently lost.
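
A quick back-of-the-envelope check shows how these defaults fit together. The helper names below are made up for illustration, and two inputs are assumptions rather than figures from the prompt: a 90-second mean job duration (the prompt gives only p50 and p99) and a 70% target utilization.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def avg_jobs_per_sec(jobs_per_day: float) -> float:
    """Average arrival rate implied by a daily job count."""
    return jobs_per_day / SECONDS_PER_DAY


def gpus_needed(jobs_per_day: float, mean_job_seconds: float,
                gpus_per_job: int = 1, utilization: float = 0.7) -> float:
    """Little's law: busy GPUs = arrival rate x service time x GPUs per job.

    Dividing by the target utilization (an assumed 0.7 by default) gives the
    fleet size needed to carry that load with headroom.
    """
    busy = avg_jobs_per_sec(jobs_per_day) * mean_job_seconds * gpus_per_job
    return busy / utilization


def daily_output_tb(jobs_per_day: float, mean_video_mb: float) -> float:
    """Blob storage written per day, in terabytes (1 TB = 1,000,000 MB)."""
    return jobs_per_day * mean_video_mb / 1_000_000


if __name__ == "__main__":
    for jobs in (1_000_000, 5_000_000):
        print(jobs, round(avg_jobs_per_sec(jobs), 1),
              round(gpus_needed(jobs, mean_job_seconds=90)))
    print(daily_output_tb(5_000_000, mean_video_mb=100))
```

Under these assumptions, 1–5M jobs/day works out to roughly 12 to 58 jobs/sec on average, and 5M jobs/day at a 90-second mean keeps about 5,200 GPUs busy before any utilization headroom, which is consistent with a fleet of thousands of GPUs. At 100 MB per video, 5M jobs/day writes about 500 TB/day, so retention policy matters early.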

Clarifying Questions to Ask

Is generation strictly asynchronous (submit now, retrieve later), or is there any interactive/streaming preview requirement?
What are the priority/fairness rules across tenants — strict tiers, weighted fair sharing, or per-user concurrency caps?
What is the expected GPU footprint per job, and can a job span multiple GPUs/nodes, or is it always single-GPU?
What should happen to a job that exceeds a wall-clock budget — kill and retry, kill and fail, or let it run?
What are the retention and access-control requirements for generated videos (who can fetch a result, for how long)?
Are there hard cost ceilings or per-user spend caps that the scheduler must enforce?
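
To make the fairness options in the second question concrete, here is a rough sketch that combines weighted fair sharing across tiers with a per-user concurrency cap. The `FairScheduler` class, the tier names, and the weights are hypothetical, and the sketch is an in-memory, single-process illustration rather than a production scheduler.

```python
from collections import defaultdict, deque


class FairScheduler:
    """Pick the next job by lowest weighted tier usage, skipping capped users."""

    def __init__(self, tier_weights: dict, per_user_cap: int):
        self.tier_weights = tier_weights        # e.g. {"paid": 4, "free": 1}
        self.per_user_cap = per_user_cap        # max concurrent jobs per user
        self.queues = defaultdict(deque)        # user -> FIFO of job ids
        self.user_tier = {}                     # user -> tier name
        self.running = defaultdict(int)         # user -> jobs currently on GPUs
        self.virtual_time = defaultdict(float)  # tier -> weighted service so far

    def submit(self, user: str, tier: str, job_id) -> None:
        self.user_tier[user] = tier
        self.queues[user].append(job_id)

    def next_job(self):
        """Return (user, job_id) for a free GPU, or None if no one is eligible."""
        eligible = [
            user for user, queue in self.queues.items()
            if queue and self.running[user] < self.per_user_cap
        ]
        if not eligible:
            return None
        # Lowest tier virtual time wins; ties go to the user with fewer running jobs.
        user = min(
            eligible,
            key=lambda u: (self.virtual_time[self.user_tier[u]], self.running[u]),
        )
        tier = self.user_tier[user]
        # A higher weight advances virtual time more slowly, so that tier is
        # picked proportionally more often.
        self.virtual_time[tier] += 1.0 / self.tier_weights[tier]
        self.running[user] += 1
        return user, self.queues[user].popleft()

    def finished(self, user: str) -> None:
        """Call when one of the user's jobs leaves its GPU (success or failure)."""
        self.running[user] -= 1
```

With weights of 4 for paid and 1 for free and both queues backlogged, paid jobs receive about four of every five dispatches while free jobs still make progress. One known weakness of this simple form: a tier that has been idle for a long time keeps a low virtual time and can burst when it returns, so practical variants usually clamp virtual time to a recent floor.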

Follow-up Questions

A class of prompts reliably makes the binary hang and burn a GPU for 30 minutes before timing out. How do you detect and contain this so it does not degrade everyone else's latency?
The black-box binary releases a new version with different GPU-memory and runtime characteristics. How do you roll it out safely across the fleet without a global outage or a thundering-herd of retries?
Paying customers complain that during traffic spikes their jobs sit behind a flood of free-tier jobs. Concretely, how does your fairness/priority mechanism fix this, and what are its failure modes?
How would you add a per-user, real-time spend cap that stops scheduling new jobs once a budget is hit, given that you only learn a job's true cost after it finishes?
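
For the spend-cap question, one possible direction is reserve-then-settle: hold a worst-case cost estimate when a job is admitted, then replace it with the true cost when the job finishes. The sketch below is illustrative only. The `SpendCap` class and its method names are hypothetical, the worst-case estimate is assumed to come from something like a wall-clock budget multiplied by a GPU price, and a real deployment would keep this state in a transactional store rather than in process memory.

```python
import threading


class SpendCap:
    """Reserve a worst-case cost at admission; settle to the true cost at completion."""

    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0    # settled cost of finished jobs
        self.reserved = {}  # job_id -> worst-case estimate for in-flight jobs
        self._lock = threading.Lock()

    def try_admit(self, job_id: str, worst_case_cost: float) -> bool:
        """Admit only if settled spend plus all reservations stays within budget."""
        with self._lock:
            committed = self.spent + sum(self.reserved.values())
            if committed + worst_case_cost > self.budget:
                return False
            self.reserved[job_id] = worst_case_cost
            return True

    def settle(self, job_id: str, actual_cost: float) -> None:
        """Release the reservation and record what the job really cost."""
        with self._lock:
            self.reserved.pop(job_id)
            self.spent += actual_cost


if __name__ == "__main__":
    cap = SpendCap(budget=10.0)
    print(cap.try_admit("j1", worst_case_cost=6.0))  # first job fits
    print(cap.try_admit("j2", worst_case_cost=6.0))  # would exceed the budget
    cap.settle("j1", actual_cost=2.0)                # true cost was lower
    print(cap.try_admit("j2", worst_case_cost=6.0))  # now fits
```

The trade-off is that reservations are pessimistic: a user near the cap may be blocked even though the in-flight jobs will end up costing less. A tighter estimate admits more work but risks overshooting the budget.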
