Design a Task Scheduler for Opaque Long-Running GPU Jobs ("Design Sora")
Company: Google
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design a task scheduler that runs long-running, opaque video-generation jobs at scale (the "Design Sora" prompt).
Your platform lets users submit a text prompt and receive a generated video back. The actual generation is performed by a **black-box binary**: you hand it a prompt (plus parameters such as resolution, duration, and seed) and, anywhere from seconds to many minutes later, it emits a video file. You do **not** control or modify that binary. Treat it as an opaque, GPU-bound task with highly variable runtime that consumes a whole GPU (or several) while it runs and may crash, hang, or run out of memory.
Your job is to design the **system around** that binary: accept user submissions, queue them, schedule them onto a fleet of GPU workers, track each job's lifecycle, handle failures and retries, enforce fairness/priority across users, and deliver the finished video back to the user. In other words, the interesting problem here is a **distributed job scheduler / orchestrator for expensive, long-running, unreliable tasks** — the video model itself is intentionally a black box.
```hint Reframe the prompt
The phrase "Design Sora" is a red herring — you are **not** designing a video model. Restate the problem out loud as "design a distributed scheduler for long-running, GPU-bound, opaque tasks" and design to that. The binary is just a unit of work with a duration, a resource footprint, and a failure rate.
```
```hint Decompose the lifecycle
Walk a single job through every state: `submitted → queued → scheduled → running → (succeeded | failed | timed_out | cancelled)`. Each transition is where the hard questions live (admission/quotas, queueing/fairness, placement onto GPUs, heartbeating a running job, retry vs. dead-letter, result delivery).
```
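One way to make the lifecycle concrete is an explicit transition table that rejects illegal moves. The sketch below is illustrative only: the state names come from the hint above, while the retry edges (`failed`/`timed_out` back to `queued`), the requeue edge from `scheduled`, and the early-cancel edges are assumptions you would confirm with the interviewer.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    CANCELLED = "cancelled"


# Allowed transitions. Assumptions (not stated in the prompt): failed and
# timed-out jobs may be re-queued for retry, a scheduled job may be re-queued
# if its worker disappears, and cancellation is allowed before completion.
ALLOWED = {
    JobState.SUBMITTED: {JobState.QUEUED, JobState.CANCELLED},
    JobState.QUEUED: {JobState.SCHEDULED, JobState.CANCELLED},
    JobState.SCHEDULED: {JobState.RUNNING, JobState.QUEUED, JobState.CANCELLED},
    JobState.RUNNING: {
        JobState.SUCCEEDED,
        JobState.FAILED,
        JobState.TIMED_OUT,
        JobState.CANCELLED,
    },
    JobState.FAILED: {JobState.QUEUED},
    JobState.TIMED_OUT: {JobState.QUEUED},
    JobState.SUCCEEDED: set(),
    JobState.CANCELLED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

In a real system the check would likely run as a conditional update against the job row (compare the current state, then set the new one), so that two workers cannot both move the same job forward.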
```hint Decouple the slow part
Generation takes minutes, so the submit API must be **asynchronous**: accept the request, persist a job row, return a `job_id` immediately, and let the user poll or get notified. Never hold an HTTP request open for the duration of generation.
```
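As a rough illustration of the asynchronous contract, the sketch below uses an in-memory dictionary in place of a durable job table. The names `JobStore`, `submit`, and `status` are hypothetical, and the optional idempotency key (so a client retry does not create a duplicate job) is an added assumption, not part of the prompt.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobStore:
    """In-memory stand-in for a durable job table (illustration only)."""

    jobs: dict = field(default_factory=dict)
    by_idempotency_key: dict = field(default_factory=dict)

    def submit(self, user_id, prompt, idempotency_key=None):
        """Persist a job row and return its id at once; nothing is generated here."""
        if idempotency_key is not None:
            key = (user_id, idempotency_key)
            # A retried submit with the same key maps to the original job.
            if key in self.by_idempotency_key:
                return self.by_idempotency_key[key]
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "user_id": user_id,
            "prompt": prompt,
            "state": "queued",
            "result_url": None,
        }
        if idempotency_key is not None:
            self.by_idempotency_key[key] = job_id
        return job_id

    def status(self, job_id):
        """What a polling client would see; result_url is filled in on success."""
        job = self.jobs[job_id]
        return {"job_id": job_id, "state": job["state"], "result_url": job["result_url"]}
```

The submit path does only cheap work (validate, write a row, return), which is what lets the control plane stay responsive while generation takes minutes elsewhere.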
### Constraints & Assumptions
State your own numbers; reasonable defaults to anchor on:
- ~1–5M submitted jobs/day (roughly 12–58 jobs/sec on average, with bursty peaks of several hundred/sec).
- Each job occupies 1+ GPUs for a p50 of ~1–2 min and a p99 of ~10+ min; jobs are **not** preemptible mid-generation in the simple version (the binary is a black box).
- A fleet of thousands of GPUs across multiple regions/zones; GPUs are the scarce, expensive resource — target high utilization.
- Output videos are tens to hundreds of MB; store in blob storage and serve via signed URLs / CDN.
- Multi-tenant: per-user/per-org quotas and priority tiers (e.g., free vs. paid) must be enforced; no single user may starve others.
- Availability target for the control plane (submit/status) of ~99.9% or better; an individual job may fail and be retried, but no job may ever be silently lost.
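A quick sanity check on these numbers uses Little's law: the average number of busy GPUs equals arrival rate times mean service time times GPUs per job. The sketch below assumes a mean job time of 90 seconds, one GPU per job, and a 70% utilization target; none of those three values is given in the prompt, so state them as your own assumptions.

```python
import math


def gpus_needed(jobs_per_day, mean_job_seconds, gpus_per_job=1, target_utilization=0.7):
    """Estimate steady-state fleet size from average load (Little's law)."""
    arrival_rate = jobs_per_day / 86_400  # jobs per second
    busy_gpus = arrival_rate * mean_job_seconds * gpus_per_job
    # Divide by the utilization target to leave headroom above the average.
    return math.ceil(busy_gpus / target_utilization)


low = gpus_needed(1_000_000, mean_job_seconds=90)
high = gpus_needed(5_000_000, mean_job_seconds=90)
print(low, high)
```

Under these assumptions the estimate lands at roughly 1,500 to 7,400 GPUs, which is consistent with a fleet of "thousands of GPUs". This sizes for average load only; the bursty peaks are expected to be absorbed by the queue rather than by idle capacity.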
### Clarifying Questions to Ask
- Is generation strictly asynchronous (submit now, retrieve later), or is there any interactive/streaming preview requirement?
- What are the priority/fairness rules across tenants — strict tiers, weighted fair sharing, or per-user concurrency caps?
- What is the expected GPU footprint per job, and can a job span multiple GPUs/nodes, or is it always single-GPU?
- What should happen to a job that exceeds a wall-clock budget — kill and retry, kill and fail, or let it run?
- What are the retention and access-control requirements for generated videos (who can fetch a result, for how long)?
- Are there hard cost ceilings or per-user spend caps that the scheduler must enforce?
### Follow-up Questions
- A class of prompts reliably makes the binary hang and burn a GPU for 30 minutes before timing out. How do you detect and contain this so it does not degrade everyone else's latency?
- The black-box binary releases a new version with different GPU-memory and runtime characteristics. How do you roll it out safely across the fleet without a global outage or a thundering herd of retries?
- Paying customers complain that during traffic spikes their jobs sit behind a flood of free-tier jobs. Concretely, how does your fairness/priority mechanism fix this, and what are its failure modes?
- How would you add a per-user, real-time spend cap that stops scheduling new jobs once a budget is hit, given that you only learn a job's true cost after it finishes?
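For the spend-cap question, one common pattern (offered here as a sketch, not as the expected answer) is reserve-then-settle: hold an estimated cost against the budget at scheduling time, then replace the hold with the true cost when the job finishes. The class name `Budget`, its methods, and the idea that a cost estimate is available up front are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Per-user budget with holds for in-flight jobs (hypothetical sketch)."""

    limit: float
    spent: float = 0.0
    reserved: dict = field(default_factory=dict)  # job_id -> estimated cost

    def try_reserve(self, job_id, estimated_cost):
        """Hold the estimate; refuse to schedule if it would exceed the limit."""
        committed = self.spent + sum(self.reserved.values())
        if committed + estimated_cost > self.limit:
            return False
        self.reserved[job_id] = estimated_cost
        return True

    def settle(self, job_id, actual_cost):
        """Release the hold and record the true cost once the job is done."""
        self.reserved.pop(job_id, None)
        self.spent += actual_cost


budget = Budget(limit=10.0)
print(budget.try_reserve("j1", 6.0))  # first job fits within the limit
print(budget.try_reserve("j2", 6.0))  # second would push holds past the limit
budget.settle("j1", actual_cost=3.0)
print(budget.try_reserve("j2", 6.0))  # fits once j1 settles below its estimate
```

The trade-off is that a conservative estimate blocks jobs the user could actually afford, while an optimistic one allows overshoot of up to the estimation error across in-flight jobs. A failure mode worth raising is a hold that is never settled (for example, a lost worker), which needs an expiry or reconciliation path.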
Quick Answer: This system design question evaluates a candidate's ability to architect a distributed job scheduler and orchestrator for expensive, long-running, GPU-bound tasks. It tests core competencies in asynchronous processing, multi-tenant fairness, fault tolerance, and scalable resource management — skills commonly probed at the senior software engineer level.