Design a Hosted Notebook Platform
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
Design a **hosted notebook platform** for interactive code execution: a cloud-based notebook service in the spirit of Google Colab, Deepnote, or Hex, where each user edits notebooks in the browser and runs code on a live, isolated backend kernel.
Each user manages an isolated **workspace**, which bundles the user's **runtime environment** (kernel, installed packages, in-memory process state) and their **persisted notebook files**. Your design must support the following lifecycle operations:
- **Create** a workspace
- **Delete** a workspace
- **Suspend** a workspace (release expensive compute while preserving state)
- **Resume** a suspended workspace (bring it back to a usable state)
The platform must support **500,000 concurrently connected users**, and a **Resume** must complete in **under 5 seconds** for typical requests. Notebook files and any saved state must **survive** a suspend/resume cycle.
Walk through your design end-to-end: the high-level architecture, the split between control plane and runtime execution, the workspace lifecycle state machine, how you specifically hit the sub-5-second resume target, how you scale to 500K concurrent users cost-effectively, and the major failure modes with their mitigations.
```hint Where to start
This is a stateful-compute orchestration problem. Separate the **control plane** (metadata, scheduling, lifecycle state machine, routing) from the **data plane** (the sandboxes that actually run user code). Almost every later decision falls out of this split.
```
```hint The core latency lever
"Resume in <5s" rules out a full cold boot (image pull + container/microVM start + kernel init + dependency load can be tens of seconds). Ask what work you could do *before* the resume request arrives instead of on the critical path — what state can you pre-materialize or keep warm so resume becomes a fast attach rather than a rebuild?
```
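One way to make the resume budget concrete is to add up the steps on each path. The sketch below is illustrative only: the step names and every per-step cost are assumed placeholders, not measurements. Only the 5-second budget comes from the prompt.

```python
from dataclasses import dataclass

RESUME_BUDGET_S = 5.0  # from the prompt: resume must finish in under 5 seconds

# Assumed step costs in seconds. Illustrative placeholders, not measurements.
STEP_COST_S = {
    "route_and_auth": 0.2,
    "claim_warm_sandbox": 0.3,
    "attach_user_volume": 1.0,
    "start_kernel": 1.5,
    "pull_image": 20.0,   # cold path only
    "boot_sandbox": 5.0,  # cold path only
}

WARM_PATH = ["route_and_auth", "claim_warm_sandbox", "attach_user_volume", "start_kernel"]
COLD_PATH = ["route_and_auth", "pull_image", "boot_sandbox", "attach_user_volume", "start_kernel"]


@dataclass
class ResumePlan:
    path: str
    steps: list
    estimated_s: float
    within_budget: bool


def plan_resume(warm_sandboxes_available: int) -> ResumePlan:
    """Use the warm path when a pre-booted sandbox exists, else fall back to a cold boot."""
    if warm_sandboxes_available > 0:
        name, steps = "warm", WARM_PATH
    else:
        name, steps = "cold", COLD_PATH
    total = sum(STEP_COST_S[step] for step in steps)
    return ResumePlan(name, steps, total, total < RESUME_BUDGET_S)
```

With these placeholder numbers the warm path fits the budget and the cold path does not, which is the argument for moving image pull and sandbox boot off the critical path. A candidate would replace the placeholders with their own estimates.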
```hint Decouple compute from data
If user files live on the runtime's local disk, every resume must rebuild them. Keep notebooks and data on **network-attached / distributed persistent storage** so resume only has to restore compute, not re-fetch the user's world. This also makes node failure recoverable.
```
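A minimal sketch of that decoupling, assuming a hypothetical metadata record in which the storage reference and the compute placement are separate fields (the field names and IDs are invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkspaceRecord:
    workspace_id: str
    volume_id: str                 # durable network volume; outlives any single node
    node_id: Optional[str] = None  # compute placement; None while suspended


def suspend(ws: WorkspaceRecord) -> None:
    """Release compute. The volume reference is untouched, so files persist."""
    ws.node_id = None


def resume(ws: WorkspaceRecord, node_id: str) -> None:
    """Place the workspace on any healthy node and reattach the same volume."""
    ws.node_id = node_id


ws = WorkspaceRecord("ws-1", volume_id="vol-abc", node_id="node-7")
suspend(ws)
resume(ws, "node-42")  # a different node than before; the volume is unchanged
```

Because nothing durable is tied to `node_id`, the same code path handles both a user-initiated resume and recovery after a node is lost.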
```hint Don't build one giant cluster
500K concurrent is as much a blast-radius and scheduling problem as a raw-capacity one — think about how to partition it so one bad shard can't take down everyone, and how lifecycle transitions stay correct under retries and crashes. Also question the word "concurrent": how many of those users are actually executing code versus connected-but-idle, and what does that let you do?
```
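One possible way to partition workspaces across independent cells is rendezvous (highest-random-weight) hashing, sketched below. The cell names are hypothetical, and an explicit assignment table in the metadata store would be an equally valid choice; the point here is only that placement is deterministic and that losing one cell moves only that cell's workspaces.

```python
import hashlib
from typing import Sequence


def _score(workspace_id: str, cell: str) -> int:
    """Stable pseudo-random weight for a (cell, workspace) pair."""
    digest = hashlib.sha256(f"{cell}:{workspace_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_cell(workspace_id: str, cells: Sequence[str]) -> str:
    """Return the cell with the highest weight for this workspace."""
    if not cells:
        raise ValueError("no cells available")
    return max(cells, key=lambda cell: _score(workspace_id, cell))
```

If a cell is drained or fails, calling `pick_cell` with the remaining cells reassigns only the workspaces that lived on the removed cell; every other workspace keeps its placement.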
### Constraints & Assumptions
- **Concurrency**: 500,000 concurrently *connected* users; assume a much smaller fraction (e.g. 10–30%) are actively executing code at any instant — the rest are idle-connected.
- **Resume SLA**: p95 resume latency < 5 s for "typical" (recently suspended, common-image) workspaces; cold-start fallbacks may be slower but must remain available.
- **Durability**: notebook files and explicitly saved state must persist indefinitely across suspend/resume and node loss; in-memory kernel state is best-effort.
- **Isolation**: users run arbitrary, untrusted code — multi-tenant isolation and noisy-neighbor protection are hard requirements.
- **Cost**: idle workspaces vastly outnumber active ones; idle cost must be near-zero (no dedicated reserved compute per idle user).
- **Geography**: assume a global user base; multi-region is desirable, but you may scope availability targets (zone vs. region) explicitly.
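A quick back-of-the-envelope calculation shows what the idle-heavy assumption buys. The connected-user count and the 10–30% active range come from the constraints above; `KERNELS_PER_CELL` is an assumed figure chosen only to make the arithmetic concrete.

```python
CONNECTED_USERS = 500_000             # from the prompt
ACTIVE_FRACTION_RANGE = (0.10, 0.30)  # from the stated assumption
KERNELS_PER_CELL = 5_000              # assumption: running kernels one cell can host


def active_users(connected: int, fraction: float) -> int:
    """Users actually executing code at a given instant."""
    return round(connected * fraction)


def cells_needed(active: int, per_cell: int) -> int:
    """Ceiling division: cells required to host all active kernels."""
    return -(-active // per_cell)


for fraction in ACTIVE_FRACTION_RANGE:
    active = active_users(CONNECTED_USERS, fraction)
    idle = CONNECTED_USERS - active
    print(fraction, active, idle, cells_needed(active, KERNELS_PER_CELL))
# prints: 0.1 50000 450000 10
# prints: 0.3 150000 350000 30
```

Under these assumptions, compute needs to be sized for 50,000 to 150,000 running kernels rather than 500,000, and the remaining 350,000 to 450,000 connected users only need a cheap connection and metadata footprint.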
### Clarifying Questions to Ask
- Do workspaces need **GPUs/accelerators**, or is this CPU-only? Does hardware type vary per workspace and affect placement?
- Must **in-memory kernel state** (variables, loaded models) survive suspend, or only files on disk? This decides whether we need memory snapshots at all.
- What's the **idle policy** — do we auto-suspend after N minutes of inactivity, and is that user-configurable or tier-based?
- Are there **free vs. paid tiers** with different resource limits, isolation guarantees, or resume priorities?
- What are the **availability and durability targets** (single-zone, multi-zone, multi-region; RPO/RTO for user data)?
- What are the **package/customization** rules — fixed curated images, or arbitrary `pip install` that mutates the environment we must persist?
### What a Strong Answer Covers
This is a rubric of *dimensions a strong answer should address* — not a checklist of specific mechanisms. A senior candidate is judged on whether they reason about each dimension and justify their choices, not on naming a particular technology.
- **Separation of responsibilities**: a clear boundary between the component that *decides* what should happen to a workspace and the component that *runs* user code, and why that boundary makes the rest of the design tractable.
- **Lifecycle correctness**: an explicit state model for create/suspend/resume/delete that stays correct under retries, duplicate events, and worker crashes mid-transition.
- **The resume budget**: a reasoned account of where the <5 s goes and how the common path avoids the slow parts of a cold boot, with explicit fallbacks for when the fast path is unavailable and an honest latency breakdown.
- **Durability vs. compute statefulness**: a defensible stance on what must survive (files, saved state) versus what is best-effort (in-memory state), and how the storage design guarantees the former independent of which machine runs the code.
- **Isolation for untrusted code**: a justified isolation boundary and noisy-neighbor controls, with the trade-off between strength of isolation and overhead/start-up cost made explicit.
- **Scaling strategy**: how the architecture absorbs 500K users without a single global bottleneck or blast radius, and how it exploits the idle-heavy workload to control cost.
- **Failure handling**: graceful degradation across node / zone / region / metadata / queue / capacity / registry failures — shedding speed or in-memory state rather than data or availability.
- **Observability**: the SLO metrics and alerts that would actually tell you the resume SLA is at risk before users feel it.
- **The central trade-off**: explicit articulation of the tension between resume speed, scalability, and idle cost, and where this design lands on it.
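To illustrate the lifecycle-correctness dimension, here is a minimal sketch of a state machine whose transitions are safe under retries and duplicate events. The state names, the `request_id` deduplication scheme, and the in-memory dictionary (standing in for a transactional metadata store) are all assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field

# Assumed state model; a real design may add failure or repair states.
ALLOWED = {
    ("CREATING", "RUNNING"),
    ("RUNNING", "SUSPENDING"),
    ("SUSPENDING", "SUSPENDED"),
    ("SUSPENDED", "RESUMING"),
    ("RESUMING", "RUNNING"),
    ("RUNNING", "DELETING"),
    ("SUSPENDED", "DELETING"),
    ("DELETING", "DELETED"),
}


class InvalidTransition(Exception):
    pass


@dataclass
class Workspace:
    state: str = "CREATING"
    version: int = 0
    applied: set = field(default_factory=set)  # request IDs already applied


class LifecycleStore:
    """In-memory stand-in for a transactional metadata store."""

    def __init__(self):
        self._rows = {}

    def create(self, ws_id: str) -> Workspace:
        return self._rows.setdefault(ws_id, Workspace())  # repeat creates are no-ops

    def transition(self, ws_id: str, expected: str, target: str, request_id: str) -> str:
        ws = self._rows[ws_id]
        if request_id in ws.applied:
            return ws.state  # duplicate delivery: report current state, change nothing
        if ws.state != expected or (expected, target) not in ALLOWED:
            raise InvalidTransition(f"{ws.state} -> {target} (expected {expected})")
        ws.state = target
        ws.version += 1
        ws.applied.add(request_id)
        return ws.state
```

The compare on `expected` rejects stale workers that act on an old view of the workspace, while the `request_id` check makes a retried request harmless. In a real store both checks would run inside a single conditional write.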
### Follow-up Questions
- Your warm pool is exhausted during a regional demand spike (e.g. a viral course assignment). What degrades, in what order, and how do you protect the resume SLA for as many users as possible?
- A user runs `pip install` that mutates the environment, then suspends. How do you guarantee that mutation survives resume without forcing a full cold rebuild every time?
- How would you support **GPU** workspaces, given that warm-pooling expensive accelerators costs far more than warm-pooling CPUs? Does your resume strategy change?
- How do you prevent a malicious user from escaping the sandbox or exhausting a node, and how do you detect and contain it at 500K scale?
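For the warm-pool exhaustion question, one possible answer is to triage by priority so that the scarce warm slots go to the requests whose SLA matters most, while everyone else degrades to the slower cold path instead of failing. The sketch below assumes a hypothetical numeric priority per request (lower is more important, for example paid before free); the tiering itself is an assumption, not something the prompt specifies.

```python
from typing import Dict, List, Tuple


def assign_resume_paths(
    requests: List[Tuple[int, str]], warm_slots: int
) -> Dict[str, str]:
    """Map each workspace ID to "warm" or "cold".

    requests: (priority, workspace_id) pairs in arrival order; lower priority
    number wins. sorted() is stable, so ties keep arrival order.
    """
    ordered = sorted(requests, key=lambda request: request[0])
    return {
        ws_id: ("warm" if index < warm_slots else "cold")
        for index, (_, ws_id) in enumerate(ordered)
    }


paths = assign_resume_paths(
    [(1, "free-a"), (0, "paid-a"), (1, "free-b"), (0, "paid-b")],
    warm_slots=3,
)
# paths == {"paid-a": "warm", "paid-b": "warm", "free-a": "warm", "free-b": "cold"}
```

The cold path still resumes the workspace, only more slowly, which matches the constraint that cold-start fallbacks may be slower but must remain available.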
Quick Answer: This question evaluates system design skills around stateful compute orchestration, including control-plane versus data-plane separation, workspace lifecycle management, fast resume latencies, and large-scale partitioning for a hosted interactive notebook platform.