What difficulty level is this interview question?

This is a medium difficulty System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design a Hosted Notebook Platform | OpenAI Interview Question

Q: Design a Hosted Notebook Platform

This question evaluates system design skills around stateful compute orchestration, including control-plane versus data-plane separation, workspace lifecycle management, fast resume latencies, and large-scale partitioning for a hosted interactive notebook platform.

Design a hosted notebook platform for interactive code execution — a cloud-based notebook service in the spirit of Google Colab, Deepnote, or Hex — where each user runs code in their browser against a live, isolated backend kernel.

Each user manages an isolated workspace, which bundles the user's runtime environment (kernel, installed packages, in-memory process state) and their persisted notebook files. Your design must support the following lifecycle operations:

Create a workspace
Delete a workspace
Suspend a workspace (release expensive compute while preserving state)
Resume a suspended workspace (bring it back to a usable state)

The platform must support 500,000 concurrent connected users, and a Resume must complete in under 5 seconds for typical requests. Notebook files and any saved state must survive a suspend/resume cycle.

Walk through your design end-to-end: the high-level architecture, the split between control plane and runtime execution, the workspace lifecycle state machine, how you specifically hit the sub-5-second resume target, how you scale to 500K concurrent users cost-effectively, and the major failure modes with their mitigations.
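One way to make the workspace lifecycle state machine concrete is a small transition table guarded by a version check, so that retried or duplicated requests become no-ops. This is a minimal sketch, not a prescribed answer: the intermediate states (CREATING, SUSPENDING, RESUMING, DELETING), the `Workspace` class, and the `expected_version` parameter are assumptions introduced here for illustration.

```python
from enum import Enum


class State(Enum):
    CREATING = "creating"
    RUNNING = "running"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    DELETING = "deleting"
    DELETED = "deleted"


# Allowed transitions (hypothetical); anything not listed is rejected.
TRANSITIONS = {
    State.CREATING: {State.RUNNING, State.DELETING},
    State.RUNNING: {State.SUSPENDING, State.DELETING},
    State.SUSPENDING: {State.SUSPENDED, State.DELETING},
    State.SUSPENDED: {State.RESUMING, State.DELETING},
    State.RESUMING: {State.RUNNING, State.SUSPENDED, State.DELETING},
    State.DELETING: {State.DELETED},
    State.DELETED: set(),
}


class Workspace:
    def __init__(self, workspace_id):
        self.workspace_id = workspace_id
        self.state = State.CREATING
        self.version = 0  # bumped on every applied transition

    def transition(self, target, expected_version):
        """Apply a transition guarded by an optimistic version check.

        Returns True if applied, False if the caller's version is stale
        (a retry or duplicate event). Raises ValueError if the transition
        is not in the table.
        """
        if expected_version != self.version:
            return False
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.name} -> {target.name} not allowed")
        self.state = target
        self.version += 1
        return True


ws = Workspace("ws-1")
ws.transition(State.RUNNING, expected_version=0)     # applied
ws.transition(State.SUSPENDING, expected_version=1)  # applied
ws.transition(State.SUSPENDING, expected_version=1)  # duplicate, ignored
```

In a real control plane the version check would presumably be a conditional write against the metadata store rather than an in-memory comparison; the sketch only shows the shape of the guard.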

Constraints & Assumptions

Concurrency: 500,000 concurrently connected users; assume a much smaller fraction (e.g. 10–30%) are actively executing code at any instant, with the rest idle-connected.
Resume SLA: p95 resume latency < 5 s for "typical" (recently suspended, common-image) workspaces; cold-start fallbacks may be slower but must remain available.
Durability: notebook files and explicitly saved state must persist indefinitely across suspend/resume and node loss; in-memory kernel state is best-effort.
Isolation: users run arbitrary, untrusted code, so multi-tenant isolation and noisy-neighbor protection are hard requirements.
Cost: idle workspaces vastly outnumber active ones; idle cost must be near-zero (no dedicated reserved compute per idle user).
Assume a global user base; multi-region is desirable but you may scope availability targets (zone vs. region) explicitly.
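The concurrency assumption above translates into a rough sizing range for active compute. The sketch below works through that arithmetic; the 500,000 figure and the 10–30% active range come from the constraints, while `KERNELS_PER_NODE` is a purely hypothetical packing density chosen for illustration.

```python
CONNECTED_USERS = 500_000        # from the problem statement
ACTIVE_PERCENTS = (10, 30)       # from the problem statement
KERNELS_PER_NODE = 50            # hypothetical packing density, not from the source


def size_fleet(connected, active_percent, kernels_per_node):
    """Return (active kernels, nodes needed) using integer arithmetic."""
    active = connected * active_percent // 100
    nodes = -(-active // kernels_per_node)  # ceiling division
    return active, nodes


for pct in ACTIVE_PERCENTS:
    active, nodes = size_fleet(CONNECTED_USERS, pct, KERNELS_PER_NODE)
    print(f"{pct}% active: {active:,} kernels, {nodes:,} nodes")
```

Under these assumptions the active fleet ranges from 50,000 to 150,000 kernels (1,000 to 3,000 nodes), which is why the remaining idle-connected users should not hold reserved compute.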

Clarifying Questions to Ask

Do workspaces need GPUs/accelerators, or is this CPU-only? Does hardware type vary per workspace and affect placement?
Must in-memory kernel state (variables, loaded models) survive suspend, or only files on disk? This decides whether we need memory snapshots at all.
What's the idle policy — do we auto-suspend after N minutes of inactivity, and is that user-configurable or tier-based?
Are there free vs. paid tiers with different resource limits, isolation guarantees, or resume priorities?
What are the availability and durability targets (single-zone, multi-zone, multi-region; RPO/RTO for user data)?
What are the package/customization rules — fixed curated images, or arbitrary pip install that mutates the environment we must persist?

What a Strong Answer Covers

This is a rubric of dimensions a strong answer should address — not a checklist of specific mechanisms. A senior candidate is judged on whether they reason about each dimension and justify their choices, not on naming a particular technology.

Separation of responsibilities: a clear boundary between the component that decides what should happen to a workspace and the component that runs user code, and why that boundary makes the rest of the design tractable.
Lifecycle correctness: an explicit state model for create/suspend/resume/delete that stays correct under retries, duplicate events, and worker crashes mid-transition.
The resume budget: a reasoned account of where the <5 s goes and how the common path avoids the slow parts of a cold boot, with explicit fallbacks for when the fast path is unavailable and an honest latency breakdown.
Durability vs. compute statefulness: a defensible stance on what must survive (files, saved state) versus what is best-effort (in-memory state), and how the storage design guarantees the former independent of which machine runs the code.
Isolation for untrusted code: a justified isolation boundary and noisy-neighbor controls, with the trade-off between strength of isolation and overhead/start-up cost made explicit.
Scaling strategy: how the architecture absorbs 500K users without a single global bottleneck or blast radius, and how it exploits the idle-heavy workload to control cost.
Failure handling: graceful degradation across node / zone / region / metadata / queue / capacity / registry failures, shedding speed or in-memory state rather than data or availability.
Observability: the SLO metrics and alerts that would actually tell you the resume SLA is at risk before users feel it.
The central trade-off: explicit articulation of the tension between resume speed, scalability, and idle cost, and where this design lands on it.
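The "resume budget" dimension can be made tangible by summing per-stage latencies for the fast path and its fallbacks and comparing each total against the 5 s target. The following is a sketch under stated assumptions: every stage name and every millisecond value in `STAGE_MS` is a hypothetical placeholder, not a measurement, and a candidate would substitute their own breakdown.

```python
RESUME_BUDGET_MS = 5_000  # from the stated p95 resume target

# Hypothetical stage timings in milliseconds, for illustration only.
STAGE_MS = {
    "claim_warm_sandbox": 200,
    "cold_boot_sandbox": 18_000,   # scheduling plus image pull on a cold node
    "attach_workspace_volume": 800,
    "restore_memory_snapshot": 2_000,
    "start_fresh_kernel": 1_500,
    "reconnect_client": 300,
}


def plan_resume(warm_sandbox_available, snapshot_available):
    """Pick resume stages and report whether they fit the budget.

    Files are reattached on every path, so durability never depends on
    which path is taken; only speed and in-memory state degrade.
    """
    stages = [
        "claim_warm_sandbox" if warm_sandbox_available else "cold_boot_sandbox",
        "attach_workspace_volume",
        "restore_memory_snapshot" if snapshot_available else "start_fresh_kernel",
        "reconnect_client",
    ]
    total_ms = sum(STAGE_MS[s] for s in stages)
    return stages, total_ms, total_ms <= RESUME_BUDGET_MS


for warm in (True, False):
    for snap in (True, False):
        _, total, ok = plan_resume(warm, snap)
        print(f"warm={warm} snapshot={snap}: {total} ms, within budget: {ok}")
```

With these placeholder numbers only the warm-sandbox paths fit inside 5 s, which illustrates why the cold boot has to be kept off the common path and treated as an availability fallback.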

Follow-up Questions

Your warm pool is exhausted during a regional demand spike (e.g. a viral course assignment). What degrades, in what order, and how do you protect the resume SLA for whoever you can?
A user runs pip install that mutates the environment, then suspends. How do you guarantee that mutation survives resume without forcing a full cold rebuild every time?
How would you support GPU workspaces given that warm-pooling expensive accelerators is far more costly than CPU — does your resume strategy change?
How do you prevent a malicious user from escaping the sandbox or exhausting a node, and how do you detect and contain it at 500K scale?
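For the warm-pool exhaustion follow-up, one possible degradation order is to hold back part of the pool for higher-priority resumes and let lower-priority resumes fall through to a cold start first. This is only a sketch of that idea: the `WarmPool` class, the `paid_reserve` field, and the two tier names are assumptions for illustration, since the question leaves tiering open as a clarifying point.

```python
from dataclasses import dataclass


@dataclass
class WarmPool:
    available: int     # pre-started sandboxes ready to claim
    paid_reserve: int  # hypothetical: sandboxes held back for paid-tier resumes

    def claim(self, tier):
        """Return "warm" if a pre-started sandbox was claimed, else "cold".

        Free-tier requests may only draw the pool down to the reserve;
        paid-tier requests may drain it completely.
        """
        floor = 0 if tier == "paid" else self.paid_reserve
        if self.available > floor:
            self.available -= 1
            return "warm"
        return "cold"


pool = WarmPool(available=3, paid_reserve=2)
results = [pool.claim(t) for t in ("free", "free", "paid", "paid", "paid")]
print(results)
```

Here the second free-tier request is the first to degrade to a cold start, while paid-tier requests keep the fast path until the pool is empty. A "cold" result still resumes the workspace, only more slowly, so availability is preserved.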
