PracHub

Design a Hosted Notebook Platform

Last updated: Jun 17, 2026

Quick Overview

This question evaluates system design skills around stateful compute orchestration, including control-plane versus data-plane separation, workspace lifecycle management, fast resume latencies, and large-scale partitioning for a hosted interactive notebook platform.


Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Design a hosted notebook platform for interactive code execution: a cloud-based notebook service in the spirit of Google Colab, Deepnote, or Hex, where each user runs code in their browser against a live, isolated backend kernel.

Each user manages an isolated workspace, which bundles the user's runtime environment (kernel, installed packages, in-memory process state) and their persisted notebook files. Your design must support the following lifecycle operations:

  • Create a workspace
  • Delete a workspace
  • Suspend a workspace (release expensive compute while preserving state)
  • Resume a suspended workspace (bring it back to a usable state)

The platform must support 500,000 concurrent connected users, and a Resume must complete in under 5 seconds for typical requests. Notebook files and any saved state must survive a suspend/resume cycle.

Walk through your design end-to-end: the high-level architecture, the split between control plane and runtime execution, the workspace lifecycle state machine, how you specifically hit the sub-5-second resume target, how you scale to 500K concurrent users cost-effectively, and the major failure modes with their mitigations.

Hints

  • Where to start: This is a stateful-compute orchestration problem. Separate the control plane (metadata, scheduling, lifecycle state machine, routing) from the data plane (the sandboxes that actually run user code). Almost every later decision falls out of this split.
  • The core latency lever: "Resume in <5s" rules out a full cold boot (image pull + container/microVM start + kernel init + dependency load can be tens of seconds). Ask what work you could do before the resume request arrives instead of on the critical path: what state can you pre-materialize or keep warm so resume becomes a fast attach rather than a rebuild?
  • Decouple compute from data: If user files live on the runtime's local disk, every resume must rebuild them. Keep notebooks and data on network-attached / distributed persistent storage so resume only has to restore compute, not re-fetch the user's world. This also makes node failure recoverable.
  • Don't build one giant cluster: 500K concurrent is as much a blast-radius and scheduling problem as a raw-capacity one. Think about how to partition it so one bad shard can't take down everyone, and how lifecycle transitions stay correct under retries and crashes. Also question the word "concurrent": how many of those users are actually executing code versus connected-but-idle, and what does that let you do?
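
The "core latency lever" hint can be made concrete with a rough budget comparison. The sketch below is illustrative only: the step names, the per-step timings, and the `choose_resume_path` helper are hypothetical placeholders rather than measurements or part of the original question. It only shows why a warm attach can fit inside the 5-second target while a cold boot cannot.

```python
from typing import Dict, Tuple

RESUME_BUDGET_S = 5.0  # target taken from the problem statement

# Hypothetical per-step latencies in seconds (illustrative, not measured).
COLD_BOOT_STEPS: Dict[str, float] = {
    "image_pull": 12.0,
    "sandbox_start": 3.0,
    "kernel_init": 2.0,
    "dependency_load": 8.0,
}
WARM_ATTACH_STEPS: Dict[str, float] = {
    "claim_warm_sandbox": 0.2,
    "attach_persistent_volume": 1.0,
    "kernel_handshake": 0.8,
    "reroute_session": 0.3,
}


def total_latency(steps: Dict[str, float]) -> float:
    """Sum the sequential steps on the critical path."""
    return sum(steps.values())


def choose_resume_path(warm_available: bool) -> Tuple[str, float]:
    """Prefer the warm attach; fall back to a cold boot if no warm sandbox exists."""
    if warm_available:
        return "warm_attach", total_latency(WARM_ATTACH_STEPS)
    return "cold_boot", total_latency(COLD_BOOT_STEPS)


def within_budget(latency_s: float) -> bool:
    """True if the estimated latency fits the resume target."""
    return latency_s < RESUME_BUDGET_S
```

Under these assumed numbers the warm path stays inside the budget and the cold path does not, which is the argument for moving image pull and dependency load off the critical path.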


Apr 12, 2026, 12:00 AM


Constraints & Assumptions

  • Concurrency: 500,000 concurrently connected users; assume a much smaller fraction (e.g. 10–30%) are actively executing code at any instant, with the rest idle-connected.
  • Resume SLA: p95 resume latency < 5 s for "typical" (recently suspended, common-image) workspaces; cold-start fallbacks may be slower but must remain available.
  • Durability: notebook files and explicitly saved state must persist indefinitely across suspend/resume and node loss; in-memory kernel state is best-effort.
  • Isolation: users run arbitrary, untrusted code, so multi-tenant isolation and noisy-neighbor protection are hard requirements.
  • Cost: idle workspaces vastly outnumber active ones; idle cost must be near-zero (no dedicated reserved compute per idle user).
  • Assume a global user base; multi-region is desirable but you may scope availability targets (zone vs. region) explicitly.
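
The concurrency assumption above translates into a quick sizing estimate. The following is a back-of-the-envelope sketch: the 10–30% active fraction comes from the constraints, while `WORKSPACES_PER_NODE` is a hypothetical packing density chosen only to make the arithmetic concrete.

```python
import math

CONNECTED_USERS = 500_000
ACTIVE_FRACTION_RANGE = (0.10, 0.30)  # from the stated assumption
WORKSPACES_PER_NODE = 50              # hypothetical packing density


def active_workspaces(connected: int, fraction: float) -> int:
    """Workspaces that need live compute at a given active fraction."""
    return round(connected * fraction)


def nodes_needed(active: int, per_node: int) -> int:
    """Nodes required to host the active workspaces, rounded up."""
    return math.ceil(active / per_node)


low_active = active_workspaces(CONNECTED_USERS, ACTIVE_FRACTION_RANGE[0])
high_active = active_workspaces(CONNECTED_USERS, ACTIVE_FRACTION_RANGE[1])
low_nodes = nodes_needed(low_active, WORKSPACES_PER_NODE)
high_nodes = nodes_needed(high_active, WORKSPACES_PER_NODE)
```

With these inputs the platform needs live compute for roughly 50,000 to 150,000 workspaces, not 500,000, which is where the idle-heavy workload pays off in cost.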

Clarifying Questions to Ask

  • Do workspaces need GPUs/accelerators, or is this CPU-only? Does hardware type vary per workspace and affect placement?
  • Must in-memory kernel state (variables, loaded models) survive suspend, or only files on disk? This decides whether we need memory snapshots at all.
  • What's the idle policy — do we auto-suspend after N minutes of inactivity, and is that user-configurable or tier-based?
  • Are there free vs. paid tiers with different resource limits, isolation guarantees, or resume priorities?
  • What are the availability and durability targets (single-zone, multi-zone, multi-region; RPO/RTO for user data)?
  • What are the package/customization rules: fixed curated images, or arbitrary `pip install` that mutates the environment we must persist?
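
The idle-policy question can be explored with a small decision function. This is a minimal sketch under assumed answers: the tier names, the timeout values, and the `Workspace` fields are all hypothetical, and a real policy would depend on what the interviewer says about tiers and user configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tier-based idle timeouts in seconds.
IDLE_TIMEOUT_S = {"free": 10 * 60, "paid": 60 * 60}


@dataclass
class Workspace:
    workspace_id: str
    tier: str
    last_activity_ts: float    # epoch seconds of the last cell run or edit
    kernel_busy: bool = False  # True while a cell is still executing


def should_suspend(ws: Workspace, now: float,
                   override_timeout_s: Optional[float] = None) -> bool:
    """Suspend only if the kernel is not busy and the idle timeout has elapsed.

    A user-configured override, when given, takes precedence over the tier default.
    """
    if ws.kernel_busy:
        return False
    timeout = override_timeout_s if override_timeout_s is not None else IDLE_TIMEOUT_S[ws.tier]
    return (now - ws.last_activity_ts) >= timeout
```

Guarding on `kernel_busy` reflects one possible choice: a long-running cell counts as activity even when the user is not typing.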

What a Strong Answer Covers

This is a rubric of dimensions a strong answer should address — not a checklist of specific mechanisms. A senior candidate is judged on whether they reason about each dimension and justify their choices, not on naming a particular technology.

  • Separation of responsibilities: a clear boundary between the component that decides what should happen to a workspace and the component that runs user code, and why that boundary makes the rest of the design tractable.
  • Lifecycle correctness: an explicit state model for create/suspend/resume/delete that stays correct under retries, duplicate events, and worker crashes mid-transition.
  • The resume budget: a reasoned account of where the <5 s goes and how the common path avoids the slow parts of a cold boot, with explicit fallbacks for when the fast path is unavailable and an honest latency breakdown.
  • Durability vs. compute statefulness: a defensible stance on what must survive (files, saved state) versus what is best-effort (in-memory state), and how the storage design guarantees the former independent of which machine runs the code.
  • Isolation for untrusted code: a justified isolation boundary and noisy-neighbor controls, with the trade-off between strength of isolation and overhead/start-up cost made explicit.
  • Scaling strategy: how the architecture absorbs 500K users without a single global bottleneck or blast radius, and how it exploits the idle-heavy workload to control cost.
  • Failure handling: graceful degradation across node / zone / region / metadata / queue / capacity / registry failures, shedding speed or in-memory state rather than data or availability.
  • Observability: the SLO metrics and alerts that would actually tell you the resume SLA is at risk before users feel it.
  • The central trade-off: explicit articulation of the tension between resume speed, scalability, and idle cost, and where this design lands on it.
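
The "lifecycle correctness" dimension is easier to discuss with an explicit state model in hand. The sketch below is one possible shape, not a prescribed answer: the state names, the `ALLOWED` transition table, and the in-memory `WorkspaceRecord` are assumptions standing in for a row in a metadata store updated with compare-and-set. It shows how a request ID plus a version check can make transitions safe under retries and duplicate events.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    CREATING = "creating"
    RUNNING = "running"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    DELETING = "deleting"
    DELETED = "deleted"


# Hypothetical transition table; intermediate states make crashes mid-transition visible.
ALLOWED: Dict[State, Set[State]] = {
    State.CREATING: {State.RUNNING, State.DELETING},
    State.RUNNING: {State.SUSPENDING, State.DELETING},
    State.SUSPENDING: {State.SUSPENDED, State.DELETING},
    State.SUSPENDED: {State.RESUMING, State.DELETING},
    State.RESUMING: {State.RUNNING, State.SUSPENDED, State.DELETING},
    State.DELETING: {State.DELETED},
    State.DELETED: set(),
}


class WorkspaceRecord:
    """In-memory stand-in for a metadata-store row."""

    def __init__(self, workspace_id: str) -> None:
        self.workspace_id = workspace_id
        self.state = State.CREATING
        self.version = 0
        self.applied_requests: Set[str] = set()


def transition(record: WorkspaceRecord, target: State,
               request_id: str, expected_version: int) -> bool:
    """Apply a transition at most once; return True if this request is reflected."""
    if request_id in record.applied_requests:
        return True   # duplicate delivery or retry: already applied, no-op
    if record.version != expected_version:
        return False  # stale writer lost the compare-and-set
    if target not in ALLOWED[record.state]:
        return False  # illegal transition for the current state
    record.state = target
    record.version += 1
    record.applied_requests.add(request_id)
    return True
```

A worker that crashes after moving a workspace to `SUSPENDING` leaves a record that a reconciler can find and either complete or roll forward, which is the reason for modeling the intermediate states explicitly.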

Follow-up Questions

  • Your warm pool is exhausted during a regional demand spike (e.g. a viral course assignment). What degrades, in what order, and how do you protect the resume SLA for whoever you can?
  • A user runs `pip install` that mutates the environment, then suspends. How do you guarantee that mutation survives resume without forcing a full cold rebuild every time?
  • How would you support GPU workspaces given that warm-pooling expensive accelerators is far more costly than CPU — does your resume strategy change?
  • How do you prevent a malicious user from escaping the sandbox or exhausting a node, and how do you detect and contain it at 500K scale?
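
For the warm-pool exhaustion follow-up, one possible degradation order is to hand out the remaining warm sandboxes by tier priority and send everyone else to a slower cold-boot queue. The sketch below assumes that policy purely for illustration; the tier names, the `TIER_PRIORITY` ordering, and the `assign_resume_paths` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical priority: lower number is served from the warm pool first.
TIER_PRIORITY = {"paid": 0, "free": 1}


@dataclass
class ResumeRequest:
    workspace_id: str
    tier: str
    arrival_order: int  # monotonically increasing sequence number


def assign_resume_paths(requests: List[ResumeRequest],
                        warm_slots: int) -> List[Tuple[str, str]]:
    """Give warm sandboxes to the highest-priority requests; queue the rest for cold boot."""
    ordered = sorted(requests, key=lambda r: (TIER_PRIORITY[r.tier], r.arrival_order))
    assignments = []
    for index, request in enumerate(ordered):
        path = "warm_attach" if index < warm_slots else "cold_boot_queue"
        assignments.append((request.workspace_id, path))
    return assignments
```

The point of the sketch is that the SLA is protected for a bounded set of requests while the rest degrade in speed only; no request is dropped and no data is at risk.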

© 2026 PracHub. All rights reserved.