PracHub

Design a sandboxed cloud IDE

Last updated: Apr 9, 2026

Quick Overview

This question evaluates a candidate's competency in designing multi-tenant distributed systems with strong isolation and sandboxing, runtime lifecycle management, real-time output/log streaming, data persistence, autoscaling, and observability.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: easy

Interview Round: Onsite

## System design: Sandboxed cloud IDE (Colab-like)

Design a **multi-tenant, browser-based cloud IDE/notebook** that lets users run code in an isolated sandbox (similar to hosted notebooks).

### Core user experience

- User opens a workspace (project/notebook), edits code in the browser, and runs cells/commands.
- Output appears in the UI (stdout/stderr, rich output).
- Users can view **streaming logs** while code runs.

### Requirements

**Functional**

- Provision an isolated compute environment per workspace/session.
- Execute arbitrary user code safely (sandboxing).
- Stream execution output/logs to the browser in near real time.
- Support basic file operations (upload/download, persisted workspace state).
- Basic collaboration is optional (call out if you include it).

**Non-functional**

- Strong isolation between tenants (security is primary).
- Reasonable startup latency for a new session.
- Support autoscaling and fair resource sharing.
- Observability: metrics, tracing, audit logs.

### Focus areas to cover

- How you choose and manage the compute substrate (VMs vs containers vs microVMs).
- Isolation model (filesystem, network, process, credentials).
- Log/output streaming architecture.
- Lifecycle management: create, run, idle, suspend/resume, terminate.
- Data persistence strategy (workspace files, checkpoints).

State assumptions and provide an API sketch and high-level architecture diagram description.

Solution

### 1) Clarify scope + assumptions

I’ll design a Colab-like **single-user per session** environment first (collab optional), focusing on:

- Untrusted code execution
- VM/microVM lifecycle management
- Streaming logs/output
- Persisted workspace files

Assumptions:

- Workloads are bursty, mostly Python/JS/etc.
- Sessions may idle; we want suspend/stop to save cost.
- We can run on Kubernetes + a VM/microVM layer.

---

### 2) High-level architecture

**Frontend (Web IDE)**

- Editor + terminal/notebook UI
- WebSocket connection for output streaming
- Auth tokens for session access

**Control plane (multi-tenant services)**

1. **AuthN/AuthZ**: user identity, org/workspace permissions
2. **Workspace Service**: metadata (workspace id, owner, config, last state)
3. **Session Manager / Orchestrator**: creates/attaches to runtime, enforces quotas, manages lifecycle
4. **Scheduler / Capacity Manager**: chooses cluster/host pool based on GPU/CPU needs, locality, quotas
5. **Policy Engine**: allowed images, network egress rules, file access rules

**Data plane (per-session runtime)**

- **Sandbox runtime**: microVM/VM/container running user code and an “agent”
- **Runtime Agent**: executes commands/cells, streams logs, manages files, reports health
- **Log/Output pipeline**: agent → streaming gateway → client
- **Storage**: object store for artifacts + network file system or volume snapshots for workspace

---

### 3) Compute isolation choice (VM vs container vs microVM)

**Security-first** suggests avoiding plain containers for untrusted code unless heavily hardened.

Options:

- **Containers (K8s pods)**: fast start, good density, but weaker isolation; requires gVisor/Kata and strict seccomp/AppArmor.
- **Full VMs**: strongest isolation, slower start, heavier.
- **MicroVMs (Firecracker/Kata)**: strong isolation close to VMs with faster startup and higher density.

Recommended:

- **MicroVM-based runtimes** for default untrusted execution.
- Keep a container-based “trusted mode” only for internal/enterprise controlled workloads.

Operationally:

- Run microVMs on a fleet of hosts (possibly managed by K8s with a VM runtime class, or a dedicated microVM manager).

---

### 4) Session lifecycle management

Define states:

- **Provisioning → Running → Idle → Suspended → Terminated**

Key mechanisms:

- **Warm pools**: keep a pool of pre-initialized microVMs/images to reduce cold start.
- **Idle detection**: no active websocket + low CPU for N minutes → transition to Idle.
- **Suspend**: snapshot memory/disk (or at least disk) and stop CPU to save cost.
- **Resume**: restore snapshot or rehydrate from workspace storage.

Policies:

- Per-user/org quotas (CPU cores, memory, GPU count, max concurrent sessions).
- Hard time limits for free tier.

---

### 5) Storage and persistence

You need two kinds of persistence:

1. **Workspace files** (source code, notebooks)
2. **Runtime ephemeral state** (installed packages, caches)

A practical approach:

- Workspace files stored in a **versioned object store** (and/or git-backed) with periodic checkpoints.
- Runtime disk:
  - Base image + writable overlay.
  - Persist overlay to a **per-workspace volume** (network volume) or periodic snapshots to object store.

Tradeoffs:

- Network volumes simplify resume but can bottleneck.
- Snapshots reduce steady-state cost but increase resume time.

For interviews: propose **object store for durable workspace** + **local ephemeral disk** + periodic checkpointing; optionally offer paid tier with persistent volumes.

---

### 6) Secure sandboxing model

Threats: data exfiltration, lateral movement, crypto-mining, container escapes, abusing metadata endpoints.

Controls:

- **Network isolation**
  - Default-deny egress; allowlist common package repos via proxy.
  - Block access to cloud metadata IPs.
  - Per-session NAT and firewall rules.
- **Identity isolation**
  - Short-lived credentials; no node-level credentials inside runtime.
  - If access to internal resources is needed, use scoped tokens + audit.
- **OS hardening**
  - Read-only root FS where possible.
  - seccomp/AppArmor, drop Linux capabilities.
  - Kernel isolation via microVM.
- **Resource controls**
  - cgroups quotas, CPU throttling, memory limits, disk quotas.
  - Detect abusive patterns (e.g., sustained high CPU) and rate-limit/terminate.
- **Supply chain**
  - Signed base images, restricted package installation path (through proxy with scanning), optional malware scanning.

---

### 7) Execution model (notebook/terminal)

Inside each runtime, run a **Runtime Agent** that exposes APIs:

- `POST /execute` (cell/command)
- `POST /interrupt` (send SIGINT)
- `GET /status` (kernel health)
- `GET /fs/*` / `PUT /fs/*` (file operations)

For notebooks, use a kernel protocol (e.g., Jupyter-like) internally, but keep it abstract in the design.

---

### 8) Streaming logs/output

Goal: near-real-time stdout/stderr and structured events (cell started/finished, exit codes).

Design:

1. Runtime Agent writes output to a **local ring buffer** + emits events.
2. Agent streams to a **Streaming Gateway** over gRPC/WebSocket.
3. Gateway fans out to the browser via **WebSocket**.
4. Logs are also persisted asynchronously to a log store (for debugging/audit).

Key details:

- **Backpressure**: if the client is slow/disconnected, buffer up to X MB, then truncate with “output truncated” markers.
- **Reconnect**: client provides `last_event_id`; gateway replays from buffer/log store.
- **Multiplexing**: one websocket per session, channels for stdout/stderr/events.

Why a gateway?

- Avoid exposing runtimes directly to the internet.
- Central place for auth, rate limiting, and protocol translation.

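
The backpressure and reconnect behavior described for the streaming path can be sketched as a bounded buffer keyed by event id. This is a minimal illustration, not a definitive implementation: the names `Event`, `OutputBuffer`, `max_bytes`, and the `[output truncated]` marker text are assumptions made for the example, and a real gateway would also fall back to the log store for older events.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int  # monotonically increasing per session
    channel: str   # "stdout", "stderr", or "events"
    data: str


class OutputBuffer:
    """Hypothetical bounded per-session buffer with replay by last_event_id."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._events = deque()
        self._size = 0
        self._next_id = 1

    def append(self, channel: str, data: str) -> Event:
        event = Event(self._next_id, channel, data)
        self._next_id += 1
        self._events.append(event)
        self._size += len(data.encode("utf-8"))
        # Evict the oldest events once over budget, always keeping the newest.
        while self._size > self.max_bytes and len(self._events) > 1:
            old = self._events.popleft()
            self._size -= len(old.data.encode("utf-8"))
        return event

    def replay(self, last_event_id: int) -> list:
        """Return events newer than last_event_id, flagging any evicted gap."""
        out = []
        oldest = self._events[0].event_id if self._events else self._next_id
        if last_event_id + 1 < oldest:
            # The client missed events that were already evicted.
            out.append(Event(oldest - 1, "events", "[output truncated]"))
        out.extend(e for e in self._events if e.event_id > last_event_id)
        return out


buf = OutputBuffer(max_bytes=10)
for chunk in ("aaaaa", "bbbbb", "ccccc"):
    buf.append("stdout", chunk)
# A client reconnecting with last_event_id=0 gets a truncation marker
# followed by the two chunks that are still buffered.
replayed = buf.replay(0)
```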
---

### 9) Control plane APIs (sketch)

- `POST /workspaces`: create workspace
- `POST /workspaces/{id}/sessions`: create/attach runtime (spec: cpu/mem/gpu, image)
- `GET /sessions/{sid}`: status, endpoint info
- `POST /sessions/{sid}:terminate`
- `POST /sessions/{sid}:suspend` / `:resume`

Streaming:

- `GET wss://.../sessions/{sid}/stream?token=...`

---

### 10) Scheduling, autoscaling, and quotas

- **Bin-pack** sessions by resource (CPU/mem/GPU) on hosts.
- Use **cluster autoscaler** for host pools.
- Separate pools: CPU-only, GPU, high-mem.
- Enforce quotas at admission time in Session Manager.

Fairness:

- Weighted fair sharing per org.
- Preemption for free tier when capacity is tight.

---

### 11) Observability + operations

Metrics:

- Session start latency (p50/p95)
- Running sessions, idle sessions, suspend/resume rates
- Output streaming lag, dropped messages
- Host utilization, noisy-neighbor incidents

Logging:

- Audit: who started sessions, image used, network policy applied
- Security signals: denied egress attempts, suspicious syscalls (if instrumented)

Tracing:

- Request path: create session → scheduler → provisioner → agent ready

---

### 12) Common pitfalls / edge cases

- **Cold start** too slow → warm pools, image caching, smaller base images.
- **User installs huge deps** → disk quotas + caching layers.
- **Infinite output loops** → output caps + server-side rate limiting.
- **Disconnected clients** → whether to keep running is a policy choice; many systems keep running with an idle timer.
- **Secrets leakage** → never mount broad cloud credentials; use scoped per-session tokens.

---

### 13) If collaboration is requested (optional extension)

- Separate “document state” (CRDT/OT) from runtime execution.
- One shared runtime per collaborative notebook is hard (contention, permissions); consider per-user runtimes with shared filesystem or a single owner-exec model.

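
Returning to the admission-time quota enforcement in section 10, the check the Session Manager performs before provisioning might look like the sketch below. The `Resources` fields mirror the quota dimensions listed in section 4 (CPU cores, memory, GPU count, concurrent sessions); the type, the `admit` function, and the error strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resources:
    """Hypothetical quota dimensions for one user or org."""
    cpu_cores: int = 0
    memory_gb: int = 0
    gpus: int = 0
    sessions: int = 0


def admit(usage: Resources, request: Resources, quota: Resources):
    """Reject the request if any dimension would exceed the quota."""
    for f in fields(Resources):
        projected = getattr(usage, f.name) + getattr(request, f.name)
        if projected > getattr(quota, f.name):
            return False, f"quota exceeded: {f.name}"
    return True, "ok"


quota = Resources(cpu_cores=8, memory_gb=32, gpus=1, sessions=2)
usage = Resources(cpu_cores=4, memory_gb=16, gpus=0, sessions=1)

small = admit(usage, Resources(cpu_cores=2, memory_gb=8, sessions=1), quota)
large = admit(usage, Resources(cpu_cores=8, memory_gb=8, sessions=1), quota)
```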
This end-to-end design addresses sandbox security, VM/microVM management, and reliable log streaming while keeping the system operable at scale.
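
As a supplement to the lifecycle states in section 4, here is a minimal sketch of the state machine and the idle rule. The five states come from the text; the exact set of allowed edges (for example Idle back to Running, or Suspended to Running on resume), the function names, and the default thresholds are assumptions for illustration only.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    IDLE = "idle"
    SUSPENDED = "suspended"
    TERMINATED = "terminated"


# Allowed transitions. The wake-up and resume edges are assumptions.
TRANSITIONS = {
    State.PROVISIONING: {State.RUNNING, State.TERMINATED},
    State.RUNNING: {State.IDLE, State.TERMINATED},
    State.IDLE: {State.RUNNING, State.SUSPENDED, State.TERMINATED},
    State.SUSPENDED: {State.RUNNING, State.TERMINATED},
    State.TERMINATED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the edge is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_idle(websocket_connected: bool, cpu_utilization: float,
            quiet_seconds: float, idle_after_seconds: float = 600.0,
            cpu_threshold: float = 0.05) -> bool:
    """Idle rule from section 4: no active websocket and low CPU for N minutes."""
    return (not websocket_connected
            and cpu_utilization < cpu_threshold
            and quiet_seconds >= idle_after_seconds)


state = transition(State.PROVISIONING, State.RUNNING)
if is_idle(websocket_connected=False, cpu_utilization=0.01, quiet_seconds=900):
    state = transition(state, State.IDLE)
```

Keeping the transition table as data makes it straightforward to audit which lifecycle edges the Session Manager permits.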

Related Interview Questions

  • Design an Instagram-like Feed System - OpenAI (medium)
  • Design Online Chess Matchmaking - OpenAI (hard)
  • Design a Distributed Crossword Solver - OpenAI (hard)
  • Design a Distributed Rate Limiter - OpenAI
  • Design a Distributed Crossword Solver - OpenAI (medium)
Jan 22, 2026, 12:00 AM
© 2026 PracHub. All rights reserved.