Design a sandboxed cloud IDE
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: easy
Interview Round: Onsite
## System design: Sandboxed cloud IDE (Colab-like)
Design a **multi-tenant, browser-based cloud IDE/notebook** that lets users run code in an isolated sandbox (similar to hosted notebooks).
### Core user experience
- User opens a workspace (project/notebook), edits code in the browser, and runs cells/commands.
- Output appears in the UI (stdout/stderr, rich output).
- Users can view **streaming logs** while code runs.
### Requirements
**Functional**
- Provision an isolated compute environment per workspace/session.
- Execute arbitrary user code safely (sandboxing).
- Stream execution output/logs to the browser in near real time.
- Support basic file operations (upload/download, persisted workspace state).
- Basic collaboration is optional (call out if you include it).
**Non-functional**
- Strong isolation between tenants (security is primary).
- Reasonable startup latency for a new session.
- Support autoscaling and fair resource sharing.
- Observability: metrics, tracing, audit logs.
### Focus areas to cover
- How you choose and manage the compute substrate (VMs vs containers vs microVMs).
- Isolation model (filesystem, network, process, credentials).
- Log/output streaming architecture.
- Lifecycle management: create, run, idle, suspend/resume, terminate.
- Data persistence strategy (workspace files, checkpoints).
State your assumptions, and provide an API sketch and a description of the high-level architecture.
Quick Answer: This question evaluates a candidate's competency in designing multi-tenant distributed systems with strong isolation and sandboxing, runtime lifecycle management, real-time output/log streaming, data persistence, autoscaling, and observability.
Solution
### 1) Clarify scope + assumptions
I’ll design a Colab-like **single-user-per-session** environment first (collaboration optional), focusing on:
- Untrusted code execution
- VM/microVM lifecycle management
- Streaming logs/output
- Persisted workspace files
Assumptions:
- Workloads are bursty, mostly Python/JS/etc.
- Sessions may idle; we want suspend/stop to save cost.
- We can run on Kubernetes + a VM/microVM layer.
---
### 2) High-level architecture
**Frontend (Web IDE)**
- Editor + terminal/notebook UI
- WebSocket connection for output streaming
- Auth tokens for session access
**Control plane (multi-tenant services)**
1. **AuthN/AuthZ**: user identity, org/workspace permissions
2. **Workspace Service**: metadata (workspace id, owner, config, last state)
3. **Session Manager / Orchestrator**: creates/attaches to runtime, enforces quotas, manages lifecycle
4. **Scheduler / Capacity Manager**: chooses cluster/host pool based on GPU/CPU needs, locality, quotas
5. **Policy Engine**: allowed images, network egress rules, file access rules
**Data plane (per-session runtime)**
- **Sandbox runtime**: microVM/VM/container running user code and an “agent”
- **Runtime Agent**: executes commands/cells, streams logs, manages files, reports health
- **Log/Output pipeline**: agent → streaming gateway → client
- **Storage**: object store for artifacts + network file system or volume snapshots for workspace
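Putting the pieces above together, here is a compact text sketch of the architecture (a stand-in for the diagram the prompt asks to describe):

```
Browser (editor / terminal / notebook UI)
   |  HTTPS (control API)            |  WSS (output stream)
   v                                 v
Control plane                  Streaming Gateway
(AuthN/AuthZ, Workspace Svc,         |
 Session Manager, Scheduler,         |  gRPC/WebSocket
 Policy Engine)                      v
   |  provision / terminate    Runtime Agent (inside microVM sandbox)
   v                                 |
Host fleet (microVM hosts)     Object store / workspace volumes
```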
---
### 3) Compute isolation choice (VM vs container vs microVM)
A **security-first** posture suggests avoiding plain containers for untrusted code unless they are heavily hardened.
Options:
- **Containers (K8s pods)**: fast start, good density, but weaker isolation; requires hardening (e.g., gVisor or Kata) plus strict seccomp/AppArmor profiles.
- **Full VMs**: strongest isolation, slower start, heavier.
- **MicroVMs (Firecracker/Kata)**: strong isolation close to VMs with faster startup and higher density.
Recommended:
- **MicroVM-based runtimes** for default untrusted execution.
- Keep a container-based “trusted mode” only for internal or enterprise-controlled workloads.
Operationally:
- Run microVMs on a fleet of hosts (possibly managed by K8s with a VM runtime class, or a dedicated microVM manager).
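To make the microVM path concrete, here is a minimal provisioner-side sketch that drives one Firecracker microVM through its HTTP-over-unix-socket API. The socket path, image locations, and sizes are assumptions, and a real orchestrator would also attach a writable overlay drive and a network interface:

```python
# Hypothetical provisioner-side sketch: configure and boot one Firecracker microVM
# via its API socket. Paths and sizes are illustrative, not production values.
import httpx

FC_SOCKET = "/run/firecracker/session-1234.sock"   # one API socket per microVM (assumed layout)

client = httpx.Client(transport=httpx.HTTPTransport(uds=FC_SOCKET),
                      base_url="http://localhost")

# 1. Size the VM to the session's quota.
client.put("/machine-config", json={"vcpu_count": 2, "mem_size_mib": 2048})

# 2. Point at a signed, cached base image (kernel + read-only rootfs).
client.put("/boot-source", json={
    "kernel_image_path": "/images/vmlinux-sandbox",
    "boot_args": "console=ttyS0 reboot=k panic=1",
})
client.put("/drives/rootfs", json={
    "drive_id": "rootfs",
    "path_on_host": "/images/sandbox-rootfs.ext4",
    "is_root_device": True,
    "is_read_only": True,   # writable per-session overlay attached as a second drive
})

# 3. Boot; the Runtime Agent inside the guest reports ready to the Session Manager.
client.put("/actions", json={"action_type": "InstanceStart"})
```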
---
### 4) Session lifecycle management
Define states:
- **Provisioning → Running → Idle → Suspended → Terminated**
Key mechanisms:
- **Warm pools**: keep a pool of pre-initialized microVMs/images to reduce cold start.
- **Idle detection**: no active websocket + low CPU for N minutes → transition to Idle.
- **Suspend**: snapshot memory/disk (or at least disk) and stop CPU to save cost.
- **Resume**: restore snapshot or rehydrate from workspace storage.
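A minimal sketch of this lifecycle as the Session Manager might track it; the timeouts are illustrative, and the snapshot/restore side effects are left as comments:

```python
# Sketch of the session state machine. State names mirror the list above;
# IDLE_AFTER_S / SUSPEND_AFTER_S are assumed policy values.
import enum
import time

class SessionState(enum.Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    IDLE = "idle"
    SUSPENDED = "suspended"
    TERMINATED = "terminated"

IDLE_AFTER_S = 15 * 60      # no active WebSocket + low CPU for 15 min -> Idle
SUSPEND_AFTER_S = 60 * 60   # Idle for 1 h -> snapshot and Suspend

class Session:
    def __init__(self, session_id: str):
        self.id = session_id
        self.state = SessionState.PROVISIONING
        self.last_activity = time.time()

    def on_agent_ready(self) -> None:
        self.state = SessionState.RUNNING

    def on_activity(self) -> None:
        # Called on execute requests or a live client connection.
        self.last_activity = time.time()
        if self.state in (SessionState.IDLE, SessionState.SUSPENDED):
            # A real resume path restores the snapshot before flipping state.
            self.state = SessionState.RUNNING

    def tick(self, now: float | None = None) -> None:
        # Called periodically by the Session Manager.
        now = now or time.time()
        idle_for = now - self.last_activity
        if self.state is SessionState.RUNNING and idle_for > IDLE_AFTER_S:
            self.state = SessionState.IDLE
        elif self.state is SessionState.IDLE and idle_for > SUSPEND_AFTER_S:
            self.state = SessionState.SUSPENDED   # trigger disk/memory snapshot here
```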
Policies:
- Per-user/org quotas (CPU cores, memory, GPU count, max concurrent sessions).
- Hard time limits for free tier.
---
### 5) Storage and persistence
You need two kinds of persistence:
1. **Workspace files** (source code, notebooks)
2. **Runtime ephemeral state** (installed packages, caches)
A practical approach:
- Workspace files stored in a **versioned object store** (and/or git-backed) with periodic checkpoints.
- Runtime disk:
- Base image + writable overlay.
- Persist overlay to a **per-workspace volume** (network volume) or periodic snapshots to object store.
Tradeoffs:
- Network volumes simplify resume but can bottleneck.
- Snapshots reduce steady-state cost but increase resume time.
For interviews: propose **object store for durable workspace** + **local ephemeral disk** + periodic checkpointing; optionally offer paid tier with persistent volumes.
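A sketch of the periodic checkpoint loop the Runtime Agent could run, assuming an S3-compatible object store via boto3; the bucket name, workspace path, and interval are hypothetical:

```python
# Illustrative checkpoint loop: tar the workspace and push it to the object store.
import tarfile
import tempfile
import time

import boto3

BUCKET = "example-workspace-checkpoints"   # hypothetical bucket
WORKSPACE_DIR = "/workspace"
CHECKPOINT_INTERVAL_S = 300

s3 = boto3.client("s3")

def checkpoint(workspace_id: str) -> None:
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        with tarfile.open(tmp.name, "w:gz") as tar:
            tar.add(WORKSPACE_DIR, arcname=".")
        key = f"{workspace_id}/{int(time.time())}.tar.gz"   # one object per checkpoint
        s3.upload_file(tmp.name, BUCKET, key)

if __name__ == "__main__":
    while True:
        checkpoint("ws-1234")
        time.sleep(CHECKPOINT_INTERVAL_S)
```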
---
### 6) Secure sandboxing model
Threats: data exfiltration, lateral movement, crypto-mining, container escapes, abusing metadata endpoints.
Controls:
- **Network isolation**
- Default-deny egress; allowlist common package repos via proxy.
- Block access to cloud metadata IPs.
- Per-session NAT and firewall rules.
- **Identity isolation**
- Short-lived credentials; no node-level credentials inside runtime.
- If access to internal resources is needed, use scoped tokens + audit.
- **OS hardening**
- Read-only root FS where possible.
- seccomp/AppArmor, drop Linux capabilities.
- Kernel isolation via microVM.
- **Resource controls** (see the sketch after this list)
- cgroups quotas, CPU throttling, memory limits, disk quotas.
- Detect abusive patterns (e.g., sustained high CPU) and rate-limit/terminate.
- **Supply chain**
- Signed base images, restricted package installation path (through proxy with scanning), optional malware scanning.
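As a concrete slice of the resource-controls item above: the hard limits belong in cgroups and the microVM configuration, but the Runtime Agent can also apply cheap per-process rlimits before exec as a backstop. This is a sketch with illustrative values, not the full enforcement path:

```python
# Per-process backstop limits applied in the child before exec (Linux only).
import resource
import subprocess

def run_limited(cmd: list[str], cpu_seconds: int = 60, mem_bytes: int = 1 << 30):
    def apply_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))   # CPU-time cap
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))        # address-space cap
        resource.setrlimit(resource.RLIMIT_NOFILE, (256, 256))                # file-descriptor cap

    return subprocess.run(
        cmd,
        preexec_fn=apply_limits,     # runs in the child between fork and exec
        capture_output=True,
        timeout=cpu_seconds * 2,     # wall-clock backstop for sleep-heavy code
    )
```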
---
### 7) Execution model (notebook/terminal)
Inside each runtime run a **Runtime Agent** that exposes APIs:
- `POST /execute` (cell/command)
- `POST /interrupt` (send SIGINT)
- `GET /status` (kernel health)
- `GET /fs/*` / `PUT /fs/*` (file operations)
For notebooks, use a kernel protocol (e.g., Jupyter-like) internally, but keep it abstract in the design.
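A stripped-down sketch of that agent surface, assuming FastAPI inside the sandbox and one execution at a time; a real agent would stream output incrementally instead of buffering it:

```python
# Hypothetical Runtime Agent skeleton (runs inside the sandbox).
import signal
import subprocess

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
current: subprocess.Popen | None = None   # single execution at a time, for brevity

class ExecuteRequest(BaseModel):
    command: list[str]                    # e.g. ["python", "-c", "print('hi')"]

@app.post("/execute")
def execute(req: ExecuteRequest):
    global current
    current = subprocess.Popen(req.command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, text=True)
    out, err = current.communicate()      # real agent streams instead of blocking
    return {"exit_code": current.returncode, "stdout": out, "stderr": err}

@app.post("/interrupt")
def interrupt():
    if current is not None and current.poll() is None:
        current.send_signal(signal.SIGINT)
        return {"interrupted": True}
    return {"interrupted": False}

@app.get("/status")
def status():
    busy = current is not None and current.poll() is None
    return {"busy": busy}
```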
---
### 8) Streaming logs/output
Goal: near-real-time stdout/stderr and structured events (cell started/finished, exit codes).
Design:
1. Runtime Agent writes output to a **local ring buffer** + emits events.
2. Agent streams to a **Streaming Gateway** over gRPC/WebSocket.
3. The gateway fans out to the browser via **WebSocket**.
4. Also persist logs asynchronously to a log store (for debugging/audit).
Key details:
- **Backpressure**: if the client is slow or disconnected, buffer up to X MB, then truncate with “output truncated” markers.
- **Reconnect**: client provides `last_event_id`; gateway replays from buffer/log store.
- **Multiplexing**: one WebSocket per session, with channels for stdout/stderr/events.
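A small sketch of the gateway-side buffer that backs `last_event_id` replay and the truncation behavior above; the capacity and event shape are assumptions:

```python
# Bounded replay buffer: drops the oldest events under backpressure and supports
# replay from a client-supplied last_event_id on reconnect.
from collections import deque

class ReplayBuffer:
    def __init__(self, max_events: int = 10_000):
        self.events = deque(maxlen=max_events)   # oldest events fall off when full
        self.next_id = 0

    def append(self, channel: str, data: str) -> int:
        event_id = self.next_id
        self.next_id += 1
        self.events.append({"id": event_id, "channel": channel, "data": data})
        return event_id

    def replay_from(self, last_event_id: int):
        """Return (missed_events, truncated); truncated means older output already fell off."""
        missed = [e for e in self.events if e["id"] > last_event_id]
        oldest_kept = self.events[0]["id"] if self.events else self.next_id
        truncated = oldest_kept > last_event_id + 1   # gap between client position and buffer
        return missed, truncated

# On reconnect the gateway calls replay_from(last_event_id); if truncated is True
# it prepends an "output truncated" marker before streaming the missed events.
```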
Why a gateway?
- Avoid exposing runtimes directly to the internet.
- Central place for auth, rate limiting, and protocol translation.
---
### 9) Control plane APIs (sketch)
- `POST /workspaces` create workspace
- `POST /workspaces/{id}/sessions` create/attach runtime (spec: cpu/mem/gpu, image)
- `GET /sessions/{sid}` status, endpoint info
- `POST /sessions/{sid}:terminate`
- `POST /sessions/{sid}:suspend` / `:resume`
Streaming:
- `GET wss://.../sessions/{sid}/stream?token=...`
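A hypothetical client-side call sequence against these endpoints; the base URL, auth header, and payload/field names are assumptions:

```python
# Hypothetical client flow: create a session, poll until ready, then attach the stream.
import time

import requests

BASE = "https://ide.example.com/api/v1"            # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}

# 1. Create/attach a runtime for an existing workspace (spec fields are illustrative).
session = requests.post(f"{BASE}/workspaces/ws-123/sessions",
                        json={"cpu": 2, "memory_gb": 8, "gpu": 0, "image": "python-3.11-base"},
                        headers=HEADERS).json()
sid = session["id"]

# 2. Poll status until the Runtime Agent reports the session is running.
while requests.get(f"{BASE}/sessions/{sid}", headers=HEADERS).json()["state"] != "running":
    time.sleep(1)

# 3. Open wss://.../sessions/{sid}/stream?token=... with a WebSocket client to
#    receive stdout/stderr/events for executed cells.
```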
---
### 10) Scheduling, autoscaling, and quotas
- **Bin-pack** sessions by resource (CPU/mem/GPU) onto hosts (see the first-fit sketch at the end of this section).
- Use **cluster autoscaler** for host pools.
- Separate pools: CPU-only, GPU, high-mem.
- Enforce quotas at admission time in Session Manager.
Fairness:
- Weighted fair sharing per org.
- Preemption for the free tier when capacity is tight.
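A toy first-fit-decreasing sketch of the bin-packing admission step (resources simplified to CPU and memory; a real scheduler also weighs GPU type, pool selection, locality, and quota state):

```python
# Toy placement: sort sessions largest-first, put each on the first host with room.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    cpu_free: float
    mem_free_gb: float
    sessions: list = field(default_factory=list)

def place(reqs, hosts):
    unplaced = []
    # Largest first (first-fit decreasing) reduces fragmentation.
    for sid, cpu, mem in sorted(reqs, key=lambda r: (r[1], r[2]), reverse=True):
        for host in hosts:
            if host.cpu_free >= cpu and host.mem_free_gb >= mem:
                host.cpu_free -= cpu
                host.mem_free_gb -= mem
                host.sessions.append(sid)
                break
        else:
            unplaced.append(sid)   # would trigger queueing or the cluster autoscaler
    return unplaced

hosts = [Host("cpu-host-1", 16, 64), Host("cpu-host-2", 16, 64)]
print(place([("s1", 8, 32), ("s2", 4, 16), ("s3", 12, 48)], hosts))   # -> []
```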
---
### 11) Observability + operations
Metrics:
- Session start latency (p50/p95)
- Running sessions, idle sessions, suspend/resume rates
- Output streaming lag, dropped messages
- Host utilization, noisy-neighbor incidents
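A sketch of exporting the first few of these metrics, assuming the `prometheus_client` library; metric names, labels, and buckets are illustrative:

```python
# Illustrative metric definitions in the Session Manager / gateway.
from prometheus_client import Gauge, Histogram, start_http_server

SESSION_START_SECONDS = Histogram(
    "session_start_seconds", "Time from create request to agent ready",
    buckets=(1, 2, 5, 10, 30, 60, 120),
)
SESSIONS_RUNNING = Gauge("sessions_running", "Sessions currently in the Running state")
STREAM_LAG_SECONDS = Gauge("stream_lag_seconds",
                           "Age of the oldest undelivered output event", ["session_id"])

start_http_server(9100)   # expose /metrics for scraping

# Usage in the provisioning path (timer observes create -> agent-ready latency):
# with SESSION_START_SECONDS.time():
#     provision_runtime_and_wait_for_agent()
```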
Logging:
- Audit: who started sessions, image used, network policy applied
- Security signals: denied egress attempts, suspicious syscalls (if instrumented)
Tracing:
- Request path: create session → scheduler → provisioner → agent ready
---
### 12) Common pitfalls / edge cases
- **Cold start** too slow → warm pools, image caching, smaller base images.
- **User installs huge deps** → disk quotas + caching layers.
- **Infinite output loops** → output caps + server-side rate limiting.
- **Disconnected clients** → whether execution continues is a policy choice; many systems keep running with an idle timer.
- **Secrets leakage** → never mount broad cloud credentials; use scoped per-session tokens.
---
### 13) If collaboration is requested (optional extension)
- Separate “document state” (CRDT/OT) from runtime execution.
- One shared runtime per collaborative notebook is hard (contention, permissions); consider per-user runtimes with a shared filesystem, or a single owner-executes model.
This end-to-end design addresses sandbox security, VM/microVM management, and reliable log streaming while keeping the system operable at scale.