Design a Cloud DevBox Platform
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: hard
Interview Round: Technical Screen
Design a cloud **DevBox** platform: a service that gives developers disposable or persistent remote development machines, accessible through a browser, SSH, or IDE plugins (e.g. a VS Code / JetBrains remote tunnel).
The platform should support the full lifecycle of a dev box and the operational needs of an organization:
- **Lifecycle**: users can create, start, stop, pause, resume, and delete dev boxes from predefined templates.
- **Box contents**: each dev box bundles an operating system image, language/toolchain dependencies, a repository checkout, a user home directory, injected secrets, and allocated CPU, memory, storage, and optional GPU.
- **Multi-tenancy**: organizations and teams, quota enforcement, role-based access control (RBAC), audit logs, and cost controls.
- **Quality bar**: optimize for fast provisioning, high availability, scalability to many concurrent dev boxes, strong isolation, reliable persistence, and operational observability.
Walk through requirements, the high-level architecture, the lifecycle/state model, scheduling and provisioning, isolation, storage/persistence, failure handling, and observability/cost. Call out the major tradeoffs and where the system is actually bottlenecked.
```hint Frame the split first
The single most important structural decision is a **control plane / data plane split**. The control plane (API, RBAC, scheduler, metadata DB) makes decisions and holds durable state; the data plane (compute nodes, runtimes, the connection gateway) runs the boxes and carries traffic. Aim for a design where a control-plane outage blocks only *new* mutations and never kills *running* boxes or live sessions.
```
```hint Model lifecycle as desired vs actual state
Don't hand-code each transition imperatively. Give every box a `desired_state` (set by the API) and an `actual_state`, and have an idempotent, leased **reconciler** drive `actual → desired`. This is what makes the system self-healing and safe under retries, duplicate events, and crashed workers. Be precise about **stop** (release compute, keep the workspace disk, cold boot to resume) vs **pause** (snapshot RAM + disk, warm resume with processes intact).
```
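The desired-vs-actual model in the hint above can be sketched as a single idempotent reconcile step. This is a minimal illustration under stated assumptions, not a reference implementation: the `Box` fields, the action names, and the in-memory lease are all hypothetical. A real system would take the lease with a compare-and-set in the metadata DB, and would add transient states and `FAILED`, which are omitted here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    STOPPED = "stopped"  # compute released, workspace disk kept
    RUNNING = "running"
    PAUSED = "paused"    # RAM + disk snapshot kept, compute released


@dataclass
class Box:
    box_id: str
    desired_state: State           # written by the API
    actual_state: State            # written only by the reconciler
    lease_owner: Optional[str] = None
    lease_expires_at: float = 0.0


# (actual, desired) -> action name. Names are illustrative placeholders.
ACTIONS = {
    (State.STOPPED, State.RUNNING): "cold_boot",
    (State.PAUSED, State.RUNNING): "restore_snapshot",
    (State.RUNNING, State.PAUSED): "snapshot_ram_and_disk",
    (State.RUNNING, State.STOPPED): "shutdown_keep_disk",
    (State.PAUSED, State.STOPPED): "discard_ram_snapshot",
}


def try_acquire_lease(box: Box, worker: str, now: float, ttl: float = 30.0) -> bool:
    """Only one worker may act on a box at a time; expired leases can be taken over."""
    held_by_other = box.lease_owner not in (None, worker)
    if held_by_other and box.lease_expires_at > now:
        return False
    box.lease_owner = worker
    box.lease_expires_at = now + ttl
    return True


def reconcile_once(box: Box, worker: str, now: float) -> Optional[str]:
    """Drive actual -> desired. Returns the action taken, or None if nothing was done."""
    if box.actual_state == box.desired_state:
        return None  # already converged: retries and duplicate events are no-ops
    if not try_acquire_lease(box, worker, now):
        return None  # another worker owns this box; try again later
    action = ACTIONS.get((box.actual_state, box.desired_state))
    if action is None:
        raise ValueError(f"no transition {box.actual_state} -> {box.desired_state}")
    # A real reconciler would call the node agent here and only then record success.
    box.actual_state = box.desired_state
    return action  # the lease is left to expire; a real worker would release or renew it
```

Because the step compares states before acting, replaying the same event twice is harmless, and a crashed worker's box is picked up by another worker once the lease expires.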
```hint Provisioning latency is the hard NFR
"Fast provisioning" is where the real engineering hides, and the win comes from *not doing work on the critical path* — pre-doing it, sharing it, or restoring it instead of rebuilding it. Decompose a cold create into its serial costs (boot, image fetch, dependency build, workspace attach) and attack whichever dominates. Identify the actual bottleneck before optimizing — it is *not* the small metadata DB.
```
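One way to make the decomposition in the hint above concrete is a small latency budget. The stage names follow the hint; the durations are hypothetical placeholders chosen only for illustration, not measurements, and the helper names are assumptions.

```python
# Hypothetical serial costs (seconds) of one cold create. Placeholders, not measurements.
COLD_CREATE_S = {
    "boot": 5.0,
    "image_fetch": 40.0,
    "dependency_build": 90.0,
    "workspace_attach": 8.0,
}


def total_latency(stages: dict) -> float:
    """Stages run serially, so the critical path is their sum."""
    return sum(stages.values())


def dominant_stage(stages: dict) -> str:
    """The stage worth attacking first."""
    return max(stages, key=stages.get)


def off_critical_path(stages: dict, removed: set) -> dict:
    """Budget after the given stages are pre-done, shared, or restored instead of rebuilt."""
    return {name: (0.0 if name in removed else cost) for name, cost in stages.items()}


if __name__ == "__main__":
    print(total_latency(COLD_CREATE_S), dominant_stage(COLD_CREATE_S))
    # e.g. a prebuild covers dependency_build and a node-local image cache covers image_fetch
    after = off_critical_path(COLD_CREATE_S, {"dependency_build", "image_fetch"})
    print(total_latency(after), dominant_stage(after))
```

With these placeholder numbers the dependency build dominates, so a prebuild pays off before any boot-time tuning does. With real measurements the ordering may differ, which is the reason to measure first.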
```hint Let the trust model force the isolation choice
Pin down the trust model early. Untrusted, arbitrary user code points you at a hardware boundary (**microVM** / VM), not a shared-kernel container. This one fork shapes the runtime, the node fleet, and the entire security posture, so state it explicitly.
```
### Constraints & Assumptions
State these explicitly; they size the design. Reasonable defaults if the interviewer doesn't specify:
- **Workloads are untrusted**: boxes run arbitrary, possibly hostile user code, so isolation must be a hardware-virtualization boundary with a dedicated guest kernel per box (microVM / VM), not a shared-kernel container.
- **Pause/resume (RAM snapshot) is in scope**, not just stop/start.
- **Multi-region** deployment for latency and resilience.
- **Scale**: on the order of $10^5$ total boxes (e.g. ~200k), of which ~10–20% (roughly 20k–40k at 200k total) run concurrently at peak; thousands of organizations.
- **Provisioning SLO targets**: warm resume in single-digit seconds; cold create of a common template in tens of seconds, not minutes.
- **Availability target**: control-plane HA; a control-plane outage must not terminate running boxes or live sessions.
- **Identity**: an external OIDC/SSO identity provider already exists and is consumed (not designed here).
- **Out of scope**: building the in-box IDE itself, and the billing/invoicing system (the platform emits metering events that feed billing but does not produce invoices).
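The scale assumption above translates into a rough fleet size. The sketch below uses the ~200k total and the 10–20% peak-running range from the list; `BOXES_PER_NODE` is a hypothetical packing density introduced only for illustration and depends on box sizes and node shape.

```python
TOTAL_BOXES = 200_000          # from the scale assumption above
PEAK_RUNNING_PCT = (10, 20)    # percent of boxes running at peak
BOXES_PER_NODE = 20            # hypothetical packing density, not from the source


def peak_running(total: int, pct: int) -> int:
    """Boxes running at peak, using integer math to avoid float rounding."""
    return total * pct // 100


def nodes_needed(running: int, per_node: int) -> int:
    """Ceiling division: a partly filled node still counts as a node."""
    return -(-running // per_node)


if __name__ == "__main__":
    for pct in PEAK_RUNNING_PCT:
        running = peak_running(TOTAL_BOXES, pct)
        print(pct, running, nodes_needed(running, BOXES_PER_NODE))
```

Under these assumptions the peak is 20k–40k running boxes, so compute capacity and per-node image and snapshot traffic scale with tens of thousands of boxes, while the metadata DB only tracks a few hundred thousand rows.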
### Clarifying Questions to Ask
- **Trust model**: are workloads untrusted (arbitrary, possibly hostile code) or trusted internal developers? This is the single biggest design fork.
- Is **pause/resume** (RAM snapshot) required, or only stop/start?
- What is the expected **steady-state concurrency** and the **create rate** (including peak bursts like a Monday-morning login spike)?
- **Single-region or multi-region** for v1?
- Are boxes mostly **persistent** (survive node failure) or **ephemeral/scratch**, and what is the acceptable data-loss bound for each?
- Is there a **GPU** requirement, and what fraction of boxes need it?
### What a Strong Answer Covers
- A clean **control-plane / data-plane separation**, with an explicit statement of what keeps working during a control-plane outage.
- A **desired-state lifecycle model** with a reconciler, leasing/idempotency, and the **stop-vs-pause** distinction handled correctly (including transient states and `FAILED`).
- A **cold-start / provisioning** strategy (warm pools, COW images, image caches, prebuilds, RAM-snapshot resume) tied to the latency SLO.
- A defensible **isolation choice** (microVM vs container) justified by the stated trust model, plus secrets, network, and egress controls.
- A coherent **storage/persistence** model that separates immutable environment data from mutable user data and survives node failure.
- **Multi-tenancy** mechanics: org/team/RBAC, quotas, audit logs, idempotency, and async lifecycle APIs.
- **Scheduling & placement**: filter-then-score with image-cache affinity and tenant anti-affinity.
- **Failure handling**: node loss, provisioning failure, gateway failure, duplicate/out-of-order events.
- **Observability and cost controls** (provisioning-latency metrics, per-tenant metering, idle auto-stop/pause, budget caps).
- Honest identification of where the system is **bottlenecked** (compute capacity, image-pull bandwidth, snapshot/attach latency — not the small metadata DB) and where the optimization effort goes.
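The filter-then-score placement named in the list above can be sketched as follows. This is a minimal illustration: the `Node` and `Request` fields and the score weights are hypothetical, and tenant anti-affinity is interpreted here as spreading one organization's boxes across nodes, which is an assumption rather than something the prompt specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: int
    free_mem_gb: int
    free_gpu: int = 0
    cached_images: set = field(default_factory=set)
    tenants: dict = field(default_factory=dict)  # org_id -> running boxes on this node


@dataclass
class Request:
    org_id: str
    image: str
    cpu: int
    mem_gb: int
    gpu: int = 0


def fits(node: Node, req: Request) -> bool:
    """Filter phase: hard constraints only."""
    return (node.free_cpu >= req.cpu
            and node.free_mem_gb >= req.mem_gb
            and node.free_gpu >= req.gpu)


def score(node: Node, req: Request) -> float:
    """Score phase: soft preferences. Weights are illustrative, not tuned."""
    s = 0.0
    if req.image in node.cached_images:
        s += 10.0                                # image-cache affinity: skip the image pull
    s -= 2.0 * node.tenants.get(req.org_id, 0)   # tenant anti-affinity: spread one org's boxes
    s -= 0.1 * (node.free_cpu - req.cpu)         # prefer tighter fits to limit fragmentation
    return s


def place(nodes: List[Node], req: Request) -> Optional[Node]:
    """Return the best-scoring feasible node, or None if nothing fits."""
    candidates = [n for n in nodes if fits(n, req)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, req))
```

Keeping hard constraints in the filter and preferences in the score means a cache hit can never override a capacity limit, and the weights can be tuned without touching correctness.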
### Follow-up Questions
- How do you implement **pause/resume** so a warm resume hits single-digit seconds at scale, and what does that cost in storage?
- A node dies with multiple running **persistent** boxes on it. Walk through exactly how the system detects this and recovers, and state the data-loss guarantee.
- How do you keep **secrets** out of base images and RAM snapshots while still injecting them at runtime?
- How does the **scheduler** decide where to place a new or resuming box, and how do you avoid both fragmentation and noisy-neighbor contention?
Quick Answer: This question evaluates a candidate's ability to design scalable, secure cloud development platforms, testing competencies in distributed systems architecture, orchestration and scheduling, storage and persistence, multi-tenancy and RBAC, runtime isolation, and operational observability.