PracHub

Design a Cloud DevBox Platform

Last updated: Jun 17, 2026

Quick Overview

This question evaluates a candidate's ability to design scalable, secure cloud development platforms, testing competencies in distributed systems architecture, orchestration and scheduling, storage and persistence, multi-tenancy and RBAC, runtime isolation, and operational observability.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Design a cloud **DevBox** platform: a service that gives developers disposable or persistent remote development machines, accessible through a browser, SSH, or IDE plugins (e.g. a VS Code / JetBrains remote tunnel).

The platform should support the full lifecycle of a dev box and the operational needs of an organization:

- **Lifecycle**: users can create, start, stop, pause, resume, and delete dev boxes from predefined templates.
- **Box contents**: each dev box bundles an operating system image, language/toolchain dependencies, a repository checkout, a user home directory, injected secrets, and allocated CPU, memory, storage, and optional GPU.
- **Multi-tenancy**: organizations and teams, quota enforcement, role-based access control (RBAC), audit logs, and cost controls.
- **Quality bar**: optimize for fast provisioning, high availability, scalability to many concurrent dev boxes, strong isolation, reliable persistence, and operational observability.

Walk through requirements, the high-level architecture, the lifecycle/state model, scheduling and provisioning, isolation, storage/persistence, failure handling, and observability/cost. Call out the major tradeoffs and where the system is actually bottlenecked.

```hint Frame the split first
The single most important structural decision is a **control plane / data plane split**. The control plane (API, RBAC, scheduler, metadata DB) makes decisions and holds durable state; the data plane (compute nodes, runtimes, the connection gateway) runs the boxes and carries traffic. Aim for a design where a control-plane outage blocks only *new* mutations and never kills *running* boxes or live sessions.
```

```hint Model lifecycle as desired vs actual state
Don't hand-code each transition imperatively. Give every box a `desired_state` (set by the API) and an `actual_state`, and have an idempotent, leased **reconciler** drive `actual → desired`.
This is what makes the system self-healing and safe under retries, duplicate events, and crashed workers. Be precise about **stop** (release compute, keep the workspace disk, cold boot to resume) vs **pause** (snapshot RAM + disk, warm resume with processes intact).
```

```hint Provisioning latency is the hard NFR
"Fast provisioning" is where the real engineering hides, and the win comes from *not doing work on the critical path*: pre-doing it, sharing it, or restoring it instead of rebuilding it. Decompose a cold create into its serial costs (boot, image fetch, dependency build, workspace attach) and attack whichever dominates. Identify the actual bottleneck before optimizing; it is *not* the small metadata DB.
```

```hint Let the trust model force the isolation choice
Pin down the trust model early. Untrusted, arbitrary user code points you at a hardware boundary (**microVM** / VM), not a shared-kernel container. This one fork shapes the runtime, the node fleet, and the entire security posture, so state it explicitly.
```

### Constraints & Assumptions

State these explicitly; they size the design. Reasonable defaults if the interviewer doesn't specify:

- **Workloads are untrusted**: boxes run arbitrary, possibly hostile user code, so isolation must be a real kernel/hardware boundary.
- **Pause/resume (RAM snapshot) is in scope**, not just stop/start.
- **Multi-region** deployment for latency and resilience.
- **Scale**: on the order of $10^5$ total boxes (e.g. ~200k), of which a smaller fraction (~10–20%) are running concurrently at peak; thousands of organizations.
- **Provisioning SLO targets**: warm resume in single-digit seconds; cold create of a common template in tens of seconds, not minutes.
- **Availability target**: control-plane HA; a control-plane outage must not terminate running boxes or live sessions.
- **Identity**: an external OIDC/SSO identity provider already exists and is consumed (not designed here).
- **Out of scope**: building the in-box IDE itself, and the billing/invoicing system (you emit metering events that feed it, not invoices).

### Clarifying Questions to Ask

- **Trust model**: are workloads untrusted (arbitrary, possibly hostile code) or trusted internal developers? This is the single biggest design fork.
- Is **pause/resume** (RAM snapshot) required, or only stop/start?
- What is the expected **steady-state concurrency** and the **create rate** (including peak bursts like a Monday-morning login spike)?
- **Single-region or multi-region** for v1?
- Are boxes mostly **persistent** (survive node failure) or **ephemeral/scratch**, and what is the acceptable data-loss bound for each?
- Is there a **GPU** requirement, and what fraction of boxes need it?

### What a Strong Answer Covers

- A clean **control-plane / data-plane separation**, with an explicit statement of what keeps working during a control-plane outage.
- A **desired-state lifecycle model** with a reconciler, leasing/idempotency, and the **stop-vs-pause** distinction handled correctly (including transient states and `FAILED`).
- A **cold-start / provisioning** strategy (warm pools, COW images, image caches, prebuilds, RAM-snapshot resume) tied to the latency SLO.
- A defensible **isolation choice** (microVM vs container) justified by the stated trust model, plus secrets, network, and egress controls.
- A coherent **storage/persistence** model that separates immutable environment data from mutable user data and survives node failure.
- **Multi-tenancy** mechanics: org/team/RBAC, quotas, audit logs, idempotency, and async lifecycle APIs.
- **Scheduling & placement**: filter-then-score with image-cache affinity and tenant anti-affinity.
- **Failure handling**: node loss, provisioning failure, gateway failure, duplicate/out-of-order events.
- **Observability and cost controls** (provisioning-latency metrics, per-tenant metering, idle auto-stop/pause, budget caps).
- Honest identification of where the system is **bottlenecked** (compute capacity, image-pull bandwidth, snapshot/attach latency; not the small metadata DB) and where the optimization effort goes.

### Follow-up Questions

- How do you implement **pause/resume** so a warm resume hits single-digit seconds at scale, and what does that cost in storage?
- A node dies with multiple running **persistent** boxes on it. Walk through exactly how the system detects this and recovers, and state the data-loss guarantee.
- How do you keep **secrets** out of base images and RAM snapshots while still injecting them at runtime?
- How does the **scheduler** decide where to place a new or resuming box, and how do you avoid both fragmentation and noisy-neighbor contention?
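
The desired-vs-actual lifecycle model from the hints above can be sketched as a single reconciler step. This is a minimal illustration under stated assumptions, not a reference implementation: the `State` values, the `Box` dataclass, the `ACTIONS` table, and `next_action` are all hypothetical names, and leasing, persistence, and the real runtime calls are left out. It shows one pure function choosing the next step from `(actual_state, desired_state)`, with stop and pause taking different resume paths.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"   # compute released, workspace disk kept
    PAUSED = "paused"     # RAM + disk snapshot kept
    DELETED = "deleted"
    FAILED = "failed"


@dataclass
class Box:
    box_id: str
    actual_state: State
    desired_state: State


# Hypothetical action table: (actual, desired) -> name of the next step.
ACTIONS = {
    (State.STOPPED, State.RUNNING): "cold_boot",
    (State.PAUSED, State.RUNNING): "restore_snapshot",
    (State.RUNNING, State.STOPPED): "shutdown_keep_disk",
    (State.RUNNING, State.PAUSED): "snapshot_ram_and_disk",
    (State.PAUSED, State.STOPPED): "discard_ram_snapshot",
}


def next_action(box: Box) -> str:
    """Return the single next step that moves actual toward desired.

    The function is pure, so a retried or duplicated event for the same
    box computes the same step.
    """
    if box.actual_state == box.desired_state:
        return "noop"
    if box.desired_state == State.DELETED:
        return "release_all_resources"
    if box.actual_state == State.FAILED:
        return "cleanup_then_retry"
    return ACTIONS.get((box.actual_state, box.desired_state), "unsupported")
```

In a fuller design, a worker would hold a lease on the box, run the returned step, record the new `actual_state`, and call `next_action` again until it returns `"noop"`.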

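The filter-then-score placement named under What a Strong Answer Covers might look like the sketch below. The `Node` fields, the score weights, and the `place` function are assumptions made for illustration; a real scheduler would also consider GPUs, regions, and storage locality. It shows hard constraints applied first, then a score combining image-cache affinity, tenant anti-affinity, and a mild preference for tighter fits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu: int
    free_mem_gb: int
    cached_images: set = field(default_factory=set)
    tenants: set = field(default_factory=set)


def place(nodes, cpu, mem_gb, image, tenant) -> Optional[str]:
    """Filter nodes that fit, then score the survivors; highest score wins."""
    # Filter step: hard resource constraints only.
    fits = [n for n in nodes if n.free_cpu >= cpu and n.free_mem_gb >= mem_gb]
    if not fits:
        return None

    def score(n: Node) -> float:
        s = 0.0
        if image in n.cached_images:
            s += 10.0                  # image-cache affinity: no image pull needed
        if tenant in n.tenants:
            s -= 3.0                   # tenant anti-affinity: spread one tenant's boxes
        s -= 0.1 * (n.free_cpu - cpu)  # prefer tighter fits to limit fragmentation
        return s

    return max(fits, key=score).name


nodes = [
    Node("a", 8, 32, {"py311"}, {"acme"}),
    Node("b", 8, 32),
    Node("c", 2, 4, {"py311"}),
]
print(place(nodes, 4, 16, "py311", "acme"))  # a
print(place(nodes, 4, 16, "go", "acme"))     # b
```

The weights here are arbitrary. The point is the shape: the cache hit outweighs the anti-affinity penalty in the first call, while in the second call no node has the image cached, so anti-affinity decides.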

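The scale assumption above implies a concrete peak-concurrency range. The helper below only does that arithmetic from the stated figures (~200k total boxes, ~10–20% running at peak). The `peak_running` name and the per-box vCPU figure are hypothetical, included to show how the range might feed a capacity estimate.

```python
def peak_running(total_boxes: int, low_frac: float, high_frac: float) -> tuple:
    """Range of concurrently running boxes at peak."""
    return round(total_boxes * low_frac), round(total_boxes * high_frac)


low, high = peak_running(200_000, 0.10, 0.20)
print(low, high)  # 20000 40000

# Hypothetical sizing: 4 vCPUs per running box, no overcommit.
VCPUS_PER_BOX = 4
print(low * VCPUS_PER_BOX, high * VCPUS_PER_BOX)  # 80000 160000
```

Tens of thousands of running boxes is a compute-fleet problem. The metadata for 200k boxes is comparatively small, which is consistent with the claim that the metadata DB is not the bottleneck.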
