PracHub

Design a multi-tenant CI/CD platform

Last updated: Jun 15, 2026

Quick Overview

This OpenAI software-engineering system-design screen asks you to design a scalable, multi-tenant CI/CD platform. It spans Git/event intake, YAML workflow parsing into a job DAG, fair scheduling and runner isolation, live log and status streaming, per-tenant quotas and billing, observability, audit logging, disaster recovery, and zero-downtime deploys.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen

Related Interview Questions

  • Design Video Generation Orchestration - OpenAI (medium)
  • Design CI/CD Build Caching - OpenAI
  • Design an Instagram-like Feed System - OpenAI (medium)
  • Design Online Chess Matchmaking - OpenAI (hard)
  • Design Android MVVM API Architecture - OpenAI (medium)
Sep 6, 2025, 12:00 AM
Question

Design a multi-tenant CI/CD system that runs pipelines triggered by Git activity, isolates tenants from one another, and scales to thousands of organizations. Walk through the architecture end to end, then go deep on the operational concerns below.

  1. Event intake / SCM integration. The system receives push events through an internal service API (or SCM webhooks), including the repository ID, commit SHA, branch, and current repository state. Describe validation, authentication, deduplication, persistence, and fan-out. How do you apply backpressure when downstream is slow?
  2. Workflow definition, parsing, and planning. Workflows are defined as a sequence (or DAG) of jobs in a single YAML file at a fixed path in each repo. Explain how you fetch the file at the pushed commit, validate the schema, build the job DAG (including matrix expansion), and persist an immutable run plan.
  3. Orchestration and execution semantics. Define the run/job state machine, dependency handling, conditionals/matrices, retries with backoff, timeouts, idempotency/dedup, cancellations, and approval gates.
  4. Scheduling and fairness. How do you schedule jobs across tenants fairly (e.g., weighted fair queuing), enforce per-tenant concurrency caps and quotas, and protect against noisy neighbors?
  5. Build runners / executors and isolation. Compare runner form factors (containers, microVMs, dedicated pools). How do you fetch source at the commit, inject scoped secrets, restore caches, control egress, and recover from runner crashes via heartbeats and leases?
  6. Live logs and status streaming. Users must be able to view job output and execution status while runs are in progress. Describe streaming from runner to control plane and from control plane to the browser, with resumable cursors and retention.
  7. Storage model. Lay out the metadata database, object store (logs/artifacts), message bus, cache, and secret store.
  8. Tenant isolation, RBAC, and security. Cover namespaces, per-tenant runners, network and data isolation (e.g., row-level security), KMS-backed secrets, RBAC roles/scopes, and a policy engine.
  9. Scaling. Autoscaling runners, horizontal scaling of a stateless control plane, scheduler sharding, DB read replicas/partitioning, and optional multi-region.
  10. Cost tracking and billing per tenant. Which meters do you capture (CPU/RAM/GPU-seconds, storage GB-days, egress), how do you aggregate them, and how do budgets/alerts work?
  11. Observability and audit. Structured logs, metrics, distributed traces, and a tamper-evident audit log.
  12. Fault tolerance, disaster recovery, and zero-downtime deployments. At-least-once event processing with idempotent consumers, backups/PITR, region failover, and backward-compatible (expand-and-contract) migrations.
  13. Trade-offs and rollout. Shared vs. dedicated runners, managed vs. self-hosted agents, and a phased deployment plan.
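
As an illustration of item 2, the sketch below shows one way matrix expansion and DAG construction could fit together. It is a minimal example under stated assumptions, not a reference answer: the workflow dictionary shape, the `needs` and `matrix` keys, the `job[key=value]` naming scheme, and the helper names `expand_matrix` and `build_plan` are all hypothetical. Ordering and cycle detection use Python's standard-library `graphlib`.

```python
from graphlib import TopologicalSorter
from itertools import product


def expand_matrix(job_name, matrix):
    """Return one concrete job ID per matrix combination.

    A job without a matrix expands to just its own name.
    The 'name[k=v,...]' ID format is an assumption for this sketch.
    """
    if not matrix:
        return [job_name]
    keys = sorted(matrix)  # sort keys so IDs are deterministic
    return [
        job_name + "[" + ",".join(f"{k}={v}" for k, v in zip(keys, combo)) + "]"
        for combo in product(*(matrix[k] for k in keys))
    ]


def build_plan(workflow):
    """Expand matrices, wire dependencies, and return (dag, order).

    dag maps each concrete job ID to the set of job IDs it waits on.
    Raises ValueError for an unknown dependency and graphlib.CycleError
    (a ValueError subclass) if the 'needs' edges form a cycle.
    """
    expanded = {
        name: expand_matrix(name, spec.get("matrix"))
        for name, spec in workflow.items()
    }
    dag = {}
    for name, spec in workflow.items():
        deps = set()
        for need in spec.get("needs", []):
            if need not in expanded:
                raise ValueError(f"job {name!r} needs unknown job {need!r}")
            # A dependent job waits on every expanded variant of its dependency.
            deps.update(expanded[need])
        for job_id in expanded[name]:
            dag[job_id] = set(deps)
    order = list(TopologicalSorter(dag).static_order())
    return dag, order


# Hypothetical parsed workflow (what a YAML loader might hand back).
workflow = {
    "build": {"matrix": {"os": ["linux", "mac"], "py": ["3.11", "3.12"]}},
    "lint": {},
    "deploy": {"needs": ["build", "lint"]},
}
dag, order = build_plan(workflow)
```

Here `build` expands into four concrete jobs, and `deploy` waits on all four plus `lint`. Persisting `dag` keyed by the commit SHA would be one way to get the immutable run plan the question asks for, since later edits to the YAML file would then produce a new plan rather than mutate an existing one.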

