##### Question

Design a multi-tenant CI/CD system that runs pipelines triggered by Git activity, isolates tenants from one another, and scales to thousands of organizations. Walk through the architecture end to end, then go deep on the operational concerns below.

1. **Event intake / SCM integration.** The system receives push events through an internal service API (or SCM webhooks), including the repository ID, commit SHA, branch, and current repository state. Describe validation, authentication, deduplication, persistence, and fan-out. How do you apply backpressure when downstream is slow?
2. **Workflow definition, parsing, and planning.** Workflows are defined as a sequence (or DAG) of jobs in a single YAML file at a fixed path in each repo. Explain how you fetch the file at the pushed commit, validate the schema, build the job DAG (including matrix expansion), and persist an immutable run plan.
3. **Orchestration and execution semantics.** Define the run/job state machine, dependency handling, conditionals/matrices, retries with backoff, timeouts, idempotency/dedup, cancellations, and approval gates.
4. **Scheduling and fairness.** How do you schedule jobs across tenants fairly (e.g., weighted fair queuing), enforce per-tenant concurrency caps and quotas, and protect against noisy neighbors?
5. **Build runners / executors and isolation.** Compare runner form factors (containers, microVMs, dedicated pools). How do you fetch source at the commit, inject scoped secrets, restore caches, control egress, and recover from runner crashes via heartbeats and leases?
6. **Live logs and status streaming.** Users must be able to view job output and execution status while runs are in progress. Describe streaming from runner to control plane and from control plane to the browser, with resumable cursors and retention.
7. **Storage model.** Lay out the metadata database, object store (logs/artifacts), message bus, cache, and secret store.
8. **Tenant isolation, RBAC, and security.** Cover namespaces, per-tenant runners, network and data isolation (e.g., row-level security), KMS-backed secrets, RBAC roles/scopes, and a policy engine.
9. **Scaling.** Autoscaling runners, horizontal scaling of a stateless control plane, scheduler sharding, DB read replicas/partitioning, and optional multi-region.
10. **Cost tracking and billing per tenant.** Which meters do you capture (CPU/RAM/GPU-seconds, storage GB-days, egress), how do you aggregate them, and how do budgets/alerts work?
11. **Observability and audit.** Structured logs, metrics, distributed traces, and a tamper-evident audit log.
12. **Fault tolerance, disaster recovery, and zero-downtime deployments.** At-least-once event processing with idempotent consumers, backups/PITR, region failover, and backward-compatible (expand-and-contract) migrations.
13. **Trade-offs and rollout.** Shared vs. dedicated runners, managed vs. self-hosted agents, and a phased deployment plan.
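To make the planning step in item 2 concrete, here is a minimal sketch of matrix expansion followed by DAG ordering. It is one possible approach, not a reference answer. The workflow schema (`needs`, `matrix`), the `name[key=value]` naming scheme, and the function names `expand_matrix` and `plan` are assumptions made for illustration; the question does not specify them. It relies on Python's standard `graphlib` module (3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from itertools import product


def expand_matrix(job_name, matrix):
    """Return one concrete job name per combination of matrix values."""
    if not matrix:
        return [job_name]
    keys = sorted(matrix)  # sort keys so expansion is deterministic
    combos = product(*(matrix[k] for k in keys))
    return [
        job_name + "[" + ",".join(f"{k}={v}" for k, v in zip(keys, combo)) + "]"
        for combo in combos
    ]


def plan(workflow):
    """Expand matrices, then return job instances in a valid execution order.

    `workflow` maps job name -> {"needs": [...], "matrix": {...}}
    (a hypothetical schema). Raises ValueError on unknown deps or cycles.
    """
    expanded = {
        name: expand_matrix(name, spec.get("matrix"))
        for name, spec in workflow.items()
    }
    graph = {}
    for name, spec in workflow.items():
        needs = spec.get("needs", [])
        for need in needs:
            if need not in workflow:
                raise ValueError(f"job {name!r} needs unknown job {need!r}")
        # Assumption: every instance waits for all instances of each dependency.
        deps = {instance for need in needs for instance in expanded[need]}
        for instance in expanded[name]:
            graph[instance] = deps
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

Calling `plan({"build": {"matrix": {"os": ["linux", "mac"]}}, "test": {"needs": ["build"]}})` yields both `build[...]` instances before `test`. In an interview answer, the ordered list (plus the commit SHA it was derived from) is what you would persist as the immutable run plan.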

This is an OpenAI software-engineering system-design screen that asks you to design a scalable, multi-tenant CI/CD platform. It spans Git/event intake, YAML workflow parsing into a job DAG, fair scheduling and runner isolation, live log and status streaming, per-tenant quotas and billing, observability, audit logging, disaster recovery, and zero-downtime deploys.
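Fair scheduling with per-tenant quotas is often the part candidates find hardest to make concrete, so a small sketch may help. The following is a stride-style approximation of weighted fair queuing with a concurrency cap per tenant, held entirely in memory. The class name `FairScheduler`, its methods, and the weight/cap parameters are hypothetical; a production scheduler would also need persistence, sharding, and job-size awareness, none of which are shown here.

```python
from collections import deque


class FairScheduler:
    """In-memory sketch: lowest virtual time wins, scaled by tenant weight."""

    def __init__(self):
        self.weights = {}   # tenant -> relative share
        self.caps = {}      # tenant -> max concurrent jobs
        self.queues = {}    # tenant -> pending jobs (FIFO)
        self.running = {}   # tenant -> currently running job count
        self.vtime = {}     # tenant -> virtual time consumed so far
        self.clock = 0.0    # virtual time of the most recent dispatch

    def add_tenant(self, tenant, weight=1.0, cap=1):
        self.weights[tenant] = weight
        self.caps[tenant] = cap
        self.queues[tenant] = deque()
        self.running[tenant] = 0
        self.vtime[tenant] = self.clock

    def submit(self, tenant, job):
        # A tenant returning from idle does not get credit for time away.
        if not self.queues[tenant] and self.running[tenant] == 0:
            self.vtime[tenant] = max(self.vtime[tenant], self.clock)
        self.queues[tenant].append(job)

    def next_job(self):
        """Return (tenant, job), or None if nothing is eligible to run."""
        eligible = [
            t for t, q in self.queues.items()
            if q and self.running[t] < self.caps[t]
        ]
        if not eligible:
            return None
        tenant = min(eligible, key=lambda t: (self.vtime[t], t))
        self.clock = self.vtime[tenant]
        # Higher weight means virtual time advances more slowly per job.
        self.vtime[tenant] += 1.0 / self.weights[tenant]
        self.running[tenant] += 1
        return tenant, self.queues[tenant].popleft()

    def finish(self, tenant):
        self.running[tenant] -= 1
```

With two backlogged tenants at weights 2 and 1 and generous caps, dispatches settle into a 2:1 ratio; a tenant at its cap is skipped until `finish` is called, which is one simple way to contain a noisy neighbor.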

What difficulty level is this interview question?

This is a hard difficulty System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at OpenAI during technical interviews.

Design a multi-tenant CI/CD platform | OpenAI Interview Question
