PracHub

Design a CI/CD pipeline with scheduler

Last updated: Jun 15, 2026

Quick Overview

An OpenAI software-engineering system design interview question: design a production-grade CI/CD pipeline for a large microservices monorepo, end to end. It spans source-control triggers, build and artifact management, layered testing with flaky-test handling, progressive deployment and rollback, observability, supply-chain security, and — the differentiator — the job scheduler that orchestrates pipeline steps (DAG modeling, concurrency, priorities, retries, caching, rate limiting) plus its data stores, queues, APIs, schemas, multi-tenant fairness, KPIs/SLOs, and failure handling.

Company: OpenAI

Role: Software Engineer

Category: System Design

Difficulty: hard

Interview Round: Technical Screen



Aug 13, 2025, 12:00 AM

Design a production-grade CI/CD pipeline (with its job scheduler)

You are designing the internal CI/CD platform for a large engineering organization that runs dozens of microservices out of a single monorepo. Multiple teams share the platform, so it must be safe, fair, and trustworthy as a Tier-0 internal dependency.

Walk through the end-to-end architecture and justify the key design decisions at each layer below. Aim for senior-level depth: explain mechanisms and trade-offs, not just a list of tools.

Constraints & Assumptions

Anchor your design to a concrete scale so capacity, fairness, and SLO decisions are grounded. Unless you negotiate otherwise with the interviewer, assume:

  • Tenants & scale: ~50 teams (tenants), ~5,000 engineers, dozens of microservices in a single monorepo, and roughly 20k jobs/day with sharp peaks clustered around merge and CI-trigger times.
  • Topology: containerized workloads on Kubernetes; a Git provider reachable via webhooks and a rate-limited API.
  • Reliability bar: the platform is Tier-0 — engineers cannot ship if it is down, so it must degrade gracefully when a dependency (registry, Git provider, object store) is impaired.
  • Delivery bar: a typical PR should reach a deploy decision in tens of minutes, not hours.
  • Security bar: least privilege, no long-lived static cloud keys in CI, and an immutable audit trail of who triggered/approved/deployed what.

State any assumption you add explicitly; the interviewer cares more about consistent reasoning from your stated numbers than the specific figures.
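
As a rough illustration of reasoning from the stated numbers, the sketch below sizes the runner fleet with Little's law (concurrent jobs = arrival rate × service time). Only the ~20k jobs/day figure comes from the constraints above; the average job duration, peak factor, and target utilization are assumptions chosen for the example, and the function name is hypothetical.

```python
import math


def runners_needed(jobs_per_day, avg_job_minutes, peak_factor, target_utilization):
    """Estimate peak runner count via Little's law.

    concurrent jobs = arrival rate (jobs/min) * average service time (min)
    """
    avg_arrivals_per_min = jobs_per_day / (24 * 60)
    # Assumption: peak arrival rate is a fixed multiple of the daily average.
    peak_arrivals_per_min = avg_arrivals_per_min * peak_factor
    concurrent_jobs = peak_arrivals_per_min * avg_job_minutes
    # Leave headroom so queue wait stays bounded during bursts.
    return math.ceil(concurrent_jobs / target_utilization)


# 20k jobs/day is from the prompt; 6 min, 5x, and 70% are illustrative assumptions.
estimate = runners_needed(20_000, avg_job_minutes=6, peak_factor=5, target_utilization=0.7)
print(estimate)  # 596
```

Under these assumed inputs the peak fleet is on the order of 600 runners. Changing any assumption shifts the answer proportionally, which is why stating the inputs explicitly matters more than the final figure.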

Clarifying Questions to Ask

Before diving in, scope the problem with the interviewer:

  • Scale & traffic shape — how many tenants, jobs/day, and how peaky is the load around merge times? What is the largest single pipeline (number of jobs / fan-out width)?
  • Latency & SLO targets — what is the acceptable PR-to-deploy-decision time, queue-wait p95, and platform availability target (this is a Tier-0 dependency)?
  • Multi-tenancy & trust — do tenants trust each other, or must we assume hostile/buggy jobs (affecting isolation, secret handling, and fork-PR policy)?
  • Deployment surface — are we deploying to a single cluster/region or multi-region, and is there a regulated/on-prem segment that needs long-lived release branches?
  • Build/test characteristics — polyglot stacks? Is there an existing graph-aware build system (Bazel/Nx/Gradle) we can lean on for change detection, or only path globs?
  • Compliance — what audit, provenance, and separation-of-duties requirements must we satisfy (e.g. two-person approval, SBOM, signed artifacts)?
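
The build/test question above matters because the fallback, path globs, is simple but coarse. A minimal sketch of glob-based change detection follows, assuming a hypothetical ownership map (`SERVICE_PATHS`) and a set of global paths that invalidate everything; a graph-aware build system would instead derive affected targets from the real dependency graph.

```python
from fnmatch import fnmatchcase

# Hypothetical ownership map: service name -> path globs that should trigger it.
SERVICE_PATHS = {
    "billing": ["services/billing/*", "libs/payments/*"],
    "search": ["services/search/*", "libs/indexing/*"],
    "gateway": ["services/gateway/*"],
}

# Assumption: changes under these paths rebuild every service (shared CI/build config).
GLOBAL_PATHS = ["build/*", "ci/*"]


def affected_services(changed_files):
    """Return the set of services whose globs match at least one changed file."""
    if any(fnmatchcase(f, g) for f in changed_files for g in GLOBAL_PATHS):
        return set(SERVICE_PATHS)
    return {
        service
        for service, globs in SERVICE_PATHS.items()
        if any(fnmatchcase(f, g) for f in changed_files for g in globs)
    }


print(affected_services(["services/billing/src/main.py"]))
print(affected_services(["ci/pipeline.yaml"]))
```

Note that `fnmatchcase` lets `*` match across `/`, so `services/billing/*` covers nested files. The weakness of this approach is that the map must be maintained by hand: an undeclared dependency between two services silently skips a needed build, which is the main argument for graph-aware tooling when it exists.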

What a Strong Answer Covers

What to cover

  1. Source control & triggers — version-control integration, branch strategy (trunk-based vs. release branches), protected branches, merge queues, and triggers (push, PR, tag/release, schedule, manual). Include monorepo path filters / change detection so only affected services build.
  2. Build orchestration & artifact management — ephemeral runners, pipeline-as-code, dependency and layer caching, reproducible/hermetic builds, container registry and artifact repository, and retention/GC policies.
  3. Automated testing — unit, integration, and end-to-end stages; test parallelism and sharding; ephemeral per-PR environments; test-impact analysis; and how you handle flaky / nondeterministic tests (retries, quarantine, determinism).
  4. Containerization vs. VM builds & reproducible environments — when to use each, and how you guarantee reproducibility.
  5. Deployment strategies — rolling, blue-green, canary / progressive delivery, and feature flags; plus backward-compatible database migrations (expand/contract).
  6. Rollback, observability & incident response — automated metric-gated rollback; metrics, logs, and traces; release annotations; alerting and on-call hooks.
  7. Security, secrets & compliance — SAST/DAST/SCA, secret scanning, SBOM generation, artifact signing and provenance (SLSA-style), policy-as-code gates, short-lived OIDC credentials, and audit/compliance trails.
  8. The job scheduler / orchestrator (core — design and implement it). This is the centerpiece. Be concrete:
    • DAG modeling of pipeline steps (dependencies, conditional edges, fan-out/fan-in).
    • Concurrency limits, priorities, and fairness across tenants.
    • Retries with exponential backoff, caching, and rate limiting.
    • The supporting components (data stores, queues, coordination/locks) and why.
    • Concrete APIs and schemas for pipelines, jobs, and logs.
  9. Scalability, multi-tenant isolation & fairness — horizontal scaling of runners and the control plane, per-tenant quotas/budgets, and starvation prevention.
  10. Cost efficiency & high availability — spot capacity, caching, retention, and HA/DR of the CI/CD platform itself.
  11. KPIs / SLOs & on-call playbook — the delivery and platform metrics you would track (e.g., DORA) and the runbook for a failed deployment.
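
For item 8, one possible starting point is sketched below: an in-memory scheduler that models the pipeline as a DAG, dispatches jobs whose dependencies are done, enforces a per-tenant concurrency cap, and orders ready jobs by priority. It also includes a full-jitter exponential backoff helper for retries. All names (`Job`, `DagScheduler`, `backoff_delay`) and the "lower number is more urgent" convention are assumptions for illustration; a real design would persist state in a database and dispatch through a queue.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    tenant: str
    priority: int = 0  # assumed convention: lower value = more urgent
    deps: set = field(default_factory=set)  # job_ids that must finish first


def backoff_delay(attempt, base=2.0, cap=300.0):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class DagScheduler:
    def __init__(self, jobs, per_tenant_limit):
        self.jobs = {j.job_id: j for j in jobs}
        self.per_tenant_limit = per_tenant_limit
        self.done = set()
        self.running = set()
        self.running_per_tenant = defaultdict(int)

    def ready(self):
        """Jobs not yet started whose dependencies have all completed."""
        return [
            j for j in self.jobs.values()
            if j.job_id not in self.done
            and j.job_id not in self.running
            and j.deps <= self.done
        ]

    def dispatch(self):
        """Start ready jobs, most urgent first, without exceeding the tenant cap."""
        started = []
        for job in sorted(self.ready(), key=lambda j: (j.priority, j.job_id)):
            if self.running_per_tenant[job.tenant] < self.per_tenant_limit:
                self.running.add(job.job_id)
                self.running_per_tenant[job.tenant] += 1
                started.append(job.job_id)
        return started

    def complete(self, job_id):
        """Mark a running job as succeeded, freeing its tenant slot."""
        self.running.remove(job_id)
        self.running_per_tenant[self.jobs[job_id].tenant] -= 1
        self.done.add(job_id)


pipeline = [
    Job("build", "team-a"),
    Job("test-1", "team-a", deps={"build"}),
    Job("test-2", "team-a", deps={"build"}),
    Job("deploy", "team-a", deps={"test-1", "test-2"}),  # fan-in
    Job("lint", "team-b"),
]
scheduler = DagScheduler(pipeline, per_tenant_limit=1)
first_wave = scheduler.dispatch()
print(first_wave)  # ['build', 'lint']
```

The sketch omits cycle detection, failed-job handling, and cross-tenant weighting. In an interview answer those are the natural next steps, along with where the `done` and `running` sets live once there is more than one scheduler replica.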

Also discuss

  • Key trade-offs: velocity vs. control, monorepo vs. multirepo, and push-based vs. GitOps.
  • Failure scenarios and their mitigations across the system.

Follow-up Questions

Be ready for the interviewer to probe deeper:

  • What breaks first at 10x the job rate — the scheduler control loop, the metadata DB, the queue, or the Git provider's rate limit? How do you find out, and how do you mitigate the bottleneck you named?
  • A scheduler replica GC-pauses for several seconds while holding a "lock" and another replica takes over — how does your design prevent a double-dispatch or a double-deploy?
  • One tenant floods the queue with thousands of jobs. Walk through exactly how a second tenant's urgent prod-hotfix still gets scheduled promptly.
  • A registry/object-store outage hits mid-deploy. Which flows must keep working, which can fail closed, and how does the platform degrade gracefully as a Tier-0 dependency?
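
The GC-pause follow-up is commonly answered with fencing tokens plus idempotent dispatch: the lease store hands out a monotonically increasing token, and the downstream side rejects any request carrying an older token than one it has already seen. The sketch below is an in-memory illustration under that assumption; `LeaseStore` and `Dispatcher` are hypothetical names, and in practice the lease would live in a linearizable store and the check would sit in the runner pool or deploy API.

```python
class LeaseStore:
    """In-memory stand-in for a linearizable lease store (an assumption)."""

    def __init__(self):
        self.token = 0
        self.holder = None

    def acquire(self, replica):
        """Grant leadership and return a strictly increasing fencing token."""
        self.token += 1
        self.holder = replica
        return self.token


class Dispatcher:
    """Downstream side that enforces fencing and idempotency."""

    def __init__(self):
        self.highest_token = 0
        self.dispatched = {}  # job_id -> token that dispatched it

    def dispatch(self, job_id, token):
        if token < self.highest_token:
            return False  # stale leader: a newer token has already been seen
        self.highest_token = token
        if job_id in self.dispatched:
            return False  # idempotency key already used: no double dispatch
        self.dispatched[job_id] = token
        return True


leases = LeaseStore()
dispatcher = Dispatcher()

token_a = leases.acquire("replica-a")   # replica A becomes leader, then pauses
token_b = leases.acquire("replica-b")   # lease expires, replica B takes over
results = [
    dispatcher.dispatch("job-42", token_b),  # B dispatches the job
    dispatcher.dispatch("job-42", token_a),  # A wakes up and retries: rejected
    dispatcher.dispatch("job-43", token_a),  # any write from A is rejected
]
print(results)  # [True, False, False]
```

The key design choice is that correctness does not depend on the paused replica noticing it lost the lease. The receiver enforces the ordering, so a lock that has silently expired cannot cause a second dispatch or deploy.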
