Design a CI/CD pipeline with a scheduler
Company: OpenAI
Role: Software Engineer
Category: System Design
Difficulty: Hard
Interview Round: Technical Screen
##### Question
Design a production-grade CI/CD pipeline for a large engineering organization running dozens of microservices in a monorepo. Walk through the end-to-end architecture and design decisions for each layer:
1. **Source control & triggers** — version-control integration, branch strategy (trunk-based vs. release branches), protected branches, merge queues, and triggers (push, PR, tag/release, schedule, manual). Include monorepo path filters / change detection for affected services.
2. **Build orchestration & artifact management** — ephemeral runners, pipeline-as-code, dependency and layer caching, reproducible/hermetic builds, container registry and artifact repository, retention/GC policies.
3. **Automated testing** — unit, integration, and end-to-end stages; test parallelism and sharding; ephemeral per-PR environments; test-impact analysis; and how you handle flaky / nondeterministic tests (retries, quarantine, determinism).
4. **Containerization vs. VM builds & reproducible environments** — when to use each and how to guarantee reproducibility.
5. **Deployment strategies** — rolling, blue-green, canary / progressive delivery, and feature flags; backward-compatible database migrations (expand/contract).
6. **Rollback, observability & incident response** — automated metric-gated rollback, metrics/logs/traces, release annotations, alerting and on-call hooks.
7. **Security, secrets & compliance** — SAST/DAST/SCA, secret scanning, SBOM generation, artifact signing and provenance (SLSA-style), policy-as-code gates, short-lived OIDC credentials, and audit/compliance trails.
8. **The job scheduler / orchestrator** — design and implement the scheduler that orchestrates pipeline steps. Cover dependency-graph (DAG) modeling, concurrency limits, priorities and fairness, retries with exponential backoff, caching, and rate limiting. Specify the supporting components (data stores, queues, coordination/locks) and provide APIs and schemas for pipelines, jobs, and logs.
9. **Scalability, multi-tenant isolation & fairness** — horizontal scaling of runners and the control plane, per-tenant quotas/budgets, and starvation prevention.
10. **Cost efficiency & high availability** — spot capacity, caching, retention, and HA/DR of the CI/CD platform itself.
11. **KPIs / SLOs & on-call playbook** — the delivery and platform metrics you would track (e.g., DORA) and the runbook for a failed deployment.
Discuss the key trade-offs (velocity vs. control, monorepo vs. multirepo, push-based vs. GitOps) and walk through failure scenarios and their mitigations.
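As a starting point for the DAG modeling and concurrency limits asked for in item 8, the following is a minimal sketch, not a production scheduler. It uses the standard-library `graphlib.TopologicalSorter` to group jobs into batches that respect dependencies and a per-batch parallelism cap. The function name `plan_waves`, the `max_parallel` parameter, and the example job names are hypothetical, chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def plan_waves(deps: dict[str, set[str]], max_parallel: int) -> list[list[str]]:
    """Group jobs into batches of at most `max_parallel` jobs.

    `deps` maps each job to the set of jobs it depends on. A job is only
    placed in a batch after every one of its dependencies has appeared in
    an earlier batch. Raises CycleError if the graph is not a DAG.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # validates the graph; raises CycleError on a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        # Sort for a deterministic order; get_ready() makes no ordering promise.
        ready = sorted(sorter.get_ready())
        for i in range(0, len(ready), max_parallel):
            waves.append(ready[i:i + max_parallel])
        sorter.done(*ready)  # unblock the jobs that depend on this batch
    return waves


# Hypothetical pipeline: job -> jobs it depends on.
example_deps = {
    "build": {"checkout"},
    "lint": {"checkout"},
    "unit": {"build"},
    "integration": {"build"},
    "deploy": {"unit", "integration", "lint"},
}
example_plan = plan_waves(example_deps, max_parallel=2)
```

Batching into waves is a simplification: a real scheduler would more likely dispatch each job as soon as its own dependencies finish, and would layer priorities, fairness, and persistence on top of this ordering.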
Quick Answer: This OpenAI software-engineering system design question asks for an end-to-end, production-grade CI/CD pipeline for a large microservices monorepo. It spans source-control triggers, build and artifact management, layered testing with flaky-test handling, progressive deployment and rollback, observability, and supply-chain security. The differentiator is the job scheduler that orchestrates pipeline steps (DAG modeling, concurrency, priorities, retries, caching, rate limiting), together with its data stores, queues, APIs, schemas, multi-tenant fairness, KPIs/SLOs, and failure handling.
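The retry-with-exponential-backoff requirement could be sketched as below. This is one common variant (a capped exponential ceiling with full jitter), offered as an assumption rather than a prescribed answer. The names `backoff_delay` and `run_with_retries`, and the default `base`, `cap`, and `max_attempts` values, are hypothetical. The `sleep` and `rng` parameters are injectable so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Return a jittered delay in seconds for a 0-based attempt number.

    The ceiling doubles per attempt (base * 2**attempt) up to `cap`; the
    delay is a random fraction of that ceiling, which spreads out retries
    from many jobs that failed at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def run_with_retries(job: Callable[[], T], max_attempts: int = 4,
                     sleep: Callable[[float], None] = time.sleep,
                     rng: Callable[[], float] = random.random) -> T:
    """Run `job`, retrying on any exception, and re-raise the last failure."""
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be at least 1")
```

Retrying on every exception is a deliberate simplification. In a CI/CD scheduler it is usually worth separating infrastructure failures (runner lost, registry timeout), which are reasonable to retry, from genuine build or test failures, which should fail fast or go through the flaky-test policy instead.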