Design CI/CD Build Caching
Company: OpenAI
Role: Software Engineer
Category: System Design
Interview Round: Technical Screen
You are given a simple CI/CD platform. Users submit workflow definitions in YAML. A workflow contains multiple jobs, and **for this exercise the jobs run sequentially**. Each job executes inside a container-like build environment (a fresh, isolated sandbox).
Today the platform has no caching: every job starts from scratch, re-downloads all dependencies, and recompiles everything. Your task is to extend the system to support **build caching, custom image layers, cross-job reuse, and artifact upload to object storage** — while preserving correctness and tenant isolation.
The problem has four parts. Treat "cache" precisely: in CI/CD it can mean a *dependency cache*, a *build-output (compiler) cache*, an *artifact*, or a *container image/layer cache* — and these differ in correctness, size, and invalidation rules. Do not collapse them into one generic key-value store.
### Constraints & Assumptions
- Multi-tenant platform: many organizations and repositories share the worker fleet and object storage.
- Workers are stateless and ephemeral; a job may land on any worker, and workers can crash mid-job.
- Object storage (S3-compatible) is available for blobs (cache archives, artifacts, image layers).
- Builds must be **correct**: a cache hit must never produce a different result than a clean build.
- Cache archives, artifacts, and image layers can be large blobs, so treat storage and transfer cost as a real constraint.
- The interviewer gave no specific scale; state your own assumptions and call out where they would change the design.
### Part 1 — Build caching
Add build caching so repeated builds can reuse previously produced **dependencies** (e.g. `~/.m2`, `node_modules`), **intermediate build outputs** (compiler/object caches), or **container layers**. Define how a user opts in via YAML, where caches are stored, and how the worker restores and saves them.
```hint What kind of cache
Don't treat "cache" as one thing — restoring a dependency directory and restoring compiler outputs have very different correctness properties. Ask, for each type: is a *partial* or approximate restore safe, or does a stale entry silently corrupt the build? The answer shapes how strict your keys need to be.
```
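One way to make that distinction concrete is a per-type fallback policy. The sketch below is illustrative only: `CacheType`, `ALLOWS_FALLBACK`, and `pick_entry` are hypothetical names. It assumes build-output caches are restored as a whole directory, not through a per-object content-addressed compiler cache, and it treats only dependency caches as safe for a prefix fallback.

```python
from enum import Enum
from typing import Optional


class CacheType(Enum):
    DEPENDENCY = "dependency"      # e.g. ~/.m2, node_modules
    BUILD_OUTPUT = "build_output"  # compiler/object caches
    IMAGE_LAYER = "image_layer"    # container layers


# Assumption: only dependency caches tolerate an approximate restore, because
# the package manager re-resolves against the lockfile and repairs differences.
ALLOWS_FALLBACK = {
    CacheType.DEPENDENCY: True,
    CacheType.BUILD_OUTPUT: False,
    CacheType.IMAGE_LAYER: False,
}


def pick_entry(cache_type: CacheType, key: str, restore_keys: list[str],
               committed_keys: list[str]) -> Optional[str]:
    """Return the committed key to restore, or None for a miss.

    committed_keys is assumed to be ordered newest first.
    """
    if key in committed_keys:
        return key  # an exact hit is always acceptable
    if not ALLOWS_FALLBACK[cache_type]:
        return None  # strict types: a miss is better than a stale entry
    for prefix in restore_keys:
        for candidate in committed_keys:
            if candidate.startswith(prefix):
                return candidate
    return None
```

With this policy a dependency cache can restore the newest entry that shares a prefix, while a build-output or layer lookup either hits exactly or falls through to a clean build.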
```hint Cache-key inputs
The key has to be a function of *every input that affects the output* — if you forget one, a hit returns a stale or incompatible cache. Enumerate the inputs that vary between builds (think environment, toolchain, and what pins each dependency) and ask whether tags vs. immutable digests matter. Then ask which cache types can safely fall back to a broader key and which cannot.
```
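A minimal sketch of such a key function follows. The input list (OS/architecture, image digest, toolchain version, lockfile contents, and a schema version) is an assumed set, not an exhaustive one, and `cache_key` is a hypothetical helper name.

```python
import hashlib
import json


def cache_key(cache_type: str, os_arch: str, image_digest: str,
              toolchain: str, lockfile_bytes: bytes, schema_version: int = 1) -> str:
    """Hash every input that can change the cached output into one key."""
    inputs = {
        "type": cache_type,
        "os_arch": os_arch,
        "image_digest": image_digest,   # immutable digest, never a mutable tag
        "toolchain": toolchain,
        "lockfile_sha256": hashlib.sha256(lockfile_bytes).hexdigest(),
        "schema_version": schema_version,  # bump to invalidate every old entry
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return f"{cache_type}-{hashlib.sha256(canonical.encode()).hexdigest()}"
```

Because the toolchain and image digest are part of the hash, two jobs that differ in either one derive different keys and never read each other's entries.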
```hint Storage + commit
Blobs live in object storage with a small metadata record. The thing to get right: a reader must never observe a half-written or corrupt cache. Think about the ordering between uploading the blob, verifying it, and making it visible — and what guarantee an entry needs once it *is* visible so a concurrent writer can't mutate it underneath a reader.
```
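The ordering can be sketched with an in-memory stand-in. `CacheStore` and `CacheEntry` are hypothetical, and the dictionary `setdefault` call stands in for whatever conditional write (put-if-absent) the real metadata store offers.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheEntry:
    key: str
    object_uri: str
    sha256: str
    size: int


class CacheStore:
    """In-memory stand-in for object storage plus a metadata table."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.committed: dict[str, CacheEntry] = {}

    def save(self, key: str, archive: bytes, writer_id: str) -> CacheEntry:
        digest = hashlib.sha256(archive).hexdigest()
        # 1. Upload to a writer-unique path so concurrent writers never share an object.
        uri = f"cache/{key}/{writer_id}.tar"
        self.blobs[uri] = archive
        # 2. Verify the stored bytes before the entry becomes visible.
        if hashlib.sha256(self.blobs[uri]).hexdigest() != digest:
            del self.blobs[uri]
            raise IOError("uploaded blob failed checksum verification")
        # 3. Commit with put-if-absent: the first writer wins and the entry never changes.
        entry = CacheEntry(key, uri, digest, len(archive))
        winner = self.committed.setdefault(key, entry)
        if winner is not entry:
            del self.blobs[uri]  # the losing writer removes its orphaned blob
        return winner

    def restore(self, key: str) -> Optional[bytes]:
        entry = self.committed.get(key)
        if entry is None:
            return None  # miss: uncommitted uploads are invisible to readers
        data = self.blobs[entry.object_uri]
        if hashlib.sha256(data).hexdigest() != entry.sha256:
            return None  # treat corruption as a miss and do a clean build
        return data
```

Readers only ever follow a committed metadata record, so a half-written blob is unreachable, and a second writer for the same key simply loses the commit.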
### Part 2 — Custom image layers
Let users define reusable build environments with specific compilers, runtimes, and dependencies, declared in YAML. Explain how the platform turns this into reusable, cacheable layers and when it can skip rebuilding a layer.
```hint Model it like a container build
Borrow the layering model containers already use: make each layer **content-addressed** by some digest, so an unchanged layer is reused and a changed one is rebuilt. The hard part is deciding what goes *into* that digest — list everything that could change the layer's contents (parent, the command, copied files, env, args) and convince yourself nothing that affects the output is left out.
```
```hint The correctness trap
Resolve mutable tags (`ubuntu:latest`) to immutable digests before hashing — otherwise the same YAML produces a different environment over time and your cache is silently wrong.
```
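Both hints can be combined in a small digest-chain sketch. The step fields (`run`, `files`, `env`, `args`) and the `resolved_digests` lookup are assumptions made for illustration; a real platform would resolve the tag against its registry.

```python
import hashlib
import json


def layer_digest(parent: str, command: str, file_hashes: dict[str, str],
                 env: dict[str, str], args: dict[str, str]) -> str:
    """Digest of one layer: covers the parent plus everything the step reads."""
    payload = json.dumps(
        {"parent": parent, "command": command, "files": file_hashes,
         "env": env, "args": args},
        sort_keys=True, separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()


def plan_layers(base_ref: str, resolved_digests: dict[str, str],
                steps: list[dict]) -> list[str]:
    """Return the digest chain for an image definition.

    base_ref may be a mutable tag; it is swapped for an immutable digest
    before any hashing happens.
    """
    parent = resolved_digests[base_ref]
    chain = []
    for step in steps:
        parent = layer_digest(parent, step["run"], step.get("files", {}),
                              step.get("env", {}), step.get("args", {}))
        chain.append(parent)
    return chain
```

A layer whose digest already exists in the layer store can be skipped. Because each digest includes its parent, a change to any step, or to the resolved base digest, invalidates that layer and every layer after it.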
### Part 3 — Cache reuse across sequential jobs
Jobs in a workflow run in sequence and may need to share data. Design how a later job reuses work from an earlier one, while preserving correctness and isolation **when two jobs use different compilers or dependencies**.
```hint Not everything shared is the same thing
Before designing the handoff, separate three things: *required* data a later job genuinely depends on, data that's only there to make things faster, and the reusable *environment* itself. They have different correctness contracts. Then revisit the interviewer's "different compilers/dependencies" probe: if your keying scheme is right, do two such jobs even *want* to share, and what stops them from colliding?
```
### Part 4 — Artifact upload to object storage
A job may produce artifacts that downstream jobs need. Design uploading them to S3-like storage, including how to handle **chunk upload failures, worker crashes, retries, and cleanup of partial/abandoned uploads**.
```hint Resumable, idempotent upload
A single PUT can't survive a chunk failure or a worker crash, so reach for the object store's chunked-upload protocol. The properties you need: each chunk retried independently, a replacement worker able to discover what's already uploaded and finish (rather than restart from zero), and a way to both abort a doomed upload and automatically reap ones nobody ever finished. Which S3 primitives give you each of those?
```
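In S3 terms these properties map to the multipart operations: `CreateMultipartUpload`, `UploadPart`, `ListParts`, `CompleteMultipartUpload`, and `AbortMultipartUpload`, plus a bucket lifecycle rule using `AbortIncompleteMultipartUpload` to reap abandoned uploads. The sketch below uses a hypothetical in-memory `FakeMultipartStore` whose method names mirror those operations. It assumes the upload ID is persisted in artifact metadata so a replacement worker can find it.

```python
class FakeMultipartStore:
    """In-memory stand-in; method names mirror the S3 multipart operations."""

    def __init__(self) -> None:
        self.uploads: dict[str, dict[int, bytes]] = {}
        self.objects: dict[str, bytes] = {}

    def create_multipart_upload(self, key: str) -> str:
        upload_id = f"upload-{len(self.uploads) + 1}"
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> None:
        self.uploads[upload_id][part_number] = data  # re-sending a part overwrites it

    def list_parts(self, upload_id: str) -> set[int]:
        return set(self.uploads[upload_id])

    def complete_multipart_upload(self, key: str, upload_id: str) -> None:
        parts = self.uploads.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort_multipart_upload(self, upload_id: str) -> None:
        self.uploads.pop(upload_id, None)


def upload_artifact(store, key, chunks, upload_id=None, max_attempts=3, send=None):
    """Upload chunks, skipping parts that a previous worker already finished."""
    send = send or store.upload_part
    upload_id = upload_id or store.create_multipart_upload(key)
    done = store.list_parts(upload_id)  # discover progress instead of restarting
    for number, chunk in enumerate(chunks, start=1):
        if number in done:
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                send(upload_id, number, chunk)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # the upload ID stays valid so another worker can resume
    # The object only becomes visible under `key` once completion succeeds.
    store.complete_multipart_upload(key, upload_id)
    return upload_id
```

Each part is retried on its own, a resumed call skips whatever `list_parts` reports, and nothing is readable under the final key until the completion call succeeds. A caller that gives up would invoke `abort_multipart_upload`; uploads that nobody aborts are left to the lifecycle rule.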
```hint Visibility + downstream gating
An artifact is a *required* input, not an optimization, so think about its visibility states. When does it become safe to consume, what verifies it's intact, and how should a downstream job behave while it's still uploading versus when the upload failed or expired? Compare that contract to how a cache miss is allowed to behave.
```
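The contrast between the two contracts can be sketched as a small state machine. The state names and the `gate_downstream` and `restore_cache` helpers are hypothetical; the point is only that a missing artifact is an error while a missing cache entry is not.

```python
from enum import Enum
from typing import Optional


class ArtifactState(Enum):
    UPLOADING = "uploading"
    READY = "ready"      # checksum verified, safe to consume
    FAILED = "failed"
    EXPIRED = "expired"


class ArtifactUnavailable(Exception):
    """Raised when a required artifact can never become available."""


def gate_downstream(state: ArtifactState) -> str:
    """Decide what the scheduler does with a job that needs this artifact."""
    if state is ArtifactState.READY:
        return "run"
    if state is ArtifactState.UPLOADING:
        return "wait"  # the producer may still finish; re-check later
    # FAILED or EXPIRED: the input is required, so the job must not start.
    raise ArtifactUnavailable(f"required artifact is {state.value}")


def restore_cache(committed: dict[str, bytes], key: str) -> Optional[bytes]:
    """Contrast: a cache miss is not an error; the job does the work itself."""
    return committed.get(key)
```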
### Clarifying Questions to Ask
- Which cache type is in scope — dependency cache, build-output cache, image-layer cache, or all of them?
- Are caches/artifacts scoped per-repo, per-branch, per-PR, or shared across an organization? What's the isolation requirement for forked-PR builds?
- What are the size and retention expectations — do we need quotas, TTLs, and eviction?
- Must builds be bit-for-bit reproducible, or is "usually faster, with an occasional clean rebuild" acceptable?
- Do we control the worker fleet (can we pin caches locally) or is it a fully ephemeral, autoscaled pool?
- Is the YAML schema fixed, or can we add new top-level keys (`cache:`, `images:`, `artifacts:`)?
### What a Strong Answer Covers
- **Disambiguating "cache"** up front and treating dependency / build-output / image-layer / artifact caches with different correctness and invalidation rules.
- A concrete **YAML/API surface** for opting into caches, defining custom images, and declaring artifact upload/download.
- A **data model**: cache/artifact metadata (key, scope, type, object URI, checksum, size, status, TTL) plus blob storage, with a write-then-commit protocol.
- **Core components**: workflow/parse service, scheduler, worker/runner, cache-metadata service, artifact service, image/layer store, object storage.
- A precise **cache-key strategy** (which inputs are hashed; digests not tags; `restoreKeys` only where safe) and **invalidation** (content keys, TTL, version field, quota/LRU eviction).
- The **artifacts-vs-caches-vs-layers** distinction for cross-job reuse, and why different compilers/deps don't collide.
- **Failure handling**: multipart resume on crash, chunk retries, partial-upload cleanup, stuck-`uploading` expiry, immutability.
- **Security/isolation**: tenant/repo/branch scoping, fork-PR cache poisoning, never caching secrets, encryption, short-lived presigned URLs.
- **Trade-offs** (local vs remote cache, dependency vs build-output cache) and **observability** (hit rate by type, time saved, upload success rate).
### Follow-up Questions
- A malicious fork PR runs in the same repo's build pipeline. How do you prevent it from reading a privileged cache that contains secrets, or *poisoning* a cache that the protected branch later restores?
- Two workers finish the same job with the same cache key at nearly the same time and both try to write. What's your write/commit protocol, and who wins?
- The platform later wants to run jobs **in parallel** instead of sequentially. What in your design breaks, and what changes?
- Object storage egress and the metadata service are both getting expensive. How would you decide what to cache, set quotas/eviction, and measure whether caching is actually paying for itself?
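For the fork-PR question, one possible scoping rule is that trust flows in one direction only: a PR build may read the default branch's caches but writes only into its own scope, and the default branch never reads a PR scope. The sketch below assumes that rule, assumes caches never contain secrets, and uses hypothetical names (`BuildContext`, `write_scope`, `read_scopes`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildContext:
    repo: str
    ref: str                              # e.g. "refs/heads/main" or "refs/pull/42"
    default_ref: str = "refs/heads/main"  # assumed protected branch


def write_scope(ctx: BuildContext) -> str:
    """A build may only write into the scope of its own ref."""
    return f"{ctx.repo}/{ctx.ref}"


def read_scopes(ctx: BuildContext) -> list[str]:
    """Scopes searched on restore, most specific first."""
    own = f"{ctx.repo}/{ctx.ref}"
    default = f"{ctx.repo}/{ctx.default_ref}"
    # The default branch reads only itself, so a PR can never poison it.
    return [own] if own == default else [own, default]
```

Under this rule a poisoned entry written by a fork PR stays inside that PR's scope, which the protected branch never searches.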
Quick Answer: This question evaluates design skills around CI/CD build caching, including distinguishing cache types (dependency caches, compiler outputs, container image layers), cache-key correctness, object-storage durability, and tenant isolation in a distributed, ephemeral worker environment.