Design a CI/CD system with stuck-job handling

This is a System Design interview question from OpenAI for Software Engineer roles.

Question

Scenario

Design a CI/CD platform similar to GitHub Actions/Jenkins that:

Triggers pipelines on events (e.g., push/PR/merge).
Runs pipelines as a DAG of steps (build/test/deploy).
Executes steps on a fleet of workers.
Exposes status to users (queued/running/succeeded/failed/canceled).
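
To make the scenario concrete, the statuses and the DAG scheduling rule can be sketched as a small state machine. This is a minimal illustration, not a prescribed design: the names `JobState`, `ALLOWED`, `can_transition`, and `ready_steps` are hypothetical, and the transition table is one reasonable assumption about which moves are legal (terminal states never change).

```python
from enum import Enum
from typing import Dict, List


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumed transition table: terminal states have no outgoing edges.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.CANCELED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.CANCELED},
}


def can_transition(current: JobState, target: JobState) -> bool:
    """Return True if moving from `current` to `target` is a legal transition."""
    return target in ALLOWED.get(current, set())


def ready_steps(deps: Dict[str, List[str]], states: Dict[str, JobState]) -> List[str]:
    """Return QUEUED steps whose upstream steps have all SUCCEEDED.

    `deps` maps each step name to the steps it depends on (the DAG edges).
    """
    return sorted(
        step
        for step, parents in deps.items()
        if states[step] is JobState.QUEUED
        and all(states[p] is JobState.SUCCEEDED for p in parents)
    )
```

With `deps = {"build": [], "test": ["build"], "deploy": ["test"]}` and only `build` succeeded, `ready_steps` would hand `test` to a worker while `deploy` keeps waiting.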

Required deep dive

A known issue: sometimes jobs get stuck in RUNNING forever (e.g., worker crashes, network partition). Explain how you would:

Detect stuck RUNNING jobs.
Transition them safely to a terminal state.
Avoid incorrectly failing slow-but-legitimate jobs.
Make the system robust to retries and duplicates.

You may assume a multi-tenant environment and that correctness of job state is important.
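
One common way to approach the four points above is a heartbeat-renewed lease combined with a per-attempt fencing token. The sketch below is an in-memory illustration under stated assumptions, not the expected answer: `JobStore`, `claim`, `heartbeat`, `complete`, and `reap_expired` are hypothetical names, the dictionary stands in for a database that supports conditional updates, and the 30-second lease and 2-attempt limit are arbitrary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    state: str = "QUEUED"
    attempt: int = 0               # fencing token, bumped on every claim
    lease_expires_at: float = 0.0  # renewed by worker heartbeats


class JobStore:
    """In-memory stand-in for a store with compare-and-set style updates."""

    def __init__(self, lease_seconds: float = 30.0, max_attempts: int = 2):
        self.jobs: Dict[str, Job] = {}
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts

    def enqueue(self, job_id: str) -> None:
        # Duplicate trigger events for the same job id are a no-op.
        self.jobs.setdefault(job_id, Job(job_id))

    def claim(self, job_id: str, now: float) -> Optional[int]:
        """Move QUEUED -> RUNNING and return the attempt token, else None."""
        job = self.jobs[job_id]
        if job.state != "QUEUED":
            return None
        job.state = "RUNNING"
        job.attempt += 1
        job.lease_expires_at = now + self.lease_seconds
        return job.attempt

    def heartbeat(self, job_id: str, attempt: int, now: float) -> bool:
        """Extend the lease; a slow job stays alive as long as it heartbeats."""
        job = self.jobs[job_id]
        if job.state != "RUNNING" or job.attempt != attempt:
            return False  # stale worker: its attempt was already superseded
        job.lease_expires_at = now + self.lease_seconds
        return True

    def complete(self, job_id: str, attempt: int, result: str) -> bool:
        """Apply a terminal result only if the caller holds the current attempt."""
        job = self.jobs[job_id]
        if job.state != "RUNNING" or job.attempt != attempt:
            return False  # duplicate or late report is ignored
        job.state = result
        return True

    def reap_expired(self, now: float) -> List[str]:
        """Detect RUNNING jobs whose lease lapsed; requeue or fail them."""
        reaped = []
        for job in self.jobs.values():
            if job.state == "RUNNING" and job.lease_expires_at < now:
                retry = job.attempt < self.max_attempts
                job.state = "QUEUED" if retry else "FAILED"
                reaped.append(job.job_id)
        return reaped
```

In this sketch, detection depends on missed heartbeats rather than total runtime, so a long build is not failed while its worker is still reporting in. A worker that comes back after a partition holds an old attempt number, so its late `complete` call is rejected instead of overwriting the newer attempt's state.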
