Scenario
Design a CI/CD platform similar to GitHub Actions/Jenkins that:
- Triggers pipelines on events (e.g., push/PR/merge).
- Runs pipelines as a DAG of steps (build/test/deploy).
- Executes steps on a fleet of workers.
- Exposes status to users (queued/running/succeeded/failed/canceled).
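To make the user-visible statuses concrete, one option is an explicit transition table. The sketch below is only an illustration: the `Status` enum, the `ALLOWED` table, and `can_transition` are hypothetical names, and the choice that terminal states never transition again is an assumption, not something the scenario specifies.

```python
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELED = "canceled"


# Assumption: once a job reaches one of these states, it never changes again.
TERMINAL = {Status.SUCCEEDED, Status.FAILED, Status.CANCELED}

# Assumed legal transitions; terminal states have no outgoing edges.
ALLOWED = {
    Status.QUEUED: {Status.RUNNING, Status.CANCELED},
    Status.RUNNING: {Status.SUCCEEDED, Status.FAILED, Status.CANCELED},
}


def can_transition(src: Status, dst: Status) -> bool:
    """Return True if moving from src to dst is allowed by the table."""
    return dst in ALLOWED.get(src, set())
```

Keeping the table in one place means every writer (worker, scheduler, cancel API) can be checked against the same rules.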
Required deep dive
A known issue: sometimes jobs get stuck in RUNNING forever (e.g., worker crashes, network partition). Explain how you would:
- Detect stuck RUNNING jobs.
- Transition them safely to a terminal state.
- Avoid incorrectly failing slow-but-legitimate jobs.
- Make the system robust to retries and duplicates.
You may assume a multi-tenant environment and that correctness of job state is important.
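One common direction for this deep dive is a heartbeat-renewed lease combined with a fencing token. The sketch below is a minimal in-memory illustration under stated assumptions, not a definitive design: `JobStore`, `Job`, the `attempt` counter, and the 30-second lease are all hypothetical, and a real system would perform each method as a single conditional (compare-and-set) update in a durable store. Times are passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TERMINAL = {"SUCCEEDED", "FAILED", "CANCELED"}


@dataclass
class Job:
    status: str = "QUEUED"
    attempt: int = 0               # fencing token, bumped on every claim
    lease_expires_at: float = 0.0  # deadline renewed by heartbeats


class JobStore:
    """In-memory stand-in for a database with conditional updates."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.jobs: Dict[str, Job] = {}
        self.lease_seconds = lease_seconds

    def enqueue(self, job_id: str) -> None:
        self.jobs[job_id] = Job()

    def claim(self, job_id: str, now: float) -> Optional[int]:
        """Move QUEUED -> RUNNING and return the fencing token, else None."""
        job = self.jobs[job_id]
        if job.status != "QUEUED":
            return None
        job.status = "RUNNING"
        job.attempt += 1
        job.lease_expires_at = now + self.lease_seconds
        return job.attempt

    def heartbeat(self, job_id: str, attempt: int, now: float) -> bool:
        """Extend the lease; slow but live jobs stay RUNNING this way."""
        job = self.jobs[job_id]
        if job.status != "RUNNING" or job.attempt != attempt:
            return False  # stale worker or already-finished job
        job.lease_expires_at = now + self.lease_seconds
        return True

    def complete(self, job_id: str, attempt: int, outcome: str) -> bool:
        """Record a terminal outcome; duplicate reports are accepted as no-ops."""
        if outcome not in TERMINAL:
            raise ValueError(f"not a terminal status: {outcome}")
        job = self.jobs[job_id]
        if job.attempt != attempt:
            return False  # fenced off: report comes from an old attempt
        if job.status == outcome:
            return True   # duplicate delivery of the same result
        if job.status != "RUNNING":
            return False  # already terminal with a different outcome
        job.status = outcome
        return True

    def reap_expired(self, now: float) -> List[str]:
        """Fail RUNNING jobs whose lease lapsed without a heartbeat."""
        reaped = []
        for job_id, job in self.jobs.items():
            if job.status == "RUNNING" and job.lease_expires_at < now:
                job.status = "FAILED"
                reaped.append(job_id)
        return reaped
```

In this sketch a job is judged by missed heartbeats rather than by total runtime, so a long-running step that keeps renewing its lease is never reaped, while a crashed or partitioned worker loses its lease and any late report from it is rejected by the `attempt` check.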