This question evaluates a candidate's competence in designing fault-tolerant distributed systems, focusing on state management, failure detection, safe terminal transitions, and correctness of job lifecycle handling in a multi-tenant CI/CD environment.
Design a CI/CD platform similar to GitHub Actions/Jenkins that:
A known issue: sometimes jobs get stuck in RUNNING forever (e.g., worker crashes, network partition). Explain how you would:
RUNNING jobs.
You may assume a multi-tenant environment and that correctness of job state is important.
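As a starting point for discussing the stuck-RUNNING problem, here is a minimal sketch of one common approach: time-bounded leases renewed by worker heartbeats, plus an attempt counter used as a fencing token so that a late-reporting worker cannot overwrite a terminal state. It is not the expected answer, only an illustration. Every name here (`JobStore`, `claim`, `heartbeat`, `finish`, `reap_expired`, the `TIMED_OUT` state, the 30-second lease) is hypothetical, and the in-memory dictionary stands in for what would presumably be conditional updates against a durable, tenant-partitioned store.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"  # hypothetical terminal state set by the reaper


TERMINAL = {State.SUCCEEDED, State.FAILED, State.TIMED_OUT}


@dataclass
class Job:
    job_id: str
    tenant_id: str
    state: State = State.QUEUED
    attempt: int = 0               # fencing token, bumped on every claim
    lease_expires_at: float = 0.0  # RUNNING is only trusted until this time


class JobStore:
    """In-memory stand-in; each method models one atomic conditional update."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self.jobs: Dict[str, Job] = {}

    def submit(self, job_id: str, tenant_id: str) -> None:
        self.jobs[job_id] = Job(job_id=job_id, tenant_id=tenant_id)

    def claim(self, job_id: str, now: float) -> Optional[int]:
        """QUEUED -> RUNNING. Returns the attempt token, or None if not claimable."""
        job = self.jobs[job_id]
        if job.state is not State.QUEUED:
            return None
        job.state = State.RUNNING
        job.attempt += 1
        job.lease_expires_at = now + self.lease_seconds
        return job.attempt

    def heartbeat(self, job_id: str, attempt: int, now: float) -> bool:
        """Extend the lease only if the caller still owns the current attempt."""
        job = self.jobs[job_id]
        if job.state is not State.RUNNING or job.attempt != attempt:
            return False
        job.lease_expires_at = now + self.lease_seconds
        return True

    def finish(self, job_id: str, attempt: int, result: State) -> bool:
        """RUNNING -> terminal, rejected if the job was already finalized."""
        if result not in TERMINAL:
            raise ValueError("result must be a terminal state")
        job = self.jobs[job_id]
        if job.state is not State.RUNNING or job.attempt != attempt:
            return False
        job.state = result
        return True

    def reap_expired(self, now: float) -> List[str]:
        """Move RUNNING jobs whose lease has lapsed to TIMED_OUT."""
        reaped = []
        for job in self.jobs.values():
            if job.state is State.RUNNING and job.lease_expires_at <= now:
                job.state = State.TIMED_OUT
                reaped.append(job.job_id)
        return reaped
```

In this sketch a crashed or partitioned worker simply stops heartbeating, the reaper finalizes the job once the lease lapses, and any later `finish` call from that worker returns `False` instead of changing the recorded outcome. Whether a timed-out job should be retried, and how leases are sized per tenant, are design choices left open here.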