You own a single-machine web server (one host running the web service). Suddenly users report that a specific page (or the whole website) is down.
-
Live troubleshooting:
Walk through how you would triage and debug the outage end-to-end. The interviewer may introduce different scenarios (e.g., high latency, 5xx, timeouts, only one endpoint failing, intermittent failures).
-
Resilience improvements:
After mitigation, propose how to redesign/operate the system to be more resilient and reduce the blast radius of similar failures in the future (you can choose the architecture and operational practices).
Assume you have standard production access (logs/metrics, SSH, ability to roll back/deploy, etc.), but start from a single-node baseline.