You are on call for a Linux-based web application that lets customers submit orders and receive confirmations. In the middle of the night, you receive an alert that the service is down. The incident is still ongoing, all users are affected, there was no recent code deployment, and traffic volume has not changed.
You do not have dashboards or metrics. You can only investigate through Linux commands and by checking logs.
During debugging, you discover the following:
- Application and system logs show that one API is returning HTTP 500 errors.
- That API fails while trying to write data to the database.
- The server disk is not full.
- The main database storage is not full.
- A request queue is also not backed up.
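As an illustration of how a "disk is not full" finding might be checked without dashboards, here is a minimal Python sketch, assuming a Unix host where `os.statvfs` is available. It reports block usage and inode usage separately, since a filesystem can have free blocks and still have no free inodes. The function name `filesystem_usage` and the paths in the usage loop are hypothetical and not part of the scenario.

```python
import os


def filesystem_usage(path="/"):
    """Return percent of data blocks and inodes in use on the filesystem holding `path`."""
    st = os.statvfs(path)

    # Block usage: total blocks minus free blocks. This is a rough figure and
    # can differ slightly from `df`, which accounts for root-reserved blocks.
    blocks_used_pct = 0.0
    if st.f_blocks:
        blocks_used_pct = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks

    # Inode usage: some filesystems report 0 total inodes, so return None there.
    inodes_used_pct = None
    if st.f_files:
        inodes_used_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files

    return {
        "path": path,
        "blocks_used_pct": blocks_used_pct,
        "inodes_used_pct": inodes_used_pct,
    }


if __name__ == "__main__":
    # Hypothetical paths; substitute the mount points the service actually uses.
    for candidate in ("/", "/var/log", "/tmp"):
        if os.path.exists(candidate):
            print(filesystem_usage(candidate))
```

On the command line, `df -h` and `df -i` give the same two views.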
Given these findings:
- Explain how you would systematically troubleshoot the outage.
- Identify the most likely root cause.
- Describe how you would confirm it.
- Describe both the immediate mitigation and the long-term prevention steps.