This question evaluates operational incident response and systems troubleshooting skills, including understanding of disk/storage behavior, log and metric analysis, and architecture-level impact assessment within the Software Engineering Fundamentals domain.
You are on-call for a production web service. Users report the website is down (timeouts/5xx). You are told the incident turns out to involve a full disk on one or more machines.
Walk through how you would:
Assume a typical setup (load balancer → web/app tier → DB/cache; logs/metrics available). Explain what commands/signals you would look at, what hypotheses you would test, and what pitfalls you’d avoid.