This question evaluates proficiency in diagnosing performance regressions in scaled production services, emphasizing observability, bottleneck identification across components (compute, memory, I/O, network, caches, databases), and incident triage skills.
You are on-call for a production service that recently scaled up (more instances, more users/traffic). After the scale-up, users report the system is “much slower” (higher latency, timeouts), even though the service is still functional.
Design a practical, step-by-step troubleshooting approach to identify the bottleneck(s) and stabilize the system.
Cover at least: