You are on-call for a production service that recently scaled up (more instances, more users/traffic). After the scale-up, users report the system is “much slower” (higher latency, timeouts), even though the service is still functional.
Design a practical, step-by-step troubleshooting approach to identify the bottleneck(s) and stabilize the system.
Cover at least:
-
What metrics and dashboards you would check first (client, load balancer, service, downstream dependencies).
-
How you would isolate whether the issue is CPU, memory/GC, disk I/O, network, database, cache, or a specific dependency.
-
How you would use logs, tracing, and profiling to narrow it down.
-
Immediate mitigations vs. longer-term fixes.
-
Common “scaled-up system got slower” root causes (e.g., thundering herd, connection pool saturation, cache miss storms, lock contention, hot partitions).