Diagnose why a scaled system became slow

Q: Diagnose why a scaled system became slow

This is a System Design interview question from Atlassian for Software Engineer roles.


Question


You are on-call for a production service that recently scaled up (more instances, more users/traffic). After the scale-up, users report the system is “much slower” (higher latency, timeouts), even though the service is still functional.

Design a practical, step-by-step troubleshooting approach to identify the bottleneck(s) and stabilize the system.

Cover at least:

- What metrics and dashboards you would check first (client, load balancer, service, downstream dependencies).
- How you would isolate whether the issue is CPU, memory/GC, disk I/O, network, database, cache, or a specific dependency.
- How you would use logs, tracing, and profiling to narrow it down.
- Immediate mitigations vs. longer-term fixes.
- Common “scaled-up system got slower” root causes (e.g., thundering herd, connection pool saturation, cache miss storms, lock contention, hot partitions).
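
To make one of the listed root causes concrete, here is a minimal sketch of a connection pool saturation check. It applies Little's Law (in-flight work = arrival rate × time in system) to estimate how many database connections each instance needs, and it also checks whether adding instances pushed the fleet-wide connection count past the database limit. The function name, field names, and every number below are hypothetical and for illustration only; they are not taken from the question or from any real system.

```python
from dataclasses import dataclass


@dataclass
class PoolCheck:
    required_per_instance: float  # average connections busy at once
    utilization: float            # required / configured pool size
    total_db_connections: int     # pool size summed over all instances
    exceeds_db_limit: bool        # True if the fleet can exceed the DB cap


def check_pool(
    rps_per_instance: float,
    queries_per_request: float,
    avg_query_seconds: float,
    pool_size: int,
    instances: int,
    db_max_connections: int,
) -> PoolCheck:
    """Estimate connection demand with Little's Law (hypothetical helper)."""
    # Little's Law: average in-flight queries = query rate * query latency.
    required = rps_per_instance * queries_per_request * avg_query_seconds
    # Worst case: every instance fills its pool at the same time.
    total = pool_size * instances
    return PoolCheck(
        required_per_instance=required,
        utilization=required / pool_size,
        total_db_connections=total,
        exceeds_db_limit=total > db_max_connections,
    )


if __name__ == "__main__":
    # Assumed numbers: 200 req/s per instance, 2 queries per request,
    # 50 ms per query, pool of 10, 40 instances, DB capped at 300.
    result = check_pool(200, 2, 0.05, 10, 40, 300)
    print(result)
```

A utilization above 1.0 suggests requests queue while waiting for a connection, which shows up as added latency even when CPU looks idle. If `exceeds_db_limit` is true, the scale-up itself may be the cause, since each new instance brings its own pool.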
