Diagnose why a scaled system became slow
Company: Atlassian
Role: Software Engineer
Category: System Design
Difficulty: Medium
Interview Round: Technical Screen
You are on-call for a production service that recently scaled up (more instances, more users/traffic). After the scale-up, users report the system is “much slower” (higher latency, timeouts), even though the service is still functional.
Design a practical, step-by-step troubleshooting approach to identify the bottleneck(s) and stabilize the system.
Cover at least:
- What metrics and dashboards you would check first (client, load balancer, service, downstream dependencies).
- How you would isolate whether the issue is CPU, memory/GC, disk I/O, network, database, cache, or a specific dependency.
- How you would use logs, tracing, and profiling to narrow it down.
- Immediate mitigations vs. longer-term fixes.
- Common “scaled-up system got slower” root causes (e.g., thundering herd, connection pool saturation, cache miss storms, lock contention, hot partitions).
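One of the root causes listed above, connection pool saturation, can often be checked with simple arithmetic before any deep profiling. The sketch below is illustrative only: the function names, the per-instance pool size, and the database connection limit are all hypothetical values, not taken from any real system. It compares the total connections a scaled-out fleet could open against a database limit, and uses Little's Law (concurrency = arrival rate × time in system) to estimate how many in-flight requests a given load implies.

```python
def pool_pressure(instances, pool_size_per_instance, db_max_connections):
    """Return (total possible connections, fraction of the DB limit used).

    All three inputs are assumed values for illustration; read the real
    ones from your service config and database settings.
    """
    total = instances * pool_size_per_instance
    return total, total / db_max_connections


def required_concurrency(requests_per_second, mean_latency_ms):
    """Little's Law: average in-flight requests = arrival rate * latency."""
    return requests_per_second * mean_latency_ms / 1000


if __name__ == "__main__":
    # Hypothetical fleet before and after the scale-up.
    for instances in (10, 40):
        total, ratio = pool_pressure(instances, 20, 500)
        status = "over the limit" if ratio > 1 else "within the limit"
        print(f"{instances} instances: {total} connections ({status})")

    # Hypothetical load: 2000 req/s at 50 ms mean latency.
    print("in-flight requests:", required_concurrency(2000, 50))
```

If the ratio exceeds 1 after the scale-up, adding instances may have pushed the shared database past its connection limit, which would be consistent with the whole system getting slower as capacity was added. If the in-flight estimate exceeds the threads or connections available, requests are likely queuing.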
Quick Answer: This question evaluates proficiency in diagnosing performance regressions in scaled production services, emphasizing observability, bottleneck identification across components (compute, memory, I/O, network, caches, databases), and incident triage skills.
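To make bottleneck identification across components concrete, here is a minimal sketch that ranks hops by tail latency. It assumes span durations have already been exported from a tracing system into a plain dictionary; the hop names, the sample values, and the helper names `percentile` and `slowest_hops` are hypothetical. It uses a nearest-rank percentile, which is one of several common definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slowest_hops(span_durations_ms, pct=99):
    """Rank hops by the given latency percentile, slowest first.

    span_durations_ms maps a hop name to a list of durations in ms;
    hops with no samples are skipped.
    """
    ranked = [
        (hop, percentile(durations, pct))
        for hop, durations in span_durations_ms.items()
        if durations
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    spans = {
        "load_balancer": [1, 2, 2, 3],
        "service_cpu": [5, 6, 7, 40],
        "database": [10, 12, 300, 900],
        "cache": [1, 1, 2, 2],
    }
    for hop, value in slowest_hops(spans):
        print(f"{hop}: p99 = {value} ms")
```

Ranking by a high percentile rather than the mean is a deliberate choice here: timeouts reported by users tend to show up in the tail first, so a hop with a modest average but a very large p99 is usually the better place to start looking.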