Troubleshooting Prompt: Intermittent Slowness in a Web Application
A client reports that a web application is intermittently very slow when generating reports or performing searches. No code has changed in the last five months.
Assume a typical multi-tier architecture: browser, CDN or load balancer, application services, database and search cluster, caches, background jobs, and external dependencies.
Explain your step-by-step troubleshooting approach.
Constraints & Assumptions
- Treat this as an incident and diagnosis problem, not just a generic performance checklist.
- Intermittent slowness may come from data growth, traffic shifts, infrastructure changes, cache behavior, database plans, search indexing, background jobs, or network issues even when application code has not changed.
- The answer should show how you scope impact, localize the bottleneck, form hypotheses, test safely, and prioritize fixes.
Clarifying Questions to Ask
- Which actions are slow: reports, searches, login, page load, exports, or all requests?
- What timestamps, user IDs, accounts, geographies, browsers, and query parameters are affected?
- Is the slowness in time to first byte, client rendering, backend processing, database query, search query, or download?
- Did data volume, traffic, configuration, infrastructure, dependencies, or scheduled jobs change recently?
- What SLOs or customer-impact thresholds apply?
Part 1 - Scope and Verify Impact
Explain how you determine whether the issue affects one user, one account, one segment, or many users.
What This Part Should Cover
- Collect timestamps, HAR files, request IDs, user/account IDs, screenshots, and affected workflows.
- Check p50/p95/p99 latency, error rate, traffic, and saturation over time.
- Segment by tenant, geography, browser, device, app version, endpoint, query type, and time of day.
- Compare affected users with unaffected peers.
- Decide severity and communication cadence.
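The percentile and segmentation checks above can be sketched in a few lines. This is a minimal illustration, assuming request logs are available as (tenant, endpoint, latency) records; the tenant names, endpoint, and latency values are hypothetical sample data, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
from collections import defaultdict

# Hypothetical request-log records: (tenant, endpoint, latency in ms).
# Names and values are illustrative only, not taken from a real system.
REQUESTS = [
    ("acme", "/reports", 100), ("acme", "/reports", 110),
    ("acme", "/reports", 120), ("acme", "/reports", 130),
    ("acme", "/reports", 9000),
    ("globex", "/reports", 100), ("globex", "/reports", 105),
    ("globex", "/reports", 115), ("globex", "/reports", 125),
    ("globex", "/reports", 140),
]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_segment(records):
    """Group latencies by (tenant, endpoint) and summarize each group."""
    groups = defaultdict(list)
    for tenant, endpoint, latency_ms in records:
        groups[(tenant, endpoint)].append(latency_ms)
    return {
        segment: {
            "count": len(latencies),
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
        }
        for segment, latencies in groups.items()
    }


if __name__ == "__main__":
    for segment, stats in sorted(latency_by_segment(REQUESTS).items()):
        print(segment, stats)
```

In this sample the two tenants have similar medians, but only one has a large tail, which is the pattern that points toward a segment-specific cause rather than a platform-wide one. With so few samples per segment the tail percentiles are dominated by a single request, so real analysis would need far more data per group.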
Part 2 - Gather Metrics and Localize the Bottleneck
Describe how you analyze server logs, database queries, network latency, search cluster behavior, caches, and background jobs.
What This Part Should Cover
- Golden signals: latency, traffic, errors, saturation.
- Client/browser waterfall and CDN/load-balancer metrics.
- Application traces, logs, queue depth, thread pools, memory, CPU, and dependency calls.
- Database slow queries, query plans, locks, connection pool, indexes, and data growth.
- Search cluster query latency, indexing lag, shard health, and cache hit rate.
- Background jobs, cron schedules, cache eviction, and external dependencies.
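One way to localize the bottleneck from trace data is to compare where slow requests spend their time against where fast requests do. The sketch below assumes spans have already been exported as (request ID, tier, duration) tuples and that spans within a request do not overlap, so their durations can be summed; the tier names, the 1000 ms threshold, and the sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trace spans: (request_id, tier, duration in ms).
SPANS = [
    ("r1", "app", 40), ("r1", "db", 60), ("r1", "search", 30),
    ("r2", "app", 45), ("r2", "db", 4200), ("r2", "search", 35),
    ("r3", "app", 50), ("r3", "db", 3900), ("r3", "search", 28),
    ("r4", "app", 42), ("r4", "db", 55), ("r4", "search", 31),
]

SLOW_THRESHOLD_MS = 1000  # assumed cutoff for calling a request "slow"


def tier_breakdown(spans, threshold_ms):
    """Return mean time per tier, separately for slow and fast requests."""
    per_request = defaultdict(lambda: defaultdict(int))
    for request_id, tier, duration_ms in spans:
        per_request[request_id][tier] += duration_ms

    totals = {"slow": defaultdict(int), "fast": defaultdict(int)}
    counts = {"slow": 0, "fast": 0}
    for tiers in per_request.values():
        bucket = "slow" if sum(tiers.values()) >= threshold_ms else "fast"
        counts[bucket] += 1
        for tier, duration_ms in tiers.items():
            totals[bucket][tier] += duration_ms

    # An empty bucket has no tiers, so no division by zero occurs here.
    return {
        bucket: {tier: total / counts[bucket] for tier, total in tiers.items()}
        for bucket, tiers in totals.items()
    }


def suspect_tier(breakdown):
    """Pick the tier whose mean time grows most between fast and slow requests."""
    slow, fast = breakdown["slow"], breakdown["fast"]
    if not slow:
        return None
    return max(slow, key=lambda tier: slow[tier] - fast.get(tier, 0))


if __name__ == "__main__":
    breakdown = tier_breakdown(SPANS, SLOW_THRESHOLD_MS)
    print(breakdown)
    print("suspect tier:", suspect_tier(breakdown))
```

A result like this only narrows the search to a tier. The next step would still be to inspect that tier's own signals, such as slow query logs, query plans, and lock waits for a database.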
Part 3 - Hypotheses, Experiments, and Fixes
Explain how you form hypotheses, test them, and prioritize fixes.
What This Part Should Cover
- Rank hypotheses by impact, likelihood, and testability.
- Use safe experiments such as replaying queries, disabling a job, adding an index in staging, warming caches, or routing traffic.
- Separate mitigation from root-cause fix.
- Prioritize customer-impact reduction, reversibility, and long-term prevention.
- Add monitoring and regression tests after the fix.
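Ranking hypotheses by impact, likelihood, and testability can be made explicit with a simple scoring table. The sketch below is one possible scheme, not a standard method: the 1 to 5 scales, the multiplicative score, and the example hypotheses and their ratings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    impact: int        # 1 (minor) to 5 (explains most of the slowness)
    likelihood: int    # 1 (unlikely) to 5 (strongly supported by evidence)
    testability: int   # 1 (hard or risky to test) to 5 (quick, safe test)

    @property
    def score(self) -> int:
        # Multiplying means a very low rating on any axis pulls the item down.
        return self.impact * self.likelihood * self.testability


# Hypothetical candidates and ratings for the scenario in this prompt.
HYPOTHESES = [
    Hypothesis("Network packet loss between app and database", 4, 1, 2),
    Hypothesis("Query plan regression after data growth", 5, 4, 4),
    Hypothesis("Search shard imbalance", 3, 2, 3),
    Hypothesis("Scheduled batch job contending for database I/O", 4, 3, 5),
]


def rank(hypotheses):
    """Return hypotheses ordered from highest to lowest score."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


if __name__ == "__main__":
    for h in rank(HYPOTHESES):
        print(f"{h.score:>3}  {h.name}")
```

The numbers matter less than the discussion they force: writing down a rating for each axis makes it clear which hypotheses are cheap to rule out first and which need a safer experiment before testing in production.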
What a Strong Answer Covers
A strong answer scopes the blast radius first, localizes the problem across tiers, uses data to test hypotheses, communicates clearly with stakeholders, and ships mitigations plus prevention rather than guessing from symptoms.
Follow-up Questions
- What if only one enterprise account is affected?
- What if p99 latency spikes but p50 is stable?
- What if no application code changed but database query plans changed?
- How would you communicate with the client during investigation?
- What monitoring would you add afterward?