Web Application Performance Debugging

Q: Web Application Performance Debugging

Practice troubleshooting intermittent slowness in a multi-tier web application. The solution covers impact scoping, client and network checks, app traces, database and search diagnostics, cache behavior, background jobs, hypothesis testing, mitigation, stakeholder communication, and prevention.

Q: What difficulty level is this interview question?

This is a medium-difficulty Product / Decision Making question, commonly asked during Technical Screen rounds at Google.

Q: What role is this question designed for?

This question is commonly asked of Product Manager candidates at Google during technical interviews.

Question

Troubleshooting Prompt: Intermittent Slowness in a Web Application

A client reports that a web application is intermittently very slow when generating reports or performing searches. No code has changed in the last five months.

Assume a typical multi-tier architecture: browser, CDN or load balancer, application services, database and search cluster, caches, background jobs, and external dependencies.

Explain your step-by-step troubleshooting approach.

Constraints & Assumptions

Treat this as an incident and diagnosis problem, not just a generic performance checklist.
Intermittent slowness may come from data growth, traffic shifts, infrastructure changes, cache behavior, database plans, search indexing, background jobs, or network issues even when application code has not changed.
The answer should show how you scope impact, localize the bottleneck, form hypotheses, test safely, and prioritize fixes.

Clarifying Questions to Ask

Which actions are slow: reports, searches, login, page load, exports, or all requests?
What timestamps, user IDs, accounts, geographies, browsers, and query parameters are affected?
Is the slowness in time to first byte, client rendering, backend processing, database query, search query, or download?
Did data volume, traffic, configuration, infrastructure, dependencies, or scheduled jobs change recently?
What SLOs or customer-impact thresholds apply?
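
If the client can supply a HAR file, the question about where the time goes (time to first byte versus download and so on) can be answered from the per-request timings. The sketch below is a minimal illustration, assuming a HAR 1.2 export; the URLs and numbers in sample_har are invented for the example, and slowest_phases is a hypothetical helper name.

```python
import json


def slowest_phases(har_text, top_n=3):
    """Return the top_n slowest entries with their dominant timing phase."""
    entries = json.loads(har_text)["log"]["entries"]
    rows = []
    for entry in entries:
        # HAR uses -1 for phases that do not apply; treat those as 0.
        # Non-numeric fields (such as an optional comment) are skipped.
        timings = {
            name: max(value, 0)
            for name, value in entry["timings"].items()
            if isinstance(value, (int, float))
        }
        phase = max(timings, key=timings.get)
        rows.append((entry["request"]["url"], entry["time"], phase, timings[phase]))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]


# Invented sample data: one slow report request and one ordinary asset fetch.
sample_har = json.dumps({"log": {"entries": [
    {"request": {"url": "https://app.example.test/reports/run"}, "time": 8200,
     "timings": {"blocked": 5, "dns": -1, "connect": -1, "send": 1,
                 "wait": 8100, "receive": 94}},
    {"request": {"url": "https://app.example.test/static/app.js"}, "time": 120,
     "timings": {"blocked": 2, "dns": 10, "connect": 30, "send": 1,
                 "wait": 40, "receive": 37}},
]}})

for url, total, phase, ms in slowest_phases(sample_har):
    print(f"{total:>6.0f} ms  {phase:<8} {ms:>6.0f} ms  {url}")
```

A dominant wait phase (time to first byte) points at the backend tiers; a dominant receive phase points at payload size or the network path.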

Part 1 - Scope and Verify Impact

Explain how you determine whether the issue affects one user, one account, one segment, or many users.

What This Part Should Cover

Collect timestamps, HAR files, request IDs, user/account IDs, screenshots, and affected workflows.
Check p50/p95/p99 latency, error rate, traffic, and saturation over time.
Segment by tenant, geography, browser, device, app version, endpoint, query type, and time of day.
Compare affected users with unaffected peers.
Decide severity and communication cadence.
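
The percentile and segmentation checks above can be sketched as follows. This is an illustration only: it assumes request logs have already been parsed into records with a latency_ms field plus segment fields such as tenant, and it uses a simple nearest-rank percentile. The record layout, tenant names, and numbers are assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_by_segment(records, key):
    """Group records by a segment field and summarize latency per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["latency_ms"])
    summary = {}
    for segment, values in groups.items():
        values.sort()
        summary[segment] = {
            "count": len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
    return summary


# Invented sample: one tenant has a single very slow request, the other does not.
sample_records = (
    [{"tenant": "acme", "latency_ms": ms} for ms in (120, 130, 140, 150, 9000)]
    + [{"tenant": "globex", "latency_ms": ms} for ms in (110, 115, 120, 125, 130)]
)

for tenant, stats in latency_by_segment(sample_records, "tenant").items():
    print(tenant, stats)
```

A segment whose p50 looks like its peers but whose p95 and p99 are far higher suggests a tail problem confined to that segment, which narrows the blast radius before any deeper digging.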

Part 2 - Gather Metrics and Localize the Bottleneck

Describe how you analyze server logs, database queries, network latency, search cluster behavior, caches, and background jobs.

What This Part Should Cover

Golden signals: latency, traffic, errors, saturation.
Client/browser waterfall and CDN/load-balancer metrics.
Application traces, logs, queue depth, thread pools, memory, CPU, and dependency calls.
Database slow queries, query plans, locks, connection pool, indexes, and data growth.
Search cluster query latency, indexing lag, shard health, and cache hit rate.
Background jobs, cron schedules, cache eviction, and external dependencies.
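
One way to localize the bottleneck from trace data is to compare where time is spent in slow requests versus normal ones. The sketch below assumes spans have been exported as records with request_id, tier, and duration_ms fields, and that durations are exclusive (time spent in that tier only). The field names, the threshold, and the sample values are all assumptions for illustration.

```python
from collections import defaultdict


def tier_breakdown(spans, slow_threshold_ms):
    """Mean time per tier for slow and normal requests, keyed by group."""
    per_request = defaultdict(lambda: defaultdict(float))
    for span in spans:
        per_request[span["request_id"]][span["tier"]] += span["duration_ms"]

    sums = {"slow": defaultdict(float), "normal": defaultdict(float)}
    counts = {"slow": 0, "normal": 0}
    for tiers in per_request.values():
        group = "slow" if sum(tiers.values()) >= slow_threshold_ms else "normal"
        counts[group] += 1
        for tier, ms in tiers.items():
            sums[group][tier] += ms

    # A group with no requests has no tiers, so no division by zero occurs.
    return {
        group: {tier: total / counts[group] for tier, total in tiers.items()}
        for group, tiers in sums.items()
    }


def most_suspect_tier(means):
    """Tier whose mean time grows the most between normal and slow requests."""
    tiers = set(means["slow"]) | set(means["normal"])
    return max(
        tiers,
        key=lambda t: means["slow"].get(t, 0.0) - means["normal"].get(t, 0.0),
    )


# Invented sample: request r2 is slow, and nearly all of its time is in the database.
sample_spans = [
    {"request_id": "r1", "tier": "app", "duration_ms": 40},
    {"request_id": "r1", "tier": "db", "duration_ms": 30},
    {"request_id": "r1", "tier": "search", "duration_ms": 20},
    {"request_id": "r2", "tier": "app", "duration_ms": 45},
    {"request_id": "r2", "tier": "db", "duration_ms": 4000},
    {"request_id": "r2", "tier": "search", "duration_ms": 25},
    {"request_id": "r3", "tier": "app", "duration_ms": 50},
    {"request_id": "r3", "tier": "db", "duration_ms": 35},
    {"request_id": "r3", "tier": "search", "duration_ms": 15},
]

means = tier_breakdown(sample_spans, slow_threshold_ms=1000)
print(means)
print("most suspect tier:", most_suspect_tier(means))
```

The tier that grows most between the two groups is where to look next, for example at slow queries, query plans, and locks if it is the database.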

Part 3 - Hypotheses, Experiments, and Fixes

Explain how you form hypotheses, test them, and prioritize fixes.

What This Part Should Cover

Rank hypotheses by impact, likelihood, and testability.
Use safe experiments such as replaying queries, disabling a job, adding an index in staging, warming caches, or routing traffic.
Separate mitigation from root-cause fix.
Prioritize customer-impact reduction, reversibility, and long-term prevention.
Add monitoring and regression tests after the fix.
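
Ranking hypotheses by impact, likelihood, and testability can be made explicit with a simple scoring table. The sketch below is one possible convention, not a standard method: each factor is scored 1 to 5 and the three scores are multiplied. The hypotheses and their scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int       # 1-5: how much customer pain it would explain
    likelihood: int   # 1-5: how well it fits the evidence so far
    testability: int  # 1-5: how cheaply and safely it can be tested

    @property
    def score(self):
        # Assumed convention: multiply so that a weak factor drags the score down.
        return self.impact * self.likelihood * self.testability


def rank(hypotheses):
    """Return hypotheses ordered from highest to lowest score."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Invented candidates consistent with the prompt (no code change in five months).
candidates = [
    Hypothesis("Search shard imbalance", impact=3, likelihood=2, testability=3),
    Hypothesis("Query plan regression after data growth", impact=5, likelihood=4, testability=4),
    Hypothesis("Nightly export job contends for DB connections", impact=4, likelihood=3, testability=5),
]

for h in rank(candidates):
    print(f"{h.score:>3}  {h.name}")
```

The scores matter less than the ordering and the discussion they force: the top item should be the one to test first with a safe, reversible experiment.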

What a Strong Answer Covers

A strong answer scopes the blast radius first, localizes the problem across tiers, uses data to test hypotheses, communicates clearly with stakeholders, and ships mitigations plus prevention rather than guessing from symptoms.

Follow-up Questions

What if only one enterprise account is affected?
What if p99 latency spikes but p50 is stable?
What if no application code changed but database query plans changed?
How would you communicate with the client during investigation?
What monitoring would you add afterward?
