Incident Triage Prompt: Customer-Reported Cloud Connection Issue
You are the on-call Product Manager partnering with Support, SRE, and Engineering during an onsite incident simulation.
Walk through how you would triage a customer-reported cloud connection issue. Specifically cover:
- Initial information you need to collect and why.
- Key hypotheses you would form and test to isolate the problem.
- Diagnostic tools, metrics, and logs you would inspect.
- How you would communicate status and next steps to internal and external stakeholders.
Assume the issue is time-sensitive and may involve multiple customers, regions, or deployment environments.
Constraints & Assumptions
- Treat this as an incident-management and product-judgment problem.
- Stabilize customer impact first, then drive root-cause analysis.
- Scope impact by customer, region, endpoint, network path, auth context, deployment, and dependency.
- Communicate early on a predictable cadence, even before root cause is known.
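To make the scoping constraint concrete, here is a minimal sketch of slicing failures by one dimension at a time. It assumes request records are available as dictionaries with an `ok` flag; the record shape, the function name `failure_rate_by`, and the sample data are all hypothetical and stand in for whatever your logging or tracing system exports.

```python
from collections import defaultdict


def failure_rate_by(records, dimension):
    """Return {dimension value: failure rate}, based on each record's `ok` flag."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        key = record.get(dimension, "unknown")
        totals[key] += 1
        if not record["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hypothetical sample data, not real telemetry.
sample = [
    {"customer": "acme", "region": "us-east", "ok": False},
    {"customer": "acme", "region": "us-east", "ok": False},
    {"customer": "acme", "region": "eu-west", "ok": True},
    {"customer": "globex", "region": "us-east", "ok": True},
]

for dimension in ("customer", "region"):
    print(dimension, failure_rate_by(sample, dimension))
```

Comparing the slices side by side helps separate "one customer is broken" from "one region is broken": in the sample, failures cluster on a single customer within a single region, which points away from a region-wide outage.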
Clarifying Questions to Ask
- Is the issue a complete outage, intermittent timeout, high latency, auth failure, or degraded performance?
- Which customers, regions, environments, endpoints, and protocols are affected?
- When did it start, and were there recent deploys, config changes, certificate changes, or network changes?
- Are there request IDs, logs, error messages, status codes, or reproducible calls?
- Is there an available workaround or failover path?
Part 1 - Intake and Scoping
Describe the initial information to collect and why.
What This Part Should Cover
- Customer impact, severity, affected accounts, workflows, and business risk.
- Timestamps, regions, environments, endpoints, protocols, and connectivity mode.
- Error messages, request IDs, auth context, client version, network path, and recent changes.
- Severity classification and incident owner.
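One way to keep intake disciplined is to capture the items above in a single record and flag what is still missing. The sketch below is illustrative only: the `IntakeRecord` fields, the `classify_severity` rules, and the SEV1/SEV2/SEV3 labels are assumptions made for this example, not a standard. Real severity definitions should come from your own incident policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class IntakeRecord:
    """Facts gathered at intake; field names are illustrative."""
    reported_at: datetime
    symptom: str  # e.g. "outage", "timeout", "latency", "auth_failure"
    affected_accounts: List[str] = field(default_factory=list)
    regions: List[str] = field(default_factory=list)
    endpoints: List[str] = field(default_factory=list)
    request_ids: List[str] = field(default_factory=list)
    recent_changes: List[str] = field(default_factory=list)
    workaround_available: bool = False
    incident_owner: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of intake items that still need to be collected."""
        required = ["affected_accounts", "regions", "endpoints", "request_ids"]
        missing = [name for name in required if not getattr(self, name)]
        if self.incident_owner is None:
            missing.append("incident_owner")
        return missing


def classify_severity(record: IntakeRecord) -> str:
    """Assumed rule of thumb: breadth and lack of a workaround raise severity."""
    widespread = len(record.affected_accounts) > 1 or len(record.regions) > 1
    if record.symptom == "outage" and widespread:
        return "SEV1"
    if record.symptom == "outage" or (widespread and not record.workaround_available):
        return "SEV2"
    return "SEV3"


# Hypothetical report used only to exercise the sketch.
record = IntakeRecord(
    reported_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    symptom="outage",
    affected_accounts=["acme", "globex"],
    regions=["us-east"],
)
print(classify_severity(record))
print(record.missing_fields())
```

The `missing_fields` list doubles as the next set of questions for Support to ask the customer, and an empty `incident_owner` is a prompt to assign one before diagnosis begins.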
Part 2 - Hypotheses and Diagnostics
List hypotheses and diagnostic tools, metrics, and logs to inspect.
What This Part Should Cover
- Client/network issues, DNS, TLS, firewall/proxy, VPN/PrivateLink, auth/permissions, rate limits, service errors, dependency outages, deploy/config regression, and regional incidents.
- Metrics such as error rate, latency, connection failures, saturation, request volume, and success by region.
- Logs, traces, request IDs, load balancer metrics, network telemetry, deployment timeline, and status pages.
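The network-layer hypotheses above can be tested in order, stopping at the first layer that fails. The following is a minimal sketch using only Python's standard `socket` and `ssl` modules; the helper names `first_failing_layer` and `build_checks` are hypothetical, and the probe covers only DNS, TCP, and TLS. Auth, rate limits, and service errors would need application-level checks that are not shown here.

```python
import socket
import ssl


def first_failing_layer(checks):
    """Run (name, check) pairs in order; return (name, error) for the first failure, or None."""
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # broad on purpose: any failure ends the walk
            return name, f"{type(exc).__name__}: {exc}"
    return None


def build_checks(host, port=443, timeout=5.0):
    """Build DNS -> TCP -> TLS checks for one endpoint."""

    def dns():
        # Raises socket.gaierror if the name does not resolve.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)

    def tcp():
        # Raises on refused connections or timeouts (firewall, routing, listener down).
        with socket.create_connection((host, port), timeout=timeout):
            pass

    def tls():
        # The handshake and certificate validation happen inside wrap_socket.
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                pass

    return [("dns", dns), ("tcp", tcp), ("tls", tls)]
```

A call such as `first_failing_layer(build_checks("api.example.com"))` (a placeholder hostname) returns `None` when all three layers succeed. Running it from the customer's network path and from inside your own network, then comparing results, helps separate a client-side or path problem from a service-side one.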
Part 3 - Communication and Follow-Up
Explain how you communicate status and next steps internally and externally.
What This Part Should Cover
- Incident channel, roles, cadence, customer updates, support macros, executive summary, and escalation.
- Workaround communication.
- Decision log and timeline.
- Post-incident review, root cause, corrective actions, and prevention.
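A predictable cadence is easier to keep when every update has the same shape and commits to the time of the next one. The sketch below shows one possible template; the function name `format_status_update`, the field wording, and the 30-minute default cadence are assumptions for illustration, not a prescribed format.

```python
from datetime import datetime, timedelta, timezone


def format_status_update(summary, impact, actions, workaround, sent_at, cadence_minutes=30):
    """Render a fixed-shape status update that always states when the next one is due."""
    next_update = sent_at + timedelta(minutes=cadence_minutes)
    lines = [
        f"Status as of {sent_at:%Y-%m-%d %H:%M} UTC",
        f"What we know: {summary}",
        f"Customer impact: {impact}",
        f"What we are doing: {actions}",
        f"Workaround: {workaround or 'None identified yet'}",
        f"Next update by: {next_update:%H:%M} UTC",
    ]
    return "\n".join(lines)


# Hypothetical content; root cause is deliberately not claimed.
update = format_status_update(
    summary="Some customers report connection timeouts; cause under investigation.",
    impact="Intermittent failures connecting to the service.",
    actions="Engineering is reviewing recent changes and network telemetry.",
    workaround=None,
    sent_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(update)
```

The same fields can feed the internal incident channel, support macros, and the external status page, with wording adjusted per audience. Keeping a "Workaround" line even when none exists makes it obvious when that changes.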
What a Strong Answer Covers
A strong answer scopes impact fast, tests layered hypotheses, drives mitigation, keeps stakeholders informed, and turns the incident into follow-up actions that make recurrence less likely.
Follow-up Questions
- What would you do if only one enterprise customer is affected?
- What if Support reports timeouts but SRE dashboards look normal?
- How would you decide whether to page another team?
- What do you say externally before root cause is known?
- What belongs in the postmortem?