Troubleshoot a production incident end-to-end

This is a System Design interview question from Instacart for Software Engineer roles.

Question

Incident Troubleshooting: Intermittent Failures and Elevated Latency in Transfers

Context

A microservices-based banking platform begins experiencing intermittent failures and elevated latency for transfer operations, starting at a specific time. Transfer requests flow through an API gateway to one or more services that interact with a database and at least one external payment partner. Assume you have standard observability and deployment tooling: dashboards for metrics, logs, and traces; feature flags; canary and rollback; cloud infrastructure; and message queues.

Task

Describe your end-to-end troubleshooting approach:

Which dashboards, metrics, logs, and traces you would inspect first, and why.
How you would form and test hypotheses to pinpoint the cause.
How you would isolate whether the issue is in the client, network/edge, service/application, database, message queue, or an external dependency.
What short-term, safe mitigations you would apply to limit user impact while investigating.
How you would verify the fix and execute postmortem actions to prevent recurrence.
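
One way to make the isolation step concrete is to compare each component's latency and error rate before and after the incident start time, using trace spans. The following is a minimal sketch under stated assumptions, not a definitive implementation: the `Span` record, the component names, and the `rank_regressions` helper are all hypothetical, and `duration_ms` is assumed to be self time (excluding time spent in child spans) so that a slow downstream call does not also show up as a regression in its callers.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Span:
    component: str      # hypothetical labels, e.g. "gateway", "db", "partner"
    start_ts: float     # seconds since epoch
    duration_ms: float  # assumed to be self time, excluding child spans
    error: bool


def p95(values):
    """95th percentile; statistics.quantiles needs at least two points."""
    if len(values) < 2:
        return values[0] if values else 0.0
    return quantiles(values, n=20)[18]


def summarize(spans):
    """Return {component: (p95_ms, error_rate)} for the given spans."""
    by_component = {}
    for span in spans:
        by_component.setdefault(span.component, []).append(span)
    return {
        comp: (
            p95([s.duration_ms for s in group]),
            sum(s.error for s in group) / len(group),
        )
        for comp, group in by_component.items()
    }


def rank_regressions(spans, incident_start):
    """Rank components by error-rate increase, then by p95 latency increase."""
    before = summarize([s for s in spans if s.start_ts < incident_start])
    after = summarize([s for s in spans if s.start_ts >= incident_start])
    rows = []
    for comp, (p95_after, err_after) in after.items():
        p95_before, err_before = before.get(comp, (0.0, 0.0))
        rows.append((comp, p95_after - p95_before, err_after - err_before))
    return sorted(rows, key=lambda r: (r[2], r[1]), reverse=True)
```

The component at the top of the ranking is only a starting hypothesis. It still needs to be confirmed against logs, deploy history, and the dependency's own status before any mitigation is chosen.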
