Incident Response and Resilience: Payment Integration Service Outage
Context
You own an internal payment-integration service that synchronously calls a downstream financial institution. An incident is in progress:
- CPU and memory are high on your service; most auto-scaled nodes fail to stabilize (they come up, then crash).
- Upstream traffic is 10× normal.
- Downstream calls mostly return client timeouts with no additional details.
- No deployments in the last 24 hours.
Assume a typical microservices setup (container orchestration, metrics/logging/tracing available) and that requests are idempotent for safe retries where noted.
Task
Walk through the following:
- Immediate mitigation to restore service and contain the blast radius (consider rate limiting, traffic shedding, circuit breakers, capacity isolation, feature flags); a load-shedding sketch follows this list.
- A systematic root-cause investigation plan across infrastructure, application, dependencies, and configuration (include what telemetry you’d inspect and why); a latency-breakdown probe sketch also follows below.
- Short- and long-term follow-ups to prevent recurrence (alerting/SLOs, autoscaling policies, backpressure, retry budgets, connection pooling, GC/thread tuning, runbooks, load testing, capacity planning); a retry-budget and connection-pooling sketch follows as well.