
Handle a payment-service incident with resource spikes

Last updated: Mar 29, 2026

Quick Overview

This question evaluates incident response, operational resilience, capacity and traffic management, root-cause analysis, and leadership in high-pressure outage scenarios.


Handle a payment-service incident with resource spikes

Company: DoorDash

Role: Software Engineer

Category: Behavioral & Leadership

Difficulty: hard

Interview Round: Technical Screen

You own an internal payment-integration service that synchronously calls a downstream financial institution. An incident is in progress: CPU and memory usage are high on your service, and most auto-scaled nodes fail to stabilize (they come up, then crash). Upstream traffic is 10× normal. Downstream calls return client timeouts with no additional details. No deployments occurred in the last 24 hours. Assume a typical microservices setup (container orchestration, metrics/logging/tracing available) and that requests are idempotent for safe retries where noted. Walk through:

  1. Immediate mitigation to restore service and contain blast radius (e.g., rate limiting, traffic shedding, circuit breakers, capacity isolation, feature flags).
  2. A systematic root-cause investigation plan across infrastructure, application, dependencies, and configuration (include what telemetry you’d inspect and why).
  3. Short- and long-term follow-ups to prevent recurrence (alerting/SLOs, autoscaling policies, backpressure, retry budgets, connection pooling, GC/thread tuning, runbooks, load testing, and capacity planning).


Solution

## 0) Stabilize the Incident Process (first 2–5 minutes)

- Declare the incident and assign roles: Incident Commander, Communications, Ops/Engineering owners.
- Freeze deploys and config changes; start a shared channel and timeline.
- Set goals: stop cascading failures, preserve critical payment flows, reduce error rate and tail latency.

## 1) Immediate Mitigation (containment and rapid stabilization)

Priorities: reduce load, fail fast to the unhealthy dependency, and stabilize autoscaling.

1. Apply backpressure and rate limits at the edge
   - Token-bucket rate limiting: cap ingress to a safe value based on current healthy capacity (see the sketch at the end of this section).
   - Example: baseline is 1k RPS; a 10× spike means 10k RPS. Set a hard limit of 2k RPS with a short burst allowance (e.g., burst=5k for 30s) while adding capacity.
   - Prioritize critical traffic (e.g., auth/capture) over non-critical (e.g., reconciliation). Use header/endpoint-based priority queues.
   - Return 429 for excess traffic with Retry-After to guide upstream.
2. Shed non-critical work and features
   - Disable/feature-flag expensive optional paths (heavy fraud enrichment, verbose logging, synchronous audit writes). Replace with cached or stale data.
   - Temporarily disable secondary endpoints (status polling, reports) that compete for resources.
3. Fail fast on the downstream and enable circuit breakers
   - Tighten the downstream timeout to a sane bound (e.g., p95 + headroom). If the current timeout is 3s but the downstream is timing out, try 1–1.5s to prevent thread/socket pile-ups.
   - Open the circuit breaker on persistent failures to stop hammering the financial institution. Fallback: enqueue requests to a durable queue for later processing when possible.
   - Cap concurrent downstream calls (bulkhead): e.g., max 500 in-flight per node; overflow gets queued or a 503.
4. Stabilize autoscaling and resource exhaustion
   - Pause aggressive scale-in/scale-out oscillation: set a higher minimum replica count, increase cooldowns, use step scaling.
   - Scale on concurrency/queue depth rather than CPU alone. Add a small buffer of pre-warmed instances.
   - Increase per-node limits if appropriate (e.g., memory limit headroom) to avoid OOM kills; reduce liveness-probe aggressiveness to avoid restart storms.
   - Ensure connection reuse (keepalive/HTTP2) to avoid ephemeral port/NAT exhaustion.
5. Isolate capacity and tenants
   - Apply bulkheads per tenant/region to prevent one high-volume client from starving others.
   - If multi-region, drain traffic to the healthiest region while keeping a safe per-region cap.
6. Communication and customer impact management
   - Update the status page with clear user-facing guidance (some requests will be delayed; retry after X seconds). Coordinate with upstream to reduce retries and respect 429.

Guardrails while mitigating: watch p95/p99 latency, error rate, queue depth, JVM/GC or runtime stalls, and restart rate. If error rate stays above a threshold (e.g., 5%) for two intervals, tighten limits; if p95 recovers, raise limits gradually in 10–20% steps.
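To make item 1 tangible, here is a minimal token-bucket sketch in Python. It is framework-agnostic; the rate/burst numbers mirror the mitigation example above, and the `handle` wiring, the Retry-After value, and the idea of a separate bucket for critical flows are illustrative assumptions, not a prescribed gateway API.

```python
import time
import threading

class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/sec up to `burst` capacity."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate            # sustained tokens per second (e.g., 2_000 during mitigation)
        self.burst = burst          # maximum bucket size (short burst allowance)
        self.tokens = burst         # start full so steady traffic is unaffected
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be shed with a 429."""
        with self.lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at burst capacity.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# Hypothetical wiring into a request handler: shed excess load with 429 + Retry-After.
limiter = TokenBucket(rate=2_000, burst=5_000)   # numbers from the mitigation example above

def handle():
    # Critical flows (auth/capture) could be given their own, larger bucket; this is one option.
    if not limiter.allow():
        return 429, {"Retry-After": "2"}, "shed: over emergency ingress cap"
    return 200, {}, "processed"
```

In practice the same limiter shape works per tenant or per endpoint class, which is how the priority-queue idea above would be expressed.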
## 2) Systematic Root-Cause Investigation Plan

Work from infrastructure up to application and dependencies; correlate time series with the event timeline.

A. Infrastructure and platform
- Compute/container health: CPU steal, throttling, memory usage, OOM-killer logs, container restarts, node pressure (K8s events), disk I/O, network errors, saturation.
- Autoscaling events: scale-in/out timestamps, cooldowns, failed scale-ups, pod pending reasons (insufficient IPs, resource quotas).
- Network and LB: LB 4xx/5xx, connection resets, SYN backlog, TLS handshake errors, NAT/ephemeral port exhaustion, eBPF/firewall drops.
- Storage/queues: backlog length, enqueue/dequeue rates, DLQ growth.
- Why: high CPU/memory plus crash-looping suggests OOMs, GC thrash, or thread/socket starvation. Network saturation or port exhaustion can masquerade as downstream timeouts.

B. Application/runtime
- Concurrency and thread pools: in-flight requests, blocked threads, queue sizes, pool exhaustion.
- GC and memory: heap usage, allocation rate, GC pause time, survivor/old-gen pressure (JVM) or the equivalent GC pause metrics (Go/Python). Look for leaks or large buffers during timeouts.
- Connection pools: pool size, wait time, max connections, connection churn, keepalive settings, HTTP/2 max concurrent streams.
- Retries and timeouts: retry count per request, backoff/jitter, total timeout budget across hops.
- Logs and traces: long spans around the downstream call; where time is spent (DNS, TLS, connect, TTFB, read). Check for head-of-line blocking.
- Why: timeout storms often create retry storms, thread/connection starvation, and memory blow-ups from queued work.

C. Dependencies (downstream financial institution and other services)
- Downstream health: their status, latency percentiles, rate limits, recent changes, certificate/credential issues, IP allowlists.
- DNS/TLS: resolution latency, SERVFAIL spikes, certificate expiry/rotation, mTLS misconfiguration.
- Schema/contract: any new fields, payload bloat, or compression changes increasing CPU.
- Why: if the downstream is degraded or rate-restricting, we must cap our concurrency and adapt timeouts; DNS/TLS issues create widespread timeouts.

D. Configuration, data, and traffic patterns
- Traffic source analysis: which upstream/tenant surged? Is it legitimate (e.g., a promo) or a retry loop? Compare unique request IDs vs. duplicates.
- Config drift: feature flags, cron jobs, batch jobs, limits changed outside deploys (e.g., secrets rotation, WAF/rate-limit rules, ASG policies).
- Payload sizes: a sudden increase in request/response size leading to CPU/memory pressure.
- Why: many incidents are caused by config/ops changes unrelated to code deploys.

E. Triage hypotheses to test quickly
- Retry storm: upstream ignoring 429 and retrying aggressively → confirm via logs (same ID repeated), high duplicate ratio (see the log-analysis sketch at the end of this section).
- Connection pool starvation: max connections too low → high pool wait time, many sockets in TIME_WAIT.
- NAT/ephemeral port exhaustion: many short-lived connections → connect timeouts, high FIN_WAIT/TIME_WAIT.
- GC thrash/memory leak: heap near its limit, long GC pauses prior to OOMs/restarts.
- Downstream rate limit or outage: their 5xx/timeout rate is up, and our circuit is not tripping early.

Instrumentation to inspect
- Metrics: RPS, success/error by endpoint/tenant, p50/p95/p99 latency, saturation (CPU, memory, threads), queue depth, retry counts, downstream call latency breakdown (DNS/connect/TLS/TTFB/read), connection pool metrics, GC.
- Logs: structured request logs with correlation IDs, errors with causes/timeouts, OOM-killer messages, probe failures.
- Traces: critical-path spans; identify where time is spent; look for N+1 patterns.
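A fast way to confirm or rule out the retry-storm hypothesis is to measure what fraction of requests in a recent log window share a correlation ID. A minimal sketch, assuming JSON-structured request logs with a `correlation_id` field; the log path, field name, and the 30% flag threshold are illustrative assumptions.

```python
import json
from collections import Counter

def duplicate_ratio(log_path: str, id_field: str = "correlation_id") -> float:
    """Fraction of requests that share a correlation ID with an earlier request.

    A healthy window is typically near 0; a retry storm pushes this toward 1.
    """
    counts = Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (e.g., stack traces)
            if id_field in record:
                counts[record[id_field]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every repeat beyond the first occurrence
    return duplicates / total

if __name__ == "__main__":
    # Hypothetical log location; in practice this would be a query against the log store.
    ratio = duplicate_ratio("/var/log/payment-service/requests.json")
    print(f"duplicate ratio: {ratio:.2%}",
          "-> possible retry storm" if ratio > 0.3 else "")
```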
## 3) Preventing Recurrence

Split into short-term (days) and long-term (weeks/months).

Short-term hardening
- Backpressure and limits
  - Enforce global and per-tenant rate limits with a token-bucket/leaky-bucket at the edge.
  - Concurrency limits and queues per dependency (bulkheads). Example: cap the downstream at 3× steady-state concurrency; overflow → fast 503 + Retry-After.
- Retry and timeout budgets
  - Define a timeout budget per hop: if the end-to-end SLO is 2s, allocate 1.2s to the downstream, 300ms to upstream processing, and 500ms of headroom.
  - Tie the retry budget to the error budget: if the SLO allows 1% errors, allow at most 1 retry with exponential backoff + jitter, and only on idempotent, safe-to-retry errors.
- Circuit breakers and fallbacks
  - Open automatically on high failure rate/latency; provide an async fallback (enqueue) for operations that can be deferred.
- Connection and thread pool tuning
  - Right-size pools to avoid both under- and over-saturation; enable keepalive/HTTP2; set max concurrent streams; watch pool wait time.
- Autoscaling guardrails
  - Scale on concurrency/queue depth; increase minimum replicas; disable scale-in during incidents; pre-warm instances to avoid cold starts.
- Probes and graceful handling
  - Use readiness/startup probes to keep unhealthy pods out of rotation; increase liveness timeouts to avoid flapping; implement graceful shutdown and in-flight draining.
- Runbooks and feature flags
  - Create runbooks for rate-limit changes, circuit-breaker toggles, and traffic drains. Ensure kill switches exist for expensive features.

Long-term resilience and capacity
- SLOs, alerting, and observability
  - Define SLOs for availability and latency (e.g., 99.9% under 1s). Alert on error-budget burn rates, not raw CPU.
  - Add golden signals per dependency, including downstream latency breakdown, pool wait, retry rate, and queue depth.
- Load and failure testing
  - Regularly load-test to 5–10× baseline; include dependency timeouts to exercise backpressure and circuit breakers.
  - Chaos experiments: inject downstream slowness and timeouts; verify that fail-fast and shedding work.
- Architectural decoupling
  - Move heavy/deferrable operations to async queues; ensure idempotency keys and exactly-once or at-least-once semantics with de-duplication.
  - Implement a write-ahead log/outbox pattern for reliability.
- Capacity planning
  - Periodic forecasts; maintain a safety margin (e.g., 2× headroom during peak events). Pre-warm for known spikes.
- Dependency contracts and coordination
  - Formalize rate limits and quotas with the downstream; implement adaptive concurrency based on observed latency (e.g., AIMD).
  - Shared runbooks and test environments with realistic limits; set up mutual circuit-breaker visibility.
- Security and networking hygiene
  - Monitor TLS/DNS health and certificate rotations; implement connection reuse to avoid NAT exhaustion; tune kernel parameters where needed.
- Performance and memory
  - Profile hotspots; reduce payload sizes via compression within CPU budgets; tune GC (heap sizes, pause goals) or optimize allocations.

Concrete examples and guardrails
- Token-bucket sizing: if steady RPS is 1k and the safe-to-handle burst is 3k for 60s, set rate=1.5k tokens/s, burst=3k; monitor p95 latency; if p95 exceeds the SLO for 2 minutes, reduce the rate by 10%.
- Concurrency cap: if downstream p95 rises above 800ms at >800 concurrent, cap at 700 concurrent; observe tail latency and errors, then adjust with AIMD (add 50 when healthy, cut 30% on breach; see the sketch after this list).
- Retry policy: only 1 retry with backoff 200–800ms, jitter 20%, retry on 502/503/504/timeouts; never retry on 4xx (except 409/429 with backoff respecting Retry-After).
- Timeout budget: e2e 2000ms → upstream processing 300ms, downstream connect+TLS 150ms, downstream read 900ms, 650ms buffer for queuing and variability.
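The AIMD concurrency-cap guardrail above can be expressed as a small limiter wrapped around the downstream client. A minimal sketch; the starting cap, step sizes, and 800ms latency threshold mirror the example and are assumptions to tune, not production-ready values.

```python
import threading

class AIMDConcurrencyLimiter:
    """Adaptive concurrency cap: additive increase when healthy, multiplicative decrease on breach."""

    def __init__(self, initial_limit: int = 700, increase_step: int = 50,
                 decrease_factor: float = 0.7, floor: int = 50, ceiling: int = 2000):
        self.limit = initial_limit          # current cap on in-flight downstream calls
        self.increase_step = increase_step  # +50 per healthy interval (example above)
        self.decrease_factor = decrease_factor  # cut ~30% on an SLO breach
        self.floor = floor
        self.ceiling = ceiling
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit a downstream call only if we are under the current cap."""
        with self.lock:
            if self.in_flight >= self.limit:
                return False  # caller sheds or queues the request
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1

    def on_interval(self, p95_latency_ms: float, slo_ms: float = 800) -> None:
        """Called once per evaluation interval with the observed downstream p95 latency."""
        with self.lock:
            if p95_latency_ms > slo_ms:
                self.limit = max(self.floor, int(self.limit * self.decrease_factor))
            else:
                self.limit = min(self.ceiling, self.limit + self.increase_step)
```

The same shape can back a bulkhead per tenant; `on_interval` would typically be driven by the metrics pipeline rather than in-process timing.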
Common pitfalls to avoid
- Increasing timeouts blindly (worsens resource contention).
- Unbounded retries (amplifies load 10× during outages).
- CPU-only autoscaling (ignores saturation from I/O and the downstream).
- Overly aggressive liveness probes causing restart storms.
- A single shared connection pool for all tenants (no isolation).

Validation after fixes
- Run a post-incident load test approximating the 10× spike with the downstream throttled; verify shedding, circuit-breaker behavior, and recovery (see the load-generator sketch at the end of this answer).
- Audit dashboards and alerts; conduct a blameless postmortem with clear action items, owners, and due dates.

This approach restores service quickly (fail-fast + shedding + stabilization), diagnoses the layered causes (infrastructure, runtime, dependency, configuration), and builds durable resilience (backpressure, budgets, scaling, and observability) to prevent recurrence.
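To support the validation step, a minimal spike load-generator sketch using only the Python standard library. The endpoint URL, request count, and concurrency level are illustrative assumptions, and it should only ever be pointed at a staging replica with the downstream stubbed or throttled.

```python
import statistics
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

TARGET = "https://payments.staging.example/authorize"  # hypothetical staging endpoint
REQUESTS = 2_000        # total requests in the simulated spike
CONCURRENCY = 200       # rough stand-in for a 10x surge against a staging replica

def one_request() -> tuple[int, float]:
    """Fire a single request and return (status_code, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET, timeout=2) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code                   # 429/503 here means shedding is working
    except (urllib.error.URLError, TimeoutError):
        status = 0                        # connection failure or client-side timeout
    return status, time.monotonic() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(lambda _: one_request(), range(REQUESTS)))
    statuses = Counter(status for status, _ in results)
    latencies = sorted(latency for _, latency in results)
    p95 = latencies[int(len(latencies) * 0.95)]
    print("status distribution:", dict(statuses))
    print(f"p50={statistics.median(latencies):.3f}s  p95={p95:.3f}s")
```

A healthy outcome under the spike is a mix of 200s and fast 429/503s with p95 within the SLO, rather than slow timeouts.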
