You own an internal payment-integration service whose downstream is a financial institution. An incident shows high CPU and memory usage on your service. Most auto-scaling nodes fail to stabilize—new nodes come up and then crash. Upstream traffic is 10× normal. Downstream calls return client timeouts with no additional details. No deployments occurred in the last 24 hours. Walk through:
(1) immediate mitigation to restore service and contain blast radius (e.g., rate limiting, traffic shedding, circuit breakers, capacity isolation, feature flags),
(2) a systematic root-cause investigation plan across infrastructure, application, dependencies, and configuration (include what telemetry you’d inspect and why), and
(3) short- and long-term follow-ups to prevent recurrence (alerting/SLOs, autoscaling policies, backpressure, retry budgets, connection pooling, GC/thread tuning, runbooks, load testing, and capacity planning).
Quick Answer: This question evaluates incident response, operational resilience, capacity and traffic management, root-cause analysis, and leadership in high-pressure outage scenarios.
Solution
## 0) Stabilize the Incident Process (first 2–5 minutes)
- Declare incident, assign roles: Incident Commander, Communications, Ops/Engineering owners.
- Freeze deploys and config changes; start a shared channel and timeline.
- Set goals: stop cascading failures, preserve critical payment flows, reduce error rate and tail latency.
## 1) Immediate Mitigation (containment and rapid stabilization)
Priorities: reduce load, fail fast on the unhealthy dependency, and stabilize autoscaling.
1. Apply backpressure and rate limits at the edge
- Token bucket rate limiting: cap ingress to a safe value based on current healthy capacity.
- Example: baseline is 1k RPS; a 10× spike pushes it to 10k RPS. Set a hard limit of 2k RPS with a short burst allowance (e.g., burst=5k for 30s) while adding more capacity; a minimal token-bucket sketch follows this list.
- Prioritize critical traffic (e.g., auth/capture) over non-critical (e.g., reconciliation). Use header/endpoint-based priority queues.
- Return 429 for excess traffic with Retry-After to guide upstream.
2. Shed non-critical work and features
- Disable or feature-flag expensive optional paths (heavy fraud enrichment, verbose logging, synchronous audit writes); serve cached or stale data where acceptable.
- Temporarily disable secondary endpoints (status polling, reports) that compete for resources.
3. Fail fast on the downstream and enable circuit breakers
- Tighten downstream timeout to a sane bound (e.g., p95 + headroom). If current timeout is 3s but downstream is timing out, try 1–1.5s to prevent thread/socket pile-ups.
- Open circuit breaker on persistent failures to stop hammering the financial institution. Fallback: enqueue request to a durable queue for later processing when possible.
- Cap concurrent downstream calls (bulkhead): e.g., max 500 in-flight per node; overflow gets queued or 503.
4. Stabilize autoscaling and resource exhaustion
- Damp scale-in/scale-out oscillation: set a higher minimum replica count, increase cooldowns, and use step scaling.
- Scale on concurrency/queue depth rather than CPU alone. Add a small buffer of pre-warmed instances.
- Increase per-node limits if appropriate (e.g., memory limit headroom) to avoid OOM kills; reduce liveness probe aggressiveness to avoid restart storms.
- Ensure connection reuse (keepalive/HTTP2) to avoid ephemeral port/NAT exhaustion.
5. Isolate capacity and tenants
- Apply bulkheads per tenant/region to prevent one high-volume client from starving others.
- If multi-region, drain traffic to the healthiest region while keeping a safe per-region cap.
6. Communication and customer impact management
- Update status with clear user-facing guidance (some requests will be delayed; retry after X seconds). Coordinate with upstream to reduce retries and respect 429.
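A minimal token-bucket admission sketch for the edge limiter in item 1, in Python. The bucket sizes, priority classes, and the `admit` helper are illustrative assumptions, not the service's real configuration.

```python
# Token-bucket admission check (illustrative sketch; rates and names are placeholders).
import time
import threading

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec           # sustained refill rate (tokens/sec)
        self.capacity = burst              # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# Separate buckets give critical flows (auth/capture) priority over non-critical ones.
critical_bucket = TokenBucket(rate_per_sec=1500, burst=3000)
best_effort_bucket = TokenBucket(rate_per_sec=500, burst=1000)

def admit(request_priority: str) -> tuple[int, dict]:
    bucket = critical_bucket if request_priority == "critical" else best_effort_bucket
    if bucket.allow():
        return 200, {}
    # Shed excess load explicitly and tell upstream when to retry.
    return 429, {"Retry-After": "5"}
```

In practice this check sits in the ingress middleware or API gateway; the point is that shed traffic gets an explicit 429 plus Retry-After rather than queuing until it times out.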
Guardrails while mitigating
- Watch p95/p99 latency, error rate, queue depth, JVM/GC or runtime stalls, restart rate. If error rate > threshold (e.g., 5%) persists for 2 intervals, tighten limits. If p95 recovers, gradually raise limits by 10–20% steps.
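The guardrail above can run as a small control loop. This sketch assumes the `TokenBucket` from the previous example and hypothetical `read_error_rate` / `read_p95_ms` metric readers; the intervals, step sizes, and cap are placeholders.

```python
# Control loop: tighten the ingress limit while errors persist, recover gradually.
import time

def adjust_ingress_limit(limiter, read_error_rate, read_p95_ms,
                         slo_p95_ms=1000.0, error_threshold=0.05, interval_s=30):
    bad_intervals = 0
    while True:
        error_rate = read_error_rate()          # hypothetical metrics readers
        p95_ms = read_p95_ms()
        if error_rate > error_threshold:
            bad_intervals += 1
        else:
            bad_intervals = 0
        if bad_intervals >= 2:
            limiter.rate *= 0.8                 # tighten ~20% after two bad intervals
            bad_intervals = 0
        elif p95_ms < slo_p95_ms:
            limiter.rate = min(limiter.rate * 1.1, 2000)  # recover in 10% steps, capped at a placeholder safe limit
        time.sleep(interval_s)
```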
## 2) Systematic Root-Cause Investigation Plan
Work upward from infrastructure to the application, dependencies, and configuration; correlate time series with the event timeline.
A. Infrastructure and platform
- Compute/container health: CPU steal, throttling, memory usage, OOM killer logs, container restarts, node pressure (K8s events), disk IO, network errors, saturation.
- Autoscaling events: scale in/out timestamps, cooldowns, failed scale-ups, pod pending reasons (insufficient IPs, resource quotas).
- Network and LB: LB 4xx/5xx, connection resets, SYN backlog, TLS handshake errors, NAT/ephemeral port exhaustion, eBPF/firewall drops.
- Storage/queues: backlog length, enqueue/dequeue rates, DLQ growth.
Why: High CPU/mem plus crash-looping suggests OOMs, GC thrash, or thread/socket starvation. Network saturation or port exhaustion can masquerade as downstream timeouts.
B. Application/runtime
- Concurrency and thread pools: in-flight requests, blocked threads, queue sizes, pool exhaustion.
- GC and memory: heap usage, allocation rate, GC pause time, survivor/old gen pressure (JVM) or GC pause metrics (Go/Python equivalents). Look for leaks or large buffers during timeouts.
- Connection pools: pool size, wait time, max connections, connection churn, keepalive settings, HTTP/2 max concurrent streams.
- Retries and timeouts: retry count per request, backoff/jitter, total timeout budget across hops.
- Logs and traces: long spans around downstream call; where time is spent (DNS, TLS, connect, TTFB, read). Check for head-of-line blocking.
Why: Timeout storms often create retry storms, thread/connection starvation, and memory blow-ups from queued work.
C. Dependencies (downstream financial institution and other services)
- Downstream health: their status, latency percentiles, rate limits, recent changes, certificate/credential issues, IP allowlists.
- DNS/TLS: resolution latency, SERVFAIL spikes, certificate expiry/rotation, mTLS misconfig.
- Schema/contract: any new fields, payload bloat, compression changes increasing CPU.
Why: If downstream is degraded or restricting, we must cap our concurrency and adapt timeouts; DNS/TLS issues create widespread timeouts.
D. Configuration, data, and traffic patterns
- Traffic source analysis: which upstream/tenant surged? Is it legit (promo) or a retry loop? Compare unique request IDs vs duplicates.
- Config drift: feature flags, cron jobs, batch jobs, limits changed outside deploys (e.g., secrets rotation, WAF/rate-limit rules, ASG policies).
- Payload sizes: sudden increase in request/response size leading to CPU/memory pressure.
Why: Many incidents are caused by config/ops changes unrelated to code deploys.
E. Triage hypotheses to test quickly
- Retry storm: upstream ignoring 429 and retrying aggressively → confirm via logs (same request ID repeated) and a high duplicate ratio; see the sketch after this list.
- Connection pool starvation: max connections too low → high pool wait time, many sockets in TIME_WAIT.
- NAT/ephemeral port exhaustion: many short-lived connections → connect timeouts, high FIN_WAIT/TIME_WAIT.
- GC thrash/memory leak: heap near limit, long GC pauses prior to OOM/restarts.
- Downstream rate-limit or outage: their 5xx/timeout rate up, our circuit not tripping early.
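To test the retry-storm hypothesis quickly, a duplicate-ratio check over structured request logs is often enough. The sketch below assumes JSON-lines logs with a `request_id` field; the format and the example threshold are assumptions.

```python
# Retry-storm check: a high share of duplicate request IDs suggests upstream
# retries rather than organic traffic growth.
import json
from collections import Counter

def duplicate_ratio(log_path: str) -> float:
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                counts[json.loads(line)["request_id"]] += 1
            except (json.JSONDecodeError, KeyError):
                continue                      # skip malformed or unrelated lines
    total = sum(counts.values())
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return duplicates / total if total else 0.0

# e.g., a ratio well above ~0.3 during the spike points at an upstream retry loop.
```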
Instrumentation to inspect
- Metrics: RPS, success/error by endpoint/tenant, p50/p95/p99 latency, saturation (CPU, memory, threads), queue depth, retry counts, downstream call latency breakdown (DNS/connect/TLS/TTFB/read), connection pool metrics, GC.
- Logs: structured request logs with correlation IDs, errors with causes/timeouts, OOM killer messages, probe failures.
- Traces: critical path spans; identify where time is spent; look for N+1 patterns.
## 3) Preventing Recurrence
Split into short-term (days) and long-term (weeks/months).
Short-term hardening
- Backpressure and limits
- Enforce global and per-tenant rate limits with token-bucket/leaky-bucket at the edge.
- Concurrency limits and queues per dependency (bulkheads). Example: cap downstream at 3× steady-state concurrency with overflow → fast 503 + Retry-After.
- Retry and timeout budgets
- Define timeout budget per hop: if end-to-end SLO is 2s, allocate 1.2s to downstream, 300ms to upstream processing, 500ms headroom.
- Retry budget tied to error budget: if SLO allows 1% errors, allow at most 1 retry with exponential backoff + jitter, and only on idempotent, safe-to-retry errors.
- Circuit breakers and fallbacks
- Open automatically on high failure rate or latency; provide an async fallback (enqueue) for operations that can be deferred. A minimal breaker-plus-bulkhead sketch follows this list.
- Connection and thread pool tuning
- Right-size pools to avoid both under- and over-saturation; enable keepalive/HTTP2; set max concurrent streams; watch pool wait time.
- Autoscaling guardrails
- Scale on concurrency/queue depth; increase min replicas; disable scale-in during incidents; pre-warm instances to avoid cold starts.
- Probes and graceful handling
- Use readiness/startup probes to keep unhealthy pods out of rotation; increase liveness timeouts to avoid flap; implement graceful shutdown and in-flight draining.
- Runbooks and feature flags
- Create runbooks for rate limit changes, circuit breaker toggles, and traffic drains. Ensure kill switches exist for expensive features.
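As referenced under circuit breakers above, a minimal breaker-plus-bulkhead wrapper around the downstream call might look like the sketch below. The thresholds, the 1.2s timeout, and the `call_downstream` function are placeholders, not production values.

```python
# Circuit breaker + bulkhead (concurrency cap) around the downstream call.
import time
import threading

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.opened_at = None
        self.half_open = False
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            if self.opened_at is None:
                return True
            if time.monotonic() - self.opened_at >= self.reset_after_s:
                self.half_open = True          # let a probe request through
                return True
            return False

    def record(self, success: bool) -> None:
        with self.lock:
            if success:
                self.failures = 0
                self.opened_at = None
                self.half_open = False
            elif self.half_open or self.failures + 1 >= self.failure_threshold:
                self.opened_at = time.monotonic()   # (re)open the circuit
                self.half_open = False
                self.failures = 0
            else:
                self.failures += 1

bulkhead = threading.Semaphore(500)            # max in-flight downstream calls per node
breaker = CircuitBreaker()

def guarded_call(call_downstream, payload):
    if not breaker.allow():
        raise RuntimeError("circuit open: fail fast or enqueue for later")
    if not bulkhead.acquire(blocking=False):
        raise RuntimeError("bulkhead full: shed with 503 + Retry-After")
    try:
        result = call_downstream(payload, timeout=1.2)   # tightened timeout
        breaker.record(success=True)
        return result
    except Exception:
        breaker.record(success=False)
        raise
    finally:
        bulkhead.release()
```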
Long-term resilience and capacity
- SLOs, alerting, and observability
- Define SLOs for availability and latency (e.g., 99.9% of requests complete within 1s). Alert on error-budget burn rate, not raw CPU.
- Add golden signals per dependency, including downstream latency breakdown, pool wait, retry rate, and queue depth.
- Load and failure testing
- Regularly load-test to 5–10× baseline; include dependency timeouts to test backpressure and circuit breakers.
- Chaos experiments: inject downstream slowness and timeouts; verify fail-fast and shedding work.
- Architectural decoupling
- Move heavy or deferrable operations to async queues; use idempotency keys with at-least-once delivery plus de-duplication to achieve effectively-once processing.
- Implement write-ahead log/outbox pattern for reliability.
- Capacity planning
- Periodic forecasts; maintain safety margin (e.g., 2× headroom during peak events). Pre-warm for known spikes.
- Dependency contracts and coordination
- Formalize rate limits and quotas with the downstream institution; implement adaptive concurrency based on observed latency (e.g., AIMD; a sketch follows this list).
- Shared runbooks and test environments with realistic limits; set up mutual circuit-breaker visibility.
- Security and networking hygiene
- Monitor TLS/DNS health, cert rotations; implement connection reuse to avoid NAT exhaustion; tune kernel params where needed.
- Performance and memory
- Profile hotspots; reduce payload sizes via compression with CPU budgets; tune GC (heap sizes, pause goals) or optimize allocations.
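One way to express the adaptive concurrency idea (AIMD) mentioned above is a per-interval limit adjuster. The numbers mirror the concrete guardrails listed below and are illustrative only.

```python
# AIMD concurrency limit: add a fixed step while the downstream looks healthy,
# cut multiplicatively when latency or error rate breaches the threshold.
class AdaptiveConcurrencyLimit:
    def __init__(self, initial=700, floor=50, ceiling=2000,
                 add_step=50, decrease_factor=0.7):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling
        self.add_step = add_step
        self.decrease_factor = decrease_factor

    def on_interval(self, p95_ms: float, error_rate: float,
                    p95_slo_ms: float = 800.0, error_slo: float = 0.01) -> int:
        if p95_ms > p95_slo_ms or error_rate > error_slo:
            self.limit = max(self.floor, int(self.limit * self.decrease_factor))
        else:
            self.limit = min(self.ceiling, self.limit + self.add_step)
        return self.limit
```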
Concrete examples and guardrails
- Token bucket sizing: if steady RPS is 1k and safe-to-handle burst is 3k for 60s, set rate=1.5k tokens/s, burst=3k; monitor p95 latency; if p95 > SLO for 2 mins, reduce rate by 10%.
- Concurrency cap: if downstream p95 rises above 800ms at >800 concurrent, cap at 700 concurrent; observe tail latency and errors—adjust with AIMD (add 50 when healthy, cut 30% on breach).
- Retry policy: at most 1 retry with 200–800ms backoff and 20% jitter; retry only on 502/503/504 and timeouts; never retry other 4xx (except 409/429, backing off per Retry-After). A sketch follows this list.
- Timeout budget: e2e 2000ms → upstream processing 300ms, downstream connect+TLS 150ms, downstream read 900ms, 650ms buffer for queuing and variability.
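A hedged sketch of the retry policy above: at most one retry, jittered exponential backoff, retryable statuses only, and Retry-After honored on 429. `send_request` is a hypothetical transport returning (status, headers, body); call this only for idempotent operations.

```python
# Bounded retry with jittered backoff; respects Retry-After on 429.
import random
import time

RETRYABLE_STATUSES = {502, 503, 504}

def call_with_retry(send_request, request, max_retries=1,
                    base_backoff_s=0.2, max_backoff_s=0.8, jitter=0.2):
    attempt = 0
    while True:
        try:
            status, headers, body = send_request(request)
        except TimeoutError:
            status, headers, body = None, {}, None      # treat timeouts as retryable
        if status is not None and status not in RETRYABLE_STATUSES and status != 429:
            return status, headers, body                # success or non-retryable response
        if attempt >= max_retries:
            return status, headers, body                # retry budget exhausted
        attempt += 1
        if status == 429 and "Retry-After" in headers:
            delay = float(headers["Retry-After"])       # respect downstream guidance
        else:
            delay = min(max_backoff_s, base_backoff_s * (2 ** (attempt - 1)))
            delay *= 1 + random.uniform(-jitter, jitter)
        time.sleep(delay)
```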
Common pitfalls to avoid
- Increasing timeouts blindly (worsens resource contention).
- Unbounded retries (amplifies load 10× during outages).
- CPU-only autoscaling (ignores saturation from I/O and downstream).
- Overly aggressive liveness probes causing restart storms.
- Single shared connection pool for all tenants (no isolation).
Validation after fixes
- Run a post-incident load test approximating the 10× spike with the downstream throttled; verify shedding, circuit-breaker behavior, and recovery (a load-generator sketch follows this list).
- Audit dashboards and alerts; conduct a blameless postmortem with clear action items, owners, and due dates.
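For the validation load test, a small async generator is usually enough to approximate the spike against a staging endpoint with a throttled downstream stub. The URL, rates, and the use of `aiohttp` are assumptions about the test environment.

```python
# Crude spike generator: fire ~rps requests per second and count the resulting
# status mix (expect 2xx plus deliberate 429/503 shedding, not timeouts).
import asyncio
import collections

import aiohttp   # assumed available in the test environment

async def one_request(session, url, results):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=2)) as resp:
            results[resp.status] += 1
    except (asyncio.TimeoutError, aiohttp.ClientError):
        results["error"] += 1

async def run_spike(url, rps=10_000, duration_s=60):
    results = collections.Counter()
    async with aiohttp.ClientSession() as session:
        for _ in range(duration_s):
            started = asyncio.get_running_loop().time()
            await asyncio.gather(*(one_request(session, url, results) for _ in range(rps)))
            # crude pacing: wait out the remainder of the second, if any
            await asyncio.sleep(max(0.0, 1.0 - (asyncio.get_running_loop().time() - started)))
    print(dict(results))

# asyncio.run(run_spike("https://staging.example.internal/payments"))  # placeholder URL
```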
This approach restores service quickly (fail-fast + shedding + stabilization), diagnoses the layered causes (infra, runtime, dependency, config), and builds durable resilience (backpressure, budgets, scaling, and observability) to prevent recurrence.