Resilient API Aggregation And Operational Debugging
Asked of: Software Engineer
What's being tested
DoorDash is probing whether you can build and debug resilient service-to-service aggregation under real production constraints: partial failures, latency budgets, retries, concurrency limits, caches, and routing behavior. A strong Software Engineer answer shows you can turn multiple downstream calls into one reliable API response without creating retry storms or hiding failures. Interviewers are also looking for operational maturity: how you reproduce incidents, inspect logs/traces/metrics, isolate root cause, and add tests or guardrails so the same failure does not recur. This matters because marketplace systems depend on many microservices—consumer, merchant, dasher, dispatch, pricing, promotions—and a fragile aggregator can turn one slow dependency into a user-visible outage.
Core knowledge
- API aggregation usually means fanning a request out to several downstream services and merging the results into one response. Parallel fan-out changes latency from roughly $\sum_i t_i$ (the sum of the downstream latencies) to $\max_i t_i$ (the slowest call), but increases concurrency, timeout coordination, and partial-failure complexity.
- Concurrency primitives should fit the language: `CompletableFuture` in Java, `asyncio.gather` in Python, `Promise.allSettled` in JavaScript, goroutines plus `context.Context` in Go. Use bounded concurrency when fan-out can grow; unbounded parallelism can exhaust threads, sockets, or connection pools.
- Timeouts need both per-call and overall budgets. If the API has a `500ms` SLA, you might reserve `50ms` for merge/serialization and leave `450ms` as the shared deadline for downstream calls: parallel calls each get at most that budget, while sequential calls must split it. Always propagate cancellation so abandoned work stops once the client no longer needs it.
- Failure policies should be explicit. `WAIT_ALL` returns after all calls finish or time out, useful when partial data is acceptable. `FAIL_FAST` cancels outstanding work after a critical dependency fails, useful when the response is invalid without that dependency.
- Retries help only for transient failures such as `HTTP 503`, connection resets, or short timeouts. Use capped exponential backoff with jitter, for example the full-jitter form $\text{delay} = \text{random}(0, \min(\text{cap}, \text{base} \cdot 2^{\text{attempt}}))$. Do not retry non-idempotent writes unless you have an idempotency key.
- Retry amplification is a common distributed-systems bug. If one frontend request fans out to `5` services and each retries `3` times (4 attempts per service), the backend may see up to `20` calls per user request. Add retry budgets, per-service limits, and circuit breakers.
- Circuit breakers prevent repeatedly calling a known-bad dependency. A simple breaker has `closed`, `open`, and `half-open` states, using rolling failure rate or latency thresholds. Pair it with graceful degradation, not silent data corruption.
- Load balancing affects both reliability and debuggability. Round-robin is simple and fair for similar hosts; least-connections adapts to variable request cost; consistent hashing preserves cache locality. Bad routing can overload one instance while aggregate fleet metrics look healthy.
- Caching improves latency and protects downstream services, but adds failure modes: stale values, cache stampedes, hot keys, negative-cache poisoning, and inconsistent invalidation. Common mitigations include TTL jitter, request coalescing, stale-while-revalidate, and per-key locks in `Redis` or in-process caches.
- Observability should connect a single user request across services. Use correlation IDs, structured logs, distributed tracing via `OpenTelemetry`, and metrics like `request_rate`, `error_rate`, `p50`, `p95`, `p99`, timeout count, retry count, cache hit rate, and downstream saturation.
- Debugging production incidents should follow a disciplined loop: define the symptom, bound the blast radius, compare healthy versus unhealthy paths, form hypotheses, validate with data, mitigate first, then root-cause. Avoid changing multiple variables at once during mitigation.
- Test coverage should include deterministic unit tests for merge logic, fake downstream services for timeout/retry behavior, concurrency tests for cancellation/races, and integration tests for partial failure. For legacy modules, add characterization tests before refactoring behavior.
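The circuit-breaker states listed above can be illustrated with a minimal Python sketch. It is an assumption-laden simplification: it trips on a count of consecutive failures rather than a rolling failure rate, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical, not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is marked unhealthy")
            self.state = "half-open"  # let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the breaker.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

A caller would typically catch `CircuitOpenError` and return a degraded response (for example, omitting an optional field) instead of waiting on a dependency that is known to be failing.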
Worked example
For “Build an API aggregator with concurrency and retries,” start by clarifying the contract: “Which downstream calls are required versus optional? What is the overall latency budget? Are requests read-only and safe to retry? Should partial responses include error metadata?” Then state assumptions, such as three downstream HTTP services, a `500ms` overall timeout, and read-only idempotent calls.
A strong answer can be organized around four pillars: concurrent fan-out, timeout propagation, configurable failure policy, and retry control. For concurrent fan-out, describe launching one future per dependency with a bounded executor or async runtime, then merging results into a response object. For timeouts, use a parent deadline and derive per-call deadlines, ensuring cancellation propagates to outstanding futures when FAIL_FAST triggers.
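As a rough illustration of the fan-out, deadline, and failure-policy pillars, here is a sketch using Python's `asyncio`. The function name `aggregate`, its parameters, and the shape of `calls` (a dict of name to zero-argument coroutine function) are assumptions made for this example. It enforces one parent deadline rather than deriving separate per-call deadlines, and it bounds concurrency with a semaphore.

```python
import asyncio


async def aggregate(calls, required, overall_timeout, fail_fast=True, max_concurrency=8):
    """Run every call concurrently and return (results, errors) keyed by name.

    fail_fast=True behaves like FAIL_FAST; fail_fast=False behaves like WAIT_ALL.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(fn):
        async with semaphore:  # bounded concurrency
            return await fn()

    loop = asyncio.get_running_loop()
    deadline = loop.time() + overall_timeout
    tasks = {asyncio.create_task(bounded(fn)): name for name, fn in calls.items()}
    pending = set(tasks)
    results, errors = {}, {}
    leftover_reason = "timeout"

    while pending:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
        )
        required_failed = False
        for task in done:
            name = tasks[task]
            exc = task.exception()
            if exc is None:
                results[name] = task.result()
            else:
                errors[name] = repr(exc)  # typed failure metadata for the response
                required_failed = required_failed or name in required
        if fail_fast and required_failed:
            leftover_reason = "cancelled"
            break

    # Propagate cancellation to work that is no longer needed.
    for task in pending:
        task.cancel()
        errors[tasks[task]] = leftover_reason
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return results, errors
```

With `fail_fast=False`, the same loop collects successes and failures until the parent deadline, which matches the `WAIT_ALL` description; anything still running at the deadline is reported as `"timeout"`.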
For retries, propose retrying only transient errors with capped exponential backoff and jitter, while respecting the remaining request deadline. For failure policy, define WAIT_ALL as “collect successes and typed failures until the overall deadline,” and FAIL_FAST as “cancel siblings when a required dependency fails.” A concrete tradeoff to flag: aggressive retries can improve success rate but worsen tail latency and overload a degraded dependency, so retries should be limited by attempt count, deadline, and circuit-breaker state.
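The retry policy described here might look like the following sketch. `TransientError` is a hypothetical marker exception standing in for retryable failures, and the `base`, `cap`, and `max_attempts` defaults are illustrative values, not recommendations. The delay uses the full-jitter variant of capped exponential backoff.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 503, connection reset)."""


def backoff_delay(attempt, base=0.05, cap=1.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, deadline, max_attempts=3, sleep=time.sleep, clock=time.monotonic):
    """Call fn, retrying only TransientError, within attempt and deadline limits.

    deadline is an absolute time on the same clock as `clock`.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            # Give up if another attempt could not start before the deadline.
            if out_of_attempts or clock() + delay >= deadline:
                raise
            sleep(delay)
```

Any exception other than `TransientError` propagates immediately, which keeps permanent errors such as a `400` from being retried. A circuit-breaker check could be added before each attempt; it is left out here to keep the sketch short.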
Close by saying you would add tests using fake services that fail once then recover, hang until timeout, or return permanent `400` errors, and that these tests verify cancellation, retry count, and partial-response semantics. If you had more time, you would add metrics for per-dependency latency, retries, timeouts, and result quality so production behavior can be debugged without reading code.
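A fake downstream service for such tests could be as small as the sketch below. Everything here is hypothetical scaffolding: `FakeService` replays a scripted list of outcomes, `HttpError` is a stand-in exception, and `fetch_with_retry` is a placeholder for the real code under test. The hang-until-timeout case is omitted for brevity.

```python
import unittest


class HttpError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeService:
    """Scripted downstream: each call consumes the next outcome; the last one repeats."""

    def __init__(self, script):
        self.script = list(script)
        self.calls = 0

    def get(self):
        self.calls += 1
        outcome = self.script.pop(0) if len(self.script) > 1 else self.script[0]
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def fetch_with_retry(service, max_attempts=3):
    """Stand-in for the code under test: retry 5xx responses, never retry 4xx."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.get()
        except HttpError as err:
            if err.status < 500 or attempt == max_attempts:
                raise


class RetryBehaviourTest(unittest.TestCase):
    def test_fails_once_then_recovers(self):
        svc = FakeService([HttpError(503), {"eta": 12}])
        self.assertEqual(fetch_with_retry(svc), {"eta": 12})
        self.assertEqual(svc.calls, 2)

    def test_permanent_400_is_not_retried(self):
        svc = FakeService([HttpError(400)])
        with self.assertRaises(HttpError):
            fetch_with_retry(svc)
        self.assertEqual(svc.calls, 1)

    def test_gives_up_after_max_attempts(self):
        svc = FakeService([HttpError(503)])
        with self.assertRaises(HttpError):
            fetch_with_retry(svc, max_attempts=3)
        self.assertEqual(svc.calls, 3)
```

Asserting on `svc.calls` is what makes retry count part of the tested contract rather than an implementation detail.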
A second angle
For “Debug a cache incident end-to-end,” the same resilience concept appears as an operational debugging problem rather than a greenfield design problem. The first move is to quantify the symptom: did `p99` latency spike, did error rate increase, did downstream database load jump, or did users see stale/incorrect assignment data? Then compare cache metrics (hit rate, miss rate, evictions, hot keys, Redis CPU, connection count, and timeout rate) against the incident window.
The design instincts transfer: cache failure should degrade predictably, not cascade into a database overload or return corrupt data. Instead of discussing WAIT_ALL versus FAIL_FAST, you might discuss stale-while-revalidate versus bypassing cache, or whether negative caching caused valid entities to disappear temporarily. The best answer ends with both an immediate mitigation, such as disabling a bad key pattern or increasing TTL jitter, and a prevention step, such as adding cache-hit-rate alerts and load tests for cold-cache behavior.
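One way to keep a cold or expiring cache from stampeding the database is request coalescing combined with TTL jitter. The sketch below is an in-process illustration only, not a Redis client; `CoalescingCache`, `loader`, `ttl`, and `jitter` are hypothetical names chosen for this example.

```python
import asyncio
import random
import time


class CoalescingCache:
    def __init__(self, loader, ttl=60.0, jitter=0.1, clock=time.monotonic):
        self.loader = loader    # async function: key -> value (the expensive lookup)
        self.ttl = ttl
        self.jitter = jitter    # fraction of ttl randomly added or subtracted
        self.clock = clock
        self.values = {}        # key -> (value, expires_at)
        self.inflight = {}      # key -> Task currently loading that key

    async def get(self, key):
        entry = self.values.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        task = self.inflight.get(key)
        if task is None:
            # First miss starts the load; concurrent misses share the same task.
            task = asyncio.create_task(self._load(key))
            self.inflight[key] = task
        # shield: one cancelled caller does not cancel the load for the others.
        return await asyncio.shield(task)

    async def _load(self, key):
        try:
            value = await self.loader(key)
            spread = self.ttl * self.jitter
            expires_at = self.clock() + self.ttl + random.uniform(-spread, spread)
            self.values[key] = (value, expires_at)
            return value
        finally:
            self.inflight.pop(key, None)
```

With this shape, a burst of requests for the same missing key results in a single call to `loader`, and the randomized expiry keeps keys that were loaded together from expiring together.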
Common pitfalls
Pitfall: Treating retries as a universal fix.
A tempting answer is “retry failed calls three times” without distinguishing transient failures from permanent ones. A better answer says which status codes are retryable, caps retries by deadline, adds jitter, and explains how to avoid retry amplification during downstream degradation.
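To make "which failures are retryable" and "how to avoid amplification" concrete, here is a small sketch of a retry budget. The set of retryable status codes and the names `RetryBudget` and `should_retry` are assumptions for illustration, and the counters are never reset; a production version would more likely use a rolling window.

```python
RETRYABLE_STATUSES = {502, 503, 504}  # assumption: treat only these as transient


class RetryBudget:
    """Permit retries only while retries <= min_retries + ratio * requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries + 1 > allowed:
            return False  # budget exhausted: fail instead of adding load
        self.retries += 1
        return True


def should_retry(status, budget):
    # Permanent errors never consume budget because of short-circuit evaluation.
    return status in RETRYABLE_STATUSES and budget.try_spend()
```

When a dependency degrades and most calls start failing, the budget runs out quickly, so retry traffic stays proportional to normal traffic instead of multiplying it.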
Pitfall: Jumping to root cause before proving the symptom.
In debugging prompts, candidates often say “it’s probably the cache” or “the load balancer is uneven” too early. You will land better by first naming the observable evidence you would gather: request IDs, traces, per-host traffic, cache hit rate, recent deploys, config changes, and healthy-versus-unhealthy comparisons.
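As one example of gathering evidence before guessing, per-host traffic and error counts can be summarized from structured log records. This sketch assumes each record is a dict with hypothetical `host` and `status` fields; real log schemas will differ.

```python
from collections import defaultdict


def summarize_by_host(records):
    """Count requests and 5xx errors per host."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0})
    for rec in records:
        stats = totals[rec["host"]]
        stats["requests"] += 1
        if rec["status"] >= 500:
            stats["errors"] += 1
    return dict(totals)


def skewed_hosts(summary, factor=2.0):
    """Hosts receiving more than `factor` times the fair share of requests."""
    if not summary:
        return []
    total = sum(stats["requests"] for stats in summary.values())
    fair_share = total / len(summary)
    return sorted(
        host for host, stats in summary.items()
        if stats["requests"] > factor * fair_share
    )
```

A host that shows up in `skewed_hosts` while fleet-wide averages look normal is the kind of healthy-versus-unhealthy comparison that supports, or rules out, a routing hypothesis.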
Pitfall: Designing only the happy-path aggregator.
Some solutions show parallel calls and a merge function but skip cancellation, partial responses, timeouts, and testability. Interviewers want to see failure semantics as part of the API contract: what happens when one service is slow, wrong, unavailable, or returns after the overall deadline?
Connections
Interviewers may pivot from here into microservice system design, distributed tracing, rate limiting, idempotency, cache invalidation, or load-balancer algorithms. They may also ask you to write production-quality code for the aggregator, refactor legacy error handling, or design tests that reproduce a race, timeout, or transient downstream failure.
Further reading
- Release It! by Michael Nygard: practical patterns for timeouts, circuit breakers, bulkheads, and production failure modes.
- The Tail at Scale by Dean and Barroso: explains why tail latency dominates large fan-out systems and why hedging, deadlines, and isolation matter.
- AWS Architecture Blog, Exponential Backoff and Jitter: clear treatment of why jitter prevents synchronized retry storms.
Practice questions
- Design a resilient bootstrap API (DoorDash · Software Engineer · Technical Screen · medium)
- Build Resilient Aggregation and Debug Routing (DoorDash · Software Engineer · Onsite · medium)
- Investigate High Memory Usage (DoorDash · Software Engineer · Onsite · medium)
- Protect SLA and Choose Storage (DoorDash · Software Engineer · Technical Screen · medium)
- Design API that aggregates three downstream APIs (DoorDash · Software Engineer · Technical Screen · medium)
- Debug using logs and allocate tasks (DoorDash · Software Engineer · Onsite · Medium)
- Debug a cache incident end-to-end (DoorDash · Software Engineer · Technical Screen · hard)
- Debug a driver assignment bug (DoorDash · Software Engineer · Onsite · Medium)
- Design a service aggregator with robust error handling (DoorDash · Software Engineer · Technical Screen · hard)
- Debug and refactor a legacy module (DoorDash · Software Engineer · Onsite · Medium)
- Implement a simple service with tests (DoorDash · Software Engineer · Onsite · Medium)
- Build an API aggregator with concurrency and retries (DoorDash · Software Engineer · Onsite · hard)
Related concepts
- Scalable Service And Distributed System Design (System Design)
- API Integration And External Service Design (System Design)
- Idempotent API Design (System Design)
- RESTful API And HTTP Service Design (Software Engineering Fundamentals)
- API Design, Data Modeling, and Indexing (System Design)
- Reliability, Performance, And Infrastructure Operations (System Design)