Resilient API Aggregation And Operational Debugging

What's being tested

DoorDash is probing whether you can build and debug resilient service-to-service aggregation under real production constraints: partial failures, latency budgets, retries, concurrency limits, caches, and routing behavior. A strong Software Engineer answer shows you can turn multiple downstream calls into one reliable API response without creating retry storms or hiding failures. Interviewers are also looking for operational maturity: how you reproduce incidents, inspect logs/traces/metrics, isolate root cause, and add tests or guardrails so the same failure does not recur. This matters because marketplace systems depend on many microservices—consumer, merchant, dasher, dispatch, pricing, promotions—and a fragile aggregator can turn one slow dependency into a user-visible outage.

Core knowledge

API aggregation usually means fanning one request out to several downstream services and merging the results into one response. Parallel fan-out changes latency from roughly $\sum t_i$ to $\max(t_i)$, but adds concurrency, timeout-coordination, and partial-failure complexity.
Concurrency primitives should fit the language: CompletableFuture in Java, asyncio.gather in Python, Promise.allSettled in JavaScript, goroutines plus context.Context in Go. Use bounded concurrency when fan-out can grow; unbounded parallelism can exhaust threads, sockets, or connection pools.
Timeouts need both per-call and overall budgets. If the API has a 500ms SLA, you might reserve 50ms for merge/serialization and leave 450ms for downstream work: parallel calls share that window, while sequential calls must divide it. Always propagate cancellation so that abandoned work stops once the client no longer needs the result.
Failure policies should be explicit. WAIT_ALL returns after all calls finish or time out, useful when partial data is acceptable. FAIL_FAST cancels outstanding work after a critical dependency fails, useful when the response is invalid without that dependency.
Retries help only for transient failures such as HTTP 503, connection resets, or short timeouts. Use capped exponential backoff with jitter: $\text{delay} = \min(\text{base} \cdot 2^{\text{attempt}}, \text{cap}) + \text{random}(0, \text{jitter})$. Do not retry non-idempotent writes unless you have an idempotency key.
Retry amplification is a common distributed-systems bug. If one frontend request fans out to 5 services and each call is retried up to 3 times, the backends may see up to 5 × (1 + 3) = 20 calls per user request, and the factor multiplies again if several layers of the stack each retry. Add retry budgets, per-service limits, and circuit breakers.
Circuit breakers prevent repeatedly calling a known-bad dependency. A simple breaker has closed, open, and half-open states, using rolling failure rate or latency thresholds. Pair it with graceful degradation, not silent data corruption.
Load balancing affects both reliability and debuggability. Round-robin is simple and fair for similar hosts; least-connections adapts to variable request cost; consistent hashing preserves cache locality. Bad routing can overload one instance while aggregate fleet metrics look healthy.
Caching improves latency and protects downstream services, but adds failure modes: stale values, cache stampedes, hot keys, negative-cache poisoning, and inconsistent invalidation. Common mitigations include TTL jitter, request coalescing, stale-while-revalidate, and per-key locks in Redis or in-process caches.
Observability should connect a single user request across services. Use correlation IDs, structured logs, distributed tracing via OpenTelemetry, and metrics like request_rate, error_rate, p50, p95, p99, timeout count, retry count, cache hit rate, and downstream saturation.
Debugging production incidents should follow a disciplined loop: define the symptom, bound the blast radius, compare healthy versus unhealthy paths, form hypotheses, validate with data, mitigate first, then root-cause. Avoid changing multiple variables at once during mitigation.
Test coverage should include deterministic unit tests for merge logic, fake downstream services for timeout/retry behavior, concurrency tests for cancellation/races, and integration tests for partial failure. For legacy modules, add characterization tests before refactoring behavior.
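
The closed, open, and half-open breaker states described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production breaker: it counts consecutive failures instead of the rolling failure rate or latency thresholds mentioned above, it lets every call through while half-open (real breakers usually admit a single probe), and the class name, parameters, and injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Illustrative breaker with closed, open, and half-open states."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before probing again
        self.clock = clock                          # injectable for deterministic tests
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        # Closed and half-open let calls through; open rejects immediately.
        return self.state != "open"

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            # The probe failed: reopen and restart the cool-down.
            self.opened_at = self.clock()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each downstream call, return a degraded response when it is false, and report the outcome with `record_success()` or `record_failure()`.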

Worked example

For the prompt “Build an API aggregator with concurrency and retries,” start by clarifying the contract: “Which downstream calls are required versus optional? What is the overall latency budget? Are requests read-only and safe to retry? Should partial responses include error metadata?” Then state assumptions, such as three downstream HTTP services, a 500ms overall timeout, and read-only idempotent calls.

A strong answer can be organized around four pillars: concurrent fan-out, timeout propagation, configurable failure policy, and retry control. For concurrent fan-out, describe launching one future per dependency with a bounded executor or async runtime, then merging results into a response object. For timeouts, use a parent deadline and derive per-call deadlines, ensuring cancellation propagates to outstanding futures when FAIL_FAST triggers.
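
One way to sketch the fan-out, parent deadline, and failure-policy pillars together is with `asyncio`. The function name `aggregate`, the result dictionary shape, and the error labels are assumptions made for illustration. Bounded concurrency (for example an `asyncio.Semaphore`) and per-call deadlines are omitted to keep the sketch short.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Set

Call = Callable[[], Awaitable[object]]


async def aggregate(calls: Dict[str, Call], required: Set[str],
                    overall_timeout_s: float, fail_fast: bool) -> Dict[str, dict]:
    """Fan out concurrently; return a result or typed error per dependency."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + overall_timeout_s      # parent deadline
    tasks = {asyncio.ensure_future(fn()): name for name, fn in calls.items()}
    pending = set(tasks)
    results: Dict[str, dict] = {}
    aborted = False
    while pending and not aborted:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            name = tasks[task]
            if task.cancelled():
                results[name] = {"ok": False, "error": "cancelled"}
            elif task.exception() is not None:
                results[name] = {"ok": False, "error": repr(task.exception())}
                # FAIL_FAST: a required dependency failed, so stop waiting.
                aborted = aborted or (fail_fast and name in required)
            else:
                results[name] = {"ok": True, "value": task.result()}
    reason = "cancelled" if aborted else "deadline_exceeded"
    for task in pending:
        task.cancel()                               # propagate cancellation to siblings
        results[tasks[task]] = {"ok": False, "error": reason}
    await asyncio.gather(*pending, return_exceptions=True)
    return results
```

With `fail_fast=False` the same function behaves as WAIT_ALL: it keeps collecting successes and typed failures until every call finishes or the overall deadline passes.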

For retries, propose retrying only transient errors with capped exponential backoff and jitter, while respecting the remaining request deadline. For failure policy, define WAIT_ALL as “collect successes and typed failures until the overall deadline,” and FAIL_FAST as “cancel siblings when a required dependency fails.” A concrete tradeoff to flag: aggressive retries can improve success rate but worsen tail latency and overload a degraded dependency, so retries should be limited by attempt count, deadline, and circuit-breaker state.
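
The retry policy described here could look roughly like the following. It is a sketch under assumptions: `HttpError`, `RETRYABLE_STATUS`, and `call_with_retries` are hypothetical names, the set of retryable status codes is one plausible choice, and circuit-breaker checks are left out. A timeout of the remaining deadline is treated as final and is not retried.

```python
import asyncio
import random

RETRYABLE_STATUS = {502, 503, 504}  # assumption: treated as transient


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 1.0,
                  jitter_s: float = 0.05) -> float:
    # delay = min(base * 2^attempt, cap) + random(0, jitter)
    return min(base_s * (2 ** attempt), cap_s) + random.uniform(0, jitter_s)


async def call_with_retries(fn, max_attempts: int, deadline_s: float,
                            base_s: float = 0.05):
    """Retry transient failures without ever sleeping past the deadline."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    loop = asyncio.get_running_loop()
    deadline = loop.time() + deadline_s
    for attempt in range(max_attempts):
        try:
            # Each attempt gets only the time left in the overall budget.
            return await asyncio.wait_for(fn(), timeout=deadline - loop.time())
        except HttpError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise                       # permanent error or attempts exhausted
        delay = backoff_delay(attempt, base_s=base_s)
        if loop.time() + delay >= deadline:
            raise asyncio.TimeoutError("next retry would exceed the deadline")
        await asyncio.sleep(delay)
```

Bounding retries by both attempt count and remaining deadline is what keeps a degraded dependency from stretching tail latency past the budget.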

Close by saying you would add tests using fake services that fail once then recover, hang until timeout, or return permanent 400 errors, and that these tests verify cancellation, retry count, and partial-response semantics. If you had more time, you would add metrics for per-dependency latency, retries, timeouts, and result quality so production behavior can be debugged without reading code.
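
A scripted fake downstream makes those scenarios deterministic. The sketch below is one possible shape, with hypothetical names: `FakeService` replays a script in which an integer means "raise that HTTP status", the string `"hang"` means "block until cancelled", and any other string is returned as the response.

```python
import asyncio
from typing import List, Union


class FakeService:
    """Scripted fake: each call consumes the next step; the last step repeats."""

    def __init__(self, script: List[Union[int, str]]) -> None:
        self.script = script    # e.g. [503, "ok"], ["hang"], or [400]
        self.calls = 0          # how many times the fake was invoked
        self.cancelled = 0      # how many hanging calls were cancelled

    async def __call__(self) -> str:
        step = self.script[min(self.calls, len(self.script) - 1)]
        self.calls += 1
        if step == "hang":
            try:
                await asyncio.sleep(3600)   # simulates a stuck dependency
            except asyncio.CancelledError:
                self.cancelled += 1         # lets tests assert cancellation happened
                raise
        if isinstance(step, int):
            raise RuntimeError(f"HTTP {step}")
        return step


async def check_hang_is_cancelled() -> FakeService:
    """Example check: a hanging call is cancelled when its timeout fires."""
    svc = FakeService(["hang"])
    try:
        await asyncio.wait_for(svc(), timeout=0.01)
    except asyncio.TimeoutError:
        pass
    return svc
```

Because the fake records `calls` and `cancelled`, a test can assert on retry count and cancellation directly instead of inferring them from timing.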

A second angle

For the prompt “Debug a cache incident end-to-end,” the same resilience concept appears as an operational debugging problem rather than a greenfield design problem. The first move is to quantify the symptom: did p99 latency spike, did error rate increase, did downstream database load jump, or did users see stale/incorrect assignment data? Then compare cache metrics—hit rate, miss rate, evictions, hot keys, Redis CPU, connection count, and timeout rate—against the incident window.

The design instincts transfer: cache failure should degrade predictably, not cascade into a database overload or return corrupt data. Instead of discussing WAIT_ALL versus FAIL_FAST, you might discuss stale-while-revalidate versus bypassing cache, or whether negative caching caused valid entities to disappear temporarily. The best answer ends with both an immediate mitigation, such as disabling a bad key pattern or increasing TTL jitter, and a prevention step, such as adding cache-hit-rate alerts and load tests for cold-cache behavior.
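
Request coalescing is one of the standard defenses against a cold-cache stampede: concurrent misses for the same key share a single backend load. The following in-process sketch assumes an `asyncio` service. `SingleFlight` and `loader` are hypothetical names, and there is no TTL or error caching here.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""

    def __init__(self) -> None:
        self._inflight: Dict[str, "asyncio.Future[object]"] = {}

    async def load(self, key: str,
                   loader: Callable[[str], Awaitable[object]]) -> object:
        existing = self._inflight.get(key)
        if existing is not None:
            # A load is already running: wait for it instead of calling the
            # backend again. shield() keeps one cancelled waiter from
            # cancelling the shared load for everyone else.
            return await asyncio.shield(existing)
        task = asyncio.ensure_future(loader(key))
        self._inflight[key] = task
        try:
            return await task
        finally:
            # Clear the slot so a later miss triggers a fresh load.
            self._inflight.pop(key, None)
```

During a cold start, ten simultaneous misses on one hot key would then produce one database query instead of ten.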

Common pitfalls

Pitfall: Treating retries as a universal fix.

A tempting answer is “retry failed calls three times” without distinguishing transient failures from permanent ones. A better answer says which status codes are retryable, caps retries by deadline, adds jitter, and explains how to avoid retry amplification during downstream degradation.
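
A retry budget is one concrete way to cap amplification: retries are allowed only while they remain a small fraction of first attempts. The sketch below is deliberately simplified and its names are hypothetical. It uses cumulative counters, whereas a production budget would normally use a rolling time window per downstream service.

```python
class RetryBudget:
    """Permit retries only up to a percentage of observed first attempts."""

    def __init__(self, max_retry_percent: int = 10, min_retries: int = 0) -> None:
        self.max_retry_percent = max_retry_percent  # e.g. 10 means 10% of requests
        self.min_retries = min_retries              # floor so low traffic can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        # Call once per first attempt (not per retry).
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        # Integer arithmetic avoids floating-point surprises at the boundary.
        allowed = max(self.min_retries,
                      self.requests * self.max_retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False    # budget exhausted: fail or degrade instead of retrying
```

When the dependency is healthy the budget is rarely touched. When it degrades, the budget runs out quickly and the extra load stays bounded.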

Pitfall: Jumping to root cause before proving the symptom.

In debugging prompts, candidates often say “it’s probably the cache” or “the load balancer is uneven” too early. A stronger answer first names the observable evidence you would gather: request IDs, traces, per-host traffic, cache hit rate, recent deploys, config changes, and healthy-versus-unhealthy comparisons.
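
Comparing healthy and unhealthy hosts can be as simple as grouping structured request logs by host. The sketch below assumes each record is a dictionary with hypothetical `host`, `latency_ms`, and `status` fields, and uses a nearest-rank p99. A real investigation would more likely run the equivalent query in a logging or metrics system.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List


def per_host_summary(records: Iterable[dict]) -> Dict[str, dict]:
    """Summarize request count, 5xx error rate, and p99 latency per host."""
    by_host: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        by_host[rec["host"]].append(rec)

    summary: Dict[str, dict] = {}
    for host, recs in by_host.items():
        latencies = sorted(r["latency_ms"] for r in recs)
        # Nearest-rank percentile: the smallest value covering 99% of samples.
        idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
        errors = sum(1 for r in recs if r["status"] >= 500)
        summary[host] = {
            "count": len(recs),
            "error_rate": errors / len(recs),
            "p99_ms": latencies[idx],
        }
    return summary
```

A single host with a much higher count, error rate, or p99 than its peers points toward routing or instance-level problems that fleet-wide averages can hide.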

Pitfall: Designing only the happy-path aggregator.

Some solutions show parallel calls and a merge function but skip cancellation, partial responses, timeouts, and testability. Interviewers want to see failure semantics as part of the API contract: what happens when one service is slow, wrong, unavailable, or returns after the overall deadline?

Connections

Interviewers may pivot from here into microservice system design, distributed tracing, rate limiting, idempotency, cache invalidation, or load-balancer algorithms. They may also ask you to write production-quality code for the aggregator, refactor legacy error handling, or design tests that reproduce a race, timeout, or transient downstream failure.
