Design a resilient bootstrap API
Company: DoorDash
Role: Software Engineer
Category: System Design
Difficulty: medium
Interview Round: Technical Screen
## Design a Resilient Bootstrap API
When a client app loads, it needs to fetch everything required to render the first screen in a single call. That data lives behind three separate internal services, so you will build an aggregator (a "bootstrap" endpoint) that fans out to them and composes one unified response.
### Downstream services
You are given three internal services, each exposed as an internal HTTP API:
1. **User Service** — `GET /user-to-consumer?user_id=...` → returns `{ consumer_id, user_profile... }`
2. **Payments Service** — `GET /payment-info?consumer_id=...` → returns `{ payment_methods... }`
3. **Address Service** — `GET /address-info?consumer_id=...` → returns `{ addresses... }`
Note the dependency chain: the Payments and Address services are keyed on `consumer_id`, which only the User Service can produce from a `user_id`.
### What to build
Design and implement a **Bootstrap API**:
- **Endpoint:** `GET /bootstrap?user_id=...`
- **Behavior:** take the input `user_id`, fetch the corresponding data from the downstream services, and return a single response that aggregates:
  - user / profile information
  - payment information
  - address information
### Core requirement
The endpoint must be **as resilient to failures as possible**. Downstream services may **respond slowly, time out, return errors, or be intermittently or partially unavailable**, and the bootstrap response should **degrade gracefully** rather than fail outright.
---
### Constraints & Assumptions
Anchor your design with the following working assumptions (confirm or adjust them with the interviewer):
- The endpoint is on the **client's first-paint critical path**, so it is latency-sensitive — assume a target such as **p99 ≤ ~600 ms** end-to-end.
- The operation is a **read-only `GET`** (inherently idempotent).
- Typical microservice constraints apply: bounded thread/connection pools, shared infrastructure, no distributed transactions.
- Treat exact SLO numbers, retry counts, and TTLs as **tunable** — state the figures you choose rather than leaving them implicit.
```hint Frame before you build
Before drawing boxes, **classify each downstream dependency** by how much the response depends on it. The three are not interchangeable — look at the data-flow and decide what the endpoint can still return when each one is unavailable, then let that classification drive the entire design.
```
---
## Part 1 — The API Contract
Define the response shape and, crucially, what happens on **partial failures** — when some downstream data is available and some is not. Specify the HTTP-level and body-level semantics a client can program against.
```hint What the contract must express
A bare `null` for a missing section is ambiguous. Make sure a client can tell **"this section is genuinely empty"** apart from **"we couldn't load this section"** — consider a per-section status/source field rather than relying on presence alone.
```
```hint HTTP status vs. body status
Decide what the **top-level HTTP status code** should mean. Think about whether *every* downstream failure deserves the same code, or whether the failures differ in how much they actually compromise the response the client asked for.
```
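One possible shape for such a contract is sketched below. The field names (`status`, `data`, `degraded`), the three status values, and the choice to return `503` only when the user section is missing are all assumptions made for illustration; the problem statement does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class SectionStatus(str, Enum):
    OK = "ok"                    # loaded fresh from the downstream
    STALE = "stale"              # served from a fallback cache
    UNAVAILABLE = "unavailable"  # could not be loaded at all


@dataclass
class Section:
    status: SectionStatus
    # An empty list/dict means "genuinely empty"; None means "not loaded".
    data: Optional[Any] = None


def build_response(user: Section, payments: Section, addresses: Section) -> tuple[int, dict]:
    """Return (http_status, body). Assumption: only the user section is mandatory."""
    sections = {"user": user, "payments": payments, "addresses": addresses}
    body = {
        "degraded": any(s.status is not SectionStatus.OK for s in sections.values()),
        "sections": {
            name: {"status": s.status.value, "data": s.data}
            for name, s in sections.items()
        },
    }
    # Without the user section there is nothing useful to return, so fail the call.
    http_status = 503 if user.status is SectionStatus.UNAVAILABLE else 200
    return http_status, body
```

With this shape, a client can distinguish `{"status": "ok", "data": []}` (the user has no saved addresses) from `{"status": "unavailable", "data": null}` (the lookup failed) without inspecting the HTTP status at all.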
## Part 2 — Orchestration: Ordering & Concurrency
Describe how you sequence and parallelize the three calls, and how you bound total latency.
```hint Where the ordering is forced
The dependency chain dictates that part of the work **cannot** be parallelized. Identify the forced-sequential step, then ask what can fan out *after* it.
```
```hint Bounding latency
For the parallel phase, total time should be `max()` of the calls, not their `sum()`. Also consider a **join deadline** so the slowest straggler can't hold the whole response hostage — what happens to a call that hasn't returned by the deadline?
```
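The two-phase flow described in these hints can be sketched with `asyncio`. This is a minimal illustration, not a reference implementation: the `get_user`, `get_payments`, and `get_addresses` callables are hypothetical stand-ins for real HTTP clients, and the timeout values are placeholders.

```python
import asyncio
from typing import Any, Awaitable, Callable

Fetch = Callable[[str], Awaitable[Any]]


async def bootstrap(user_id: str, get_user: Fetch, get_payments: Fetch,
                    get_addresses: Fetch, user_timeout: float = 0.2,
                    join_deadline: float = 0.3) -> dict:
    # Phase 1 (forced sequential): nothing else can start without consumer_id.
    user = await asyncio.wait_for(get_user(user_id), timeout=user_timeout)
    consumer_id = user["consumer_id"]

    # Phase 2 (parallel): both calls start together, so latency is max(), not sum().
    tasks = {
        "payments": asyncio.create_task(get_payments(consumer_id)),
        "addresses": asyncio.create_task(get_addresses(consumer_id)),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=join_deadline)

    # Stragglers are cancelled at the join deadline instead of holding the response.
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)

    result = {"user": user}
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            result[name] = task.result()
        else:
            result[name] = None  # errored or missed the deadline
    return result
```

A failure or timeout in phase 1 propagates as an exception, which the caller would map to a top-level error; failures in phase 2 only blank out their own section.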
## Part 3 — Reliability Strategies
Cover **timeouts, retries, circuit breakers, fallbacks, and caching**, and how they compose.
```hint Start with the cheapest control
Of all the reliability mechanisms, the single most important one is also essentially free: per-call **client-side timeouts**, tuned per dependency. Without them, a slow downstream ties up your own threads and connections, and the slowness cascades into your service.
```
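One way to keep per-call timeouts consistent with the overall latency target is to derive each timeout from a shared request deadline. The sketch below assumes a 600 ms budget in line with the p99 target above; the `CALL_TIMEOUTS` values and the `Deadline` class are hypothetical and would need tuning against measured latencies.

```python
import time
from typing import Callable

# Hypothetical per-dependency caps in seconds.
CALL_TIMEOUTS = {"user": 0.20, "payments": 0.25, "addresses": 0.25}


class Deadline:
    """Tracks the time left in one request's overall budget."""

    def __init__(self, budget_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, dependency: str) -> float:
        # Never wait longer than the dependency's own cap
        # or the request's remaining budget, whichever is smaller.
        return min(CALL_TIMEOUTS[dependency], self.remaining())
```

A handler would create `Deadline(0.6)` on entry and pass `deadline.timeout_for("payments")` to the HTTP client, so a slow first call automatically shrinks what later calls are allowed to spend.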
```hint Retry discipline
Retries are safe here (idempotent `GET`), but unbounded retries amplify load exactly when a dependency is already struggling. Think about: which failures are worth retrying (transient vs. deterministic `4xx`), a small bounded retry count with **jittered backoff**, and skipping a retry that wouldn't fit the remaining time budget.
```
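The retry rules in this hint can be sketched as a small wrapper. `TransientError` is a hypothetical marker exception for retryable failures, `remaining` is an assumed callable that reports the time left in the request budget, and the default values are placeholders rather than recommendations.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx, connection resets)."""


async def call_with_retry(call: Callable[[], Awaitable[Any]],
                          remaining: Callable[[], float],
                          max_attempts: int = 2, base_delay: float = 0.02,
                          expected_call_s: float = 0.05) -> Any:
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_attempts:
                raise  # bounded: give up after the last allowed attempt
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            if delay + expected_call_s > remaining():
                raise  # a retry would not fit the remaining time budget
            await asyncio.sleep(delay)
```

Any exception other than `TransientError` (for example, one representing a deterministic `4xx`) is not caught, so it propagates after a single attempt.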
```hint Stop cascades and define the fallback ladder
Consider a **circuit breaker per downstream** (fast-fail instead of waiting on a timeout during an outage) plus **bulkheads** (per-dependency concurrency limits) so one sick dependency can't starve the others. Then design an explicit **graceful-degradation ladder** for failures — and be careful what you allow into the cache, since the cache doubles as your fallback.
```
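A per-downstream circuit breaker can be sketched in a few lines. This version is deliberately simplified and rests on stated assumptions: it counts consecutive failures rather than an error rate over a window, it is not thread-safe, and once the cool-down elapses it admits all traffic rather than a single probe. The class and parameter names are illustrative only.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal per-downstream breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cool-down has elapsed, let requests through again (half-open).
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # a success closes the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
```

When `allow_request()` returns `False`, the caller skips the network call entirely and moves straight to its fallback (for example, a cached section or an "unavailable" marker), which is what keeps an outage from consuming the timeout budget on every request.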
## Part 4 — Observability & Operational Considerations
Describe what you measure, alert on, and can tune at runtime.
```hint What to instrument
If some downstream failures are *designed* not to surface as HTTP `5xx`, then alerting only on `5xx` would hide real trouble. Think about a per-section **degradation rate**, per-downstream latency/error breakdowns, circuit-breaker state transitions, and distributed tracing with a propagated request id.
```
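A per-section degradation rate can be computed from simple counters keyed by section and outcome. The sketch below uses an in-memory `Counter` as a stand-in for a real metrics client; the class name and the status strings (`"ok"`, `"stale"`, `"unavailable"`) are assumptions for illustration.

```python
from collections import Counter


class SectionMetrics:
    """In-memory stand-in for a metrics client, counting outcomes per section."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, section: str, status: str) -> None:
        self._counts[(section, status)] += 1

    def degradation_rate(self, section: str) -> float:
        # Share of responses where this section was anything other than "ok".
        total = sum(n for (name, _), n in self._counts.items() if name == section)
        ok = self._counts[(section, "ok")]
        return 0.0 if total == 0 else (total - ok) / total
```

An alert on `degradation_rate("payments")` crossing a threshold would fire even while the endpoint keeps returning `200`, which is the gap that `5xx`-only alerting leaves open.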
---
### Clarifying Questions to Ask
A strong candidate scopes the problem before designing. Reasonable questions include:
- **How does each downstream affect what the screen can render?** Can the response still be useful if one or two sections are missing, or does the client treat all three as mandatory? Which call, if any, blocks every other piece of the response?
- **Who calls this and how is it authenticated?** Is `user_id` trusted from the query string, or must it be derived from the authenticated principal?
- **What is the latency SLO** for first paint, and what is the per-call budget for each downstream?
- **Are partial responses acceptable to the client**, or must all three sections be present atomically?
- **What is the staleness tolerance** per section — can addresses / profile / payment methods be served from cache, and for how long?
- **What is the read volume / fan-out scale**, and are there per-user rate limits to respect?
- **Are there correctness constraints on payments specifically** (e.g. must we never display a removed or expired payment method)?
### What a Strong Answer Covers
Signals an interviewer is looking for (these are **dimensions to evaluate**, not the answers themselves):
- **How the candidate reasons about each dependency's role** in the response, and whether the design's structure follows from that reasoning rather than treating all three calls symmetrically.
- **How the API contract behaves under partial failure** — whether a client can programmatically tell apart the distinct outcomes a section can have, and how the HTTP-level semantics are chosen and justified.
- **The quality of the orchestration** — handling of the forced ordering imposed by the dependency chain, the concurrency model, and how total latency is bounded under a time budget.
- **The coherence of the reliability stack** — timeouts, retries, circuit breakers, bulkheads, and fallbacks — and whether the candidate explains *why each one* protects the caller and how they compose.
- **How caching is governed** — the policy for what may and may not enter the cache, how TTLs relate to data volatility, and the correctness risks the candidate anticipates.
- **Failure-mode reasoning** — distinguishing transient from deterministic failures, distinguishing a genuine empty result from an error, and surfacing the security / correctness edge cases.
- **Observability and operational levers** — what is measured given that failures may not appear as `5xx`, and which knobs are tunable at runtime.
- **Explicit tradeoffs** articulated for each lever.
### Follow-up Questions
- How does your design change at **100x read volume**? What breaks first, and where do you add caching or capacity?
- When a circuit breaker **closes after an outage**, what prevents a thundering herd from re-overwhelming the recovered dependency?
- The `GET /bootstrap?user_id=...` signature lets a caller name *any* `user_id`. What is the **authorization risk**, and how do you close it?
- For **payment methods**, is serving a cached (possibly stale) list ever worse than serving nothing? How do you decide?
- How do you propagate the **remaining time budget** to downstream services so they can self-cancel work they can no longer deliver in time?
Assume typical microservice constraints throughout, and state any further assumptions you make.
Quick Answer: This question evaluates resilient API aggregation, fault-tolerant microservice orchestration, and reliability engineering, with a focus on defining API contract semantics and handling partial failures at the system-design (service-to-service) level.