Design a resilient bootstrap API

Last updated: May 22, 2026

Quick Overview

This question tests resilient API aggregation, fault-tolerant microservice orchestration, and reliability engineering. It focuses on defining API contract semantics and handling partial failures at the system-design level, across service-to-service boundaries. Category: System Design.

Company: DoorDash

Role: Software Engineer

Category: System Design

Difficulty: medium

Interview Round: Technical Screen

Hints

Constraints & Assumptions

```hint Frame before you build
Before drawing boxes, **classify each downstream dependency** by how much the response depends on it. The three are not interchangeable: look at the data-flow and decide what the endpoint can still return when each one is unavailable, then let that classification drive the entire design.
```

Part 1 (The API Contract)

```hint What the contract must express
A bare `null` for a missing section is ambiguous. Make sure a client can tell **"this section is genuinely empty"** apart from **"we couldn't load this section"**. Consider a per-section status/source field rather than relying on presence alone.
```

```hint HTTP status vs. body status
Decide what the **top-level HTTP status code** should mean. Think about whether *every* downstream failure deserves the same code, or whether the failures differ in how much they actually compromise the response the client asked for.
```

Part 2 (Orchestration: Ordering & Concurrency)

```hint Where the ordering is forced
The dependency chain dictates that part of the work **cannot** be parallelized. Identify the forced-sequential step, then ask what can fan out *after* it.
```

```hint Bounding latency
For the parallel phase, total time should be `max()` of the calls, not their `sum()`. Also consider a **join deadline** so the slowest straggler can't hold the whole response hostage. What happens to a call that hasn't returned by the deadline?
```

Part 3 (Reliability Strategies)

```hint Start with the cheapest control
Of all the reliability mechanisms, one is the single most important and is essentially free: per-call **client-side timeouts** tuned per dependency. Without it, a slow downstream consumes your own threads/connections and the slowness cascades into your service.
```

```hint Retry discipline
Retries are safe here (idempotent `GET`), but unbounded retries amplify load exactly when a dependency is already struggling. Think about: which failures are worth retrying (transient vs. deterministic `4xx`), a small bounded retry count with **jittered backoff**, and skipping a retry that wouldn't fit the remaining time budget.
```

```hint Stop cascades and define the fallback ladder
Consider a **circuit breaker per downstream** (fast-fail instead of waiting on a timeout during an outage) plus **bulkheads** (per-dependency concurrency limits) so one sick dependency can't starve the others. Then design an explicit **graceful-degradation ladder** for failures, and be careful what you allow into the cache, since the cache doubles as your fallback.
```

Part 4 (Observability & Operational Considerations)

```hint What to instrument
If some downstream failures are *designed* not to surface as HTTP `5xx`, then alerting only on `5xx` would hide real trouble. Think about a per-section **degradation rate**, per-downstream latency/error breakdowns, circuit-breaker state transitions, and distributed tracing with a propagated request id.
```

Feb 13, 2026, 12:00 AM
Design a Resilient Bootstrap API

When a client app loads, it needs to fetch everything required to render the first screen in a single call. That data lives behind three separate internal services, so you will build an aggregator (a "bootstrap" endpoint) that fans out to them and composes one unified response.

Downstream services

You are given three internal services (internal APIs):

  1. User Service — GET /user-to-consumer?user_id=... → returns { consumer_id, user_profile... }
  2. Payments Service — GET /payment-info?consumer_id=... → returns { payment_methods... }
  3. Address Service — GET /address-info?consumer_id=... → returns { addresses... }

Note the dependency chain: the Payments and Address services are keyed on consumer_id, which only the User Service can produce from a user_id.

What to build

Design and implement a Bootstrap API:

  • Endpoint: GET /bootstrap?user_id=...
  • Behavior: take the input user_id, fetch the corresponding data from the downstream services, and return a single response that aggregates:
    • user / profile information
    • payment information
    • address information

Core requirement

The endpoint must be as resilient to failures as possible. Downstream services may be slow, timing out, erroring, or intermittently / partially unavailable, and the bootstrap response should degrade gracefully rather than fail outright.

Constraints & Assumptions

Anchor your design with the following working assumptions (confirm or adjust them with the interviewer):

  • The endpoint is on the client's first-paint critical path, so it is latency-sensitive; assume a target such as p99 ≤ ~600 ms end-to-end.
  • The operation is a read-only GET (inherently idempotent).
  • Typical microservice constraints apply: bounded thread/connection pools, shared infrastructure, no distributed transactions.
  • Treat exact SLO numbers, retry counts, and TTLs as tunable — state the figures you choose rather than leaving them implicit.

Part 1 — The API Contract

Define the response shape and, crucially, what happens on partial failures — when some downstream data is available and some is not. Specify the HTTP-level and body-level semantics a client can program against.
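
One possible shape for such a contract is sketched below in Python. It is an illustration under stated assumptions, not a reference answer: the names `SectionStatus`, `Section`, `BootstrapResponse`, and the `degraded` flag are hypothetical, and the choice to return `503` only when the user section is missing assumes the interviewer agrees that the User Service is the one hard dependency.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class SectionStatus(str, Enum):
    OK = "ok"                    # fresh data from the downstream
    STALE = "stale"              # served from cache after a live failure
    UNAVAILABLE = "unavailable"  # could not be loaded at all


@dataclass
class Section:
    status: SectionStatus
    data: Optional[Any] = None   # [] or {} with status "ok" means genuinely empty
    error: Optional[str] = None  # machine-readable reason when status is not "ok"


@dataclass
class BootstrapResponse:
    user: Section
    payments: Section
    addresses: Section

    def http_status(self) -> int:
        # Assumption: only a missing user section compromises the whole
        # response, because consumer_id is needed for everything else.
        if self.user.status is SectionStatus.UNAVAILABLE:
            return 503
        return 200

    def to_body(self) -> dict:
        body = asdict(self)
        # Top-level flag so clients can detect partial results cheaply.
        body["degraded"] = any(
            s.status is not SectionStatus.OK
            for s in (self.user, self.payments, self.addresses)
        )
        return body
```

With this shape, an empty address list arrives as `status: "ok"` with `data: []`, while a failed lookup arrives as `status: "unavailable"` with `data: null`, so the two cases stay distinguishable.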

Part 2 — Orchestration: Ordering & Concurrency

Describe how you sequence and parallelize the three calls, and how you bound total latency.
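
The sequencing can be sketched with `asyncio`: one forced-sequential user lookup, then a parallel fan-out where each call has its own deadline. This is a minimal sketch, assuming the three fetchers are hypothetical async client functions supplied by the caller; the timeout figures (0.2 s and 0.3 s) are placeholders chosen to fit inside a 600 ms budget, not values from the problem statement.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Fetcher = Callable[[str], Awaitable[Any]]  # hypothetical downstream client
Section = Dict[str, Any]


async def _section(call: Awaitable[Any], timeout_s: float) -> Section:
    """Await one downstream call under its own deadline; never raises."""
    try:
        data = await asyncio.wait_for(call, timeout=timeout_s)
        return {"status": "ok", "data": data}
    except asyncio.TimeoutError:
        # wait_for cancels the straggler, so it stops holding resources.
        return {"status": "unavailable", "error": "timeout"}
    except Exception as exc:
        return {"status": "unavailable", "error": type(exc).__name__}


async def bootstrap(
    user_id: str,
    fetch_user: Fetcher,
    fetch_payments: Fetcher,
    fetch_addresses: Fetcher,
    user_timeout_s: float = 0.2,
    fanout_timeout_s: float = 0.3,
) -> Dict[str, Section]:
    # Forced-sequential step: consumer_id only comes from the User Service.
    user = await _section(fetch_user(user_id), user_timeout_s)
    consumer_id = (user.get("data") or {}).get("consumer_id")
    if user["status"] != "ok" or not consumer_id:
        skipped = {"status": "unavailable", "error": "user_lookup_failed"}
        return {"user": user, "payments": skipped, "addresses": dict(skipped)}

    # Parallel phase: latency is max() of the two calls, capped by the deadline.
    payments, addresses = await asyncio.gather(
        _section(fetch_payments(consumer_id), fanout_timeout_s),
        _section(fetch_addresses(consumer_id), fanout_timeout_s),
    )
    return {"user": user, "payments": payments, "addresses": addresses}
```

Because `_section` converts every failure into a section-level result, `asyncio.gather` never sees an exception, and one slow dependency cannot fail or delay its sibling beyond the fan-out deadline.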

Part 3 — Reliability Strategies

Cover timeouts, retries, circuit breakers, fallbacks, and caching, and how they compose.
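
As one way to see how timeouts and retries compose under a shared time budget, the sketch below wraps a single downstream call. It assumes a hypothetical `TransientError` that the client raises for retryable failures (timeouts, `5xx`, connection resets); any other exception is treated as deterministic and is not retried. The attempt count and backoff figures are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx, resets)."""


def call_with_retry(
    call: Callable[[float], T],
    deadline: float,
    max_attempts: int = 2,
    base_backoff_s: float = 0.02,
    per_call_timeout_s: float = 0.15,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> T:
    """Invoke call(timeout_s) with bounded, jittered retries inside a deadline."""
    attempt = 0
    while True:
        attempt += 1
        remaining = deadline - now()
        if remaining <= 0:
            raise TimeoutError("time budget exhausted before attempt")
        try:
            # Never give one attempt more time than the overall budget has left.
            return call(min(per_call_timeout_s, remaining))
        except TransientError:
            if attempt >= max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            backoff = random.uniform(0, base_backoff_s * 2 ** (attempt - 1))
            if now() + backoff >= deadline:
                raise  # a retry would not fit in the remaining budget
            sleep(backoff)
```

The `deadline` argument is an absolute `time.monotonic()` value, so the same budget can be threaded through several calls without each one restarting the clock.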

Part 4 — Observability & Operational Considerations

Describe what you measure, alert on, and can tune at runtime.
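
If partial failures are returned as `200` with a degraded section, a per-section degradation rate has to be computed from the response bodies rather than from HTTP status codes. The sketch below shows the idea with an in-process counter; `SectionMetrics` is a hypothetical helper, and a real service would more likely emit these counts to its existing metrics system.

```python
from collections import Counter
from typing import Dict


class SectionMetrics:
    """Counts section outcomes so degradation is visible without any 5xx."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, response: Dict[str, Dict[str, str]]) -> None:
        # response maps section name -> {"status": "ok" | "stale" | ...}
        for section, payload in response.items():
            self._counts[(section, payload["status"])] += 1

    def degradation_rate(self, section: str) -> float:
        total = sum(n for (name, _), n in self._counts.items() if name == section)
        if total == 0:
            return 0.0
        ok = self._counts[(section, "ok")]
        return (total - ok) / total
```

An alert on `degradation_rate("payments")` crossing a threshold would then fire even while the endpoint keeps returning `200`.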

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Reasonable questions include:

  • How does each downstream affect what the screen can render? Can the response still be useful if one or two sections are missing, or does the client treat all three as mandatory? Which call, if any, blocks every other piece of the response?
  • Who calls this and how is it authenticated? Is user_id trusted from the query string, or must it be derived from the authenticated principal?
  • What is the latency SLO for first paint, and what is the per-call budget for each downstream?
  • Are partial responses acceptable to the client, or must all three sections be present atomically?
  • What is the staleness tolerance per section — can addresses / profile / payment methods be served from cache, and for how long?
  • What is the read volume / fan-out scale, and are there per-user rate limits to respect?
  • Are there correctness constraints on payments specifically (e.g. must we never display a removed or expired payment method)?

What a Strong Answer Covers

Signals an interviewer is looking for (these are dimensions to evaluate, not the answers themselves):

  • How the candidate reasons about each dependency's role in the response, and whether the design's structure follows from that reasoning rather than treating all three calls symmetrically.
  • How the API contract behaves under partial failure — whether a client can programmatically tell apart the distinct outcomes a section can have, and how the HTTP-level semantics are chosen and justified.
  • The quality of the orchestration — handling of the forced ordering imposed by the dependency chain, the concurrency model, and how total latency is bounded under a time budget.
  • The coherence of the reliability stack — timeouts, retries, circuit breakers, bulkheads, and fallbacks — and whether the candidate explains why each one protects the caller and how they compose.
  • How caching is governed — the policy for what may and may not enter the cache, how TTLs relate to data volatility, and the correctness risks the candidate anticipates.
  • Failure-mode reasoning — distinguishing transient from deterministic failures, distinguishing a genuine empty result from an error, and surfacing the security / correctness edge cases.
  • Observability and operational levers: what is measured given that failures may not appear as 5xx, and which knobs are tunable at runtime.
  • Explicit tradeoffs articulated for each lever.

Follow-up Questions

  • How does your design change at 100x read volume? What breaks first, and where do you add caching or capacity?
  • When a circuit breaker closes after an outage, what prevents a thundering herd from re-overwhelming the recovered dependency?
  • The GET /bootstrap?user_id=... signature lets a caller name any user_id. What is the authorization risk, and how do you close it?
  • For payment methods, is serving a cached (possibly stale) list ever worse than serving nothing? How do you decide?
  • How do you propagate the remaining time budget to downstream services so they can self-cancel work they can no longer deliver in time?
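
For the circuit-breaker follow-up, one common direction is a half-open state that admits only a limited number of probe requests before the breaker fully closes. The sketch below is a simplified, single-threaded illustration of that idea; the class, its parameters, and the threshold values are hypothetical, and a fuller answer might also ramp traffic up gradually after the first successful probe.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Minimal breaker with a probe-limited half-open state (not thread-safe)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        open_seconds: float = 10.0,
        half_open_max_probes: int = 1,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._open_seconds = open_seconds
        self._max_probes = half_open_max_probes
        self._now = now
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._probes_in_flight = 0

    def allow(self) -> bool:
        if self.state == "open":
            if self._now() - self._opened_at < self._open_seconds:
                return False  # fast-fail while the cool-down is running
            self.state = "half_open"
            self._probes_in_flight = 0
        if self.state == "half_open":
            if self._probes_in_flight >= self._max_probes:
                return False  # cap probes so recovery is not a stampede
            self._probes_in_flight += 1
        return True

    def on_success(self) -> None:
        self.state = "closed"
        self._failures = 0
        self._probes_in_flight = 0

    def on_failure(self) -> None:
        if self.state == "half_open":
            self._trip()  # a failed probe reopens immediately
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._now()
        self._failures = 0
        self._probes_in_flight = 0
```

Callers check `allow()` before each downstream call and report the outcome with `on_success()` or `on_failure()`; requests rejected by `allow()` go straight to the fallback path.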

Assume typical microservice constraints throughout, and state any further assumptions you make.
