An open-ended DoorDash software-engineer system-design screen: pick a real project and deep-dive its architecture end to end. Candidates draw the diagram, state invariants, walk read/write paths, justify storage and trade-offs, capacity-plan, cover reliability/security/observability with SLOs, recount a real incident, and propose a quantified 10x-traffic plan. Includes a full worked example built around a real-time delivery dispatch and live order-tracking service.
##### Question
Pick one of your recent, non-trivial projects and conduct a deep, end-to-end technical review. The interviewer will probe each layer, so be ready to whiteboard the system and defend your decisions. Address all of the following:
1. **Architecture diagram.** Draw the end-to-end architecture: components, data stores, and the interfaces (REST/gRPC/streaming/WebSocket) between them.
2. **Data flow and invariants.** Explain the major data flows and state the key correctness invariants the system must preserve (e.g. "exactly one active assignment per order").
3. **Read and write paths.** Walk through the concrete read path and write path step by step, including caching, idempotency, and the latency budget for each.
4. **Data and storage.** Describe the schema design and justify your storage choices (relational vs. NoSQL vs. cache vs. time-series/cold store), including partition/shard keys and indexes.
5. **Design decisions and trade-offs.** Justify your major technology choices and call out the trade-offs you accepted (consistency vs. availability, push vs. poll, write-through vs. write-behind, etc.).
6. **Scalability, bottlenecks, and capacity planning.** Give back-of-the-envelope numbers (QPS, message rates, storage), identify the bottlenecks, and explain how you'd capacity-plan for them.
7. **Consistency and reliability.** Explain your consistency model, delivery guarantees, and reliability strategies (retries, idempotency, sagas, circuit breakers, DR/RPO/RTO).
8. **Security and access controls.** Cover authN/authZ, service-to-service security, encryption in transit/at rest, PII handling, and compliance.
9. **Observability and SLAs/SLOs.** Describe your logs, metrics, and traces, and define concrete SLIs/SLOs and the alerts/runbooks that back them.
10. **A significant incident or trade-off you handled.** Describe a real production incident or hard trade-off decision, the root cause, and how you mitigated it.
11. **Two concrete improvements for a 10× traffic increase.** Propose two specific changes that would let the system absorb 10× traffic, and quantify why they work.
12. **What you would redesign today and why.** With hindsight, what would you change about the original design?
Use concrete numbers and real examples from your project; avoid hand-waving.
Pick one of your recent, non-trivial projects and conduct a deep, end-to-end technical review. This is an open-ended architecture deep-dive: there is no single reference design, so use a real system you built and own. The interviewer will probe each layer in turn, so be ready to whiteboard the system, quantify your claims, and defend your decisions under follow-up pressure.
Address all twelve parts below. Lead with requirements and the one or two invariants that make the problem hard before you draw boxes, drive the layers in order, and back every claim with concrete numbers and real examples from your project — avoid hand-waving.
Constraints & Assumptions
- Use a **real, non-trivial project** you personally architected or owned a major part of; depth and defensibility matter more than the system's fame.
- Assume the interviewer will go deep on **2–3 areas** of their choosing and expect concrete figures (QPS, latency percentiles, message rates, storage), not adjectives.
- You will whiteboard. Treat the diagram as a contract: every component, data store, and edge interface (REST / gRPC / streaming / WebSocket) should be labelled.
- "It was fast / it scaled" does not count as an answer; pair each qualitative claim with a measured or estimated number and the assumption behind it.
Clarifying Questions to Ask
These scope the whole deep-dive before you start drawing; ask them up front, then choose the project that best lets you answer them concretely:
- **Depth vs. breadth:** Do you want me to drive all twelve layers at a steady depth, or go shallow on most and let you pick 2–3 areas to interrogate deeply?
- **Which project:** Should I pick the system I owned end-to-end, or the one closest to your domain (e.g. high-throughput real-time, transactional, or batch/analytics)?
- **Level of the audience:** Are we whiteboarding for correctness and trade-offs, or do you also want production numbers (real measured p95/p99, actual incident postmortems)?
- **Scope of "I":** For a system built by a team, do you want only the parts I personally designed, or the full system with my ownership boundary called out?
Part 1 — Architecture diagram
Draw the end-to-end architecture: components, data stores, and the interfaces (REST / gRPC / streaming / WebSocket) between them.
Part 2 — Data flow and invariants
Explain the major data flows and state the key correctness invariants the system must preserve (e.g. "exactly one active assignment per order").
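One way to make an invariant like the example concrete is to show where it is enforced. The following is a minimal sketch, assuming a relational store that supports partial unique indexes (SQLite is used here only because it runs in memory). The table and column names (`assignments`, `order_id`, `courier_id`, `status`) are hypothetical, chosen for illustration. Note that the index guarantees *at most one* active assignment per order; the "at least one" half of the invariant would need separate application logic.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical assignments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assignments ("
        " id INTEGER PRIMARY KEY,"
        " order_id TEXT NOT NULL,"
        " courier_id TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    # Partial unique index: only rows with status = 'active' must be unique
    # per order_id, so historical (non-active) rows can accumulate freely.
    conn.execute(
        "CREATE UNIQUE INDEX one_active_per_order"
        " ON assignments(order_id) WHERE status = 'active'"
    )
    return conn


def assign(conn: sqlite3.Connection, order_id: str, courier_id: str) -> bool:
    """Try to create an active assignment; return False if one already exists."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO assignments(order_id, courier_id, status)"
                " VALUES (?, ?, 'active')",
                (order_id, courier_id),
            )
        return True
    except sqlite3.IntegrityError:
        # A concurrent or earlier writer already holds the active slot.
        return False


def release(conn: sqlite3.Connection, order_id: str) -> None:
    """Mark the current active assignment (if any) as cancelled."""
    with conn:
        conn.execute(
            "UPDATE assignments SET status = 'cancelled'"
            " WHERE order_id = ? AND status = 'active'",
            (order_id,),
        )
```

Pushing the check into the datastore means two racing writers cannot both succeed, which is the kind of argument an interviewer tends to ask for when probing an invariant.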
Part 3 — Read and write paths
Walk through the concrete read path and write path step by step, including caching, idempotency, and the latency budget for each.
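For the idempotency portion of a write path, a small sketch can help anchor the walkthrough. The example below assumes the client sends an idempotency key with each write and that the server remembers the result per key. `IdempotentWriter` and its in-memory dictionary are hypothetical stand-ins; a real system would need a durable store with an atomic check-and-set, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentWriter:
    """Wraps a write function so that retries with the same key apply once."""

    apply: Callable[[dict], dict]
    _seen: Dict[str, dict] = field(default_factory=dict)

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        # Replay: return the stored result instead of re-applying the write.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        result = self.apply(payload)
        # Record the result only after the write succeeded, so a failed
        # attempt can be retried under the same key.
        self._seen[idempotency_key] = result
        return result


# Usage: a hypothetical write that counts how many times it really ran.
applied = []


def create_order(payload: dict) -> dict:
    applied.append(payload)
    return {"order_number": len(applied), "item": payload["item"]}


writer = IdempotentWriter(apply=create_order)
first = writer.handle("key-123", {"item": "burrito"})
retry = writer.handle("key-123", {"item": "burrito"})  # client retry
```

Here the retry returns the same response as the first call and the underlying write runs only once.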
Part 4 — Data and storage
Describe the schema design and justify your storage choices (relational vs. NoSQL vs. cache vs. time-series / cold store), including partition / shard keys and indexes.
Part 5 — Design decisions and trade-offs
Justify your major technology choices and call out the trade-offs you accepted (consistency vs. availability, push vs. poll, write-through vs. write-behind, etc.).
Part 6 — Scalability, bottlenecks, and capacity planning
Give back-of-the-envelope numbers (QPS, message rates, storage), identify the bottlenecks, and explain how you'd capacity-plan for them.
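The back-of-the-envelope arithmetic can be written down as a small helper so that every number is traceable to an assumption. This is an illustrative sketch only: the function name, parameters, and the input figures in the usage example are placeholders, not measurements from any real system.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(
    daily_requests: int,
    peak_factor: float,
    bytes_per_record: int,
    retention_days: int,
) -> dict:
    """Derive average QPS, peak QPS, and retained storage from stated inputs."""
    avg_qps = daily_requests / SECONDS_PER_DAY
    # peak_factor is the assumed ratio of peak traffic to the daily average.
    peak_qps = avg_qps * peak_factor
    # Assumes one stored record per request, kept for the full retention window.
    storage_bytes = daily_requests * bytes_per_record * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "storage_gb": storage_bytes / 1e9,
    }


# Placeholder inputs: 8.64M requests/day, 3x peak, 1 KB records, 30-day retention.
estimate = capacity_estimate(
    daily_requests=8_640_000,
    peak_factor=3.0,
    bytes_per_record=1_000,
    retention_days=30,
)
```

With these placeholder inputs the estimate works out to 100 average QPS, 300 peak QPS, and about 259.2 GB of retained data (before replication or indexes). State the equivalent assumptions out loud when giving your own project's numbers.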
Part 7 — Consistency and reliability
Explain your consistency model, delivery guarantees, and reliability strategies (retries, idempotency, sagas, circuit breakers, DR / RPO / RTO).
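Of the reliability strategies listed, retries are the easiest to sketch. The example below is a minimal illustration of retrying with capped exponential backoff and full jitter. `TransientError`, the default delays, and the injectable `sleep` and `rng` parameters are assumptions made for the example. Retrying is only safe when the operation itself is idempotent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            # Exponential growth, capped so delays stay bounded.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the cap so that many
            # clients do not retry in lockstep.
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

Injecting `sleep` and `rng` keeps the function testable without real waiting; in production the defaults apply.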
Part 8 — Security and access controls
Cover authN / authZ, service-to-service security, encryption in transit / at rest, PII handling, and compliance.
Part 9 — Observability and SLAs/SLOs
Describe your logs, metrics, and traces, and define concrete SLIs / SLOs and the alerts / runbooks that back them.
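When defining SLOs, it helps to show the error-budget arithmetic that turns a target into an alerting threshold. The sketch below assumes a request-based availability SLI (good requests divided by total requests). The function names and the figures in the usage example are hypothetical, chosen only to show the calculation.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute how much of a request-based error budget has been consumed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of requests allowed to fail in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")  # no traffic means no budget to spend
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Translate an availability target into minutes of downtime per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


# Placeholder figures: a 99.9% target over 1,000,000 requests with 250 failures.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
downtime = allowed_downtime_minutes(slo_target=0.999, window_days=30)
```

With these placeholder figures the budget is about 1,000 failed requests, of which roughly a quarter is consumed, and a 99.9% target over 30 days allows about 43.2 minutes of downtime.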
Part 10 — A significant incident or trade-off you handled
Describe a real production incident or hard trade-off decision, the root cause, and how you mitigated it.
Part 11 — Two concrete improvements for a 10× traffic increase
Propose two specific changes that would let the system absorb 10× traffic, and quantify why they work.
Part 12 — What you would redesign today and why
With hindsight, what would you change about the original design? Justify why the change is worth it.
Follow-up Questions
Be ready for the interviewer to push past your main answer with probes like:
"Walk me through exactly what happens to an in-flight request when your hottest datastore shard fails — what does the client see, and how does the system recover?"
"Your invariant holds in the happy path. Show me the precise interleaving where two writers could violate it, and where your design stops it."
"At 100× rather than 10×, which of your two improvements stops working first, and what breaks next?"
"If you had to drop one of consistency, availability, or latency under a regional outage, which goes and why?"