##### Question
Walk me through your most impactful (or proudest) recent project. Be ready to go deep on the technical details and the decisions you made.
1. What problem were you solving, and why did it matter?
2. What was your specific role and your individual contributions?
3. What measurable impact did it achieve, and how did you measure success?
4. Deep-dive into one or two key technical decisions and the trade-offs you weighed (design, algorithms, data, infrastructure).
5. How did you address scaling and reliability (SLOs, failure modes, safe rollout)?
6. Describe a risk or failure you encountered and how you responded.
7. If you could redo it today, what would you change and why?
8. If there are details you don't recall, how would you quickly rediscover the context and validate your memory before making decisions?
Quick Answer: A DoorDash software-engineering onsite behavioral question asking you to walk through your most impactful recent project: the problem, your specific role, measurable impact, a key technical decision and its trade-offs, how you handled scaling and reliability, a risk or failure you hit, and what you'd do differently. Includes a structured STAR-plus-deep-dive framework and a worked real-time-dispatch example.
##### Solution
This is an open-ended behavioral and technical deep-dive. The interviewer is looking for end-to-end ownership, genuine technical depth, data-driven impact, operational maturity (reliability/scaling), and honest reflection. Pick a project with clear business impact and measurable technical outcomes, then drive the conversation through a structured arc.
# How to structure your answer
Use STAR extended with a technical deep-dive, a risk/failure beat, reflection, and a recall plan:
- **Situation:** 1-2 sentences of context and why it mattered.
- **Task:** Your objective, responsibilities, and constraints (latency/cost/reliability targets, timeline).
- **Actions:** Your specific contributions and key technical decisions (design, algorithms, data, infra).
- **Results:** Quantified outcomes tied to user/business value, and how you validated them.
- **Technical deep-dive:** One non-trivial decision, the options you considered, and the trade-offs.
- **Scaling & reliability:** SLOs, failure modes, and how you rolled out safely.
- **Risk/failure:** What went wrong, how you detected and responded, and what you changed to prevent it.
- **Reflection:** What you'd do differently if you started today.
- **Recall plan:** How you'd rapidly rediscover context if your memory is fuzzy on a detail.
For each beat, tie every decision to an explicit trade-off and quantify impact wherever you can.
---
# Example answer (real-time delivery marketplace / logistics)
This example fits DoorDash's domain: a high-scale, on-demand delivery marketplace where the core engineering challenge is matching couriers (Dashers) to orders quickly and reliably under volatile, peak-driven demand.
## Situation
Our dispatch pipeline matches supply (couriers) to demand (orders) in real time. P95 time-to-first-offer had crept from ~2.2s to ~2.8s during dinner peaks, correlating with lower order conversion and higher cancellations. The existing assignment logic was greedy, polled the database every ~500ms for supply locations, and didn't batch orders, causing stale reads, DB contention, and suboptimal matches.
## Task
As tech lead of a small team (2 backend engineers and 1 data scientist), I owned the design and delivery of a low-latency, reliable dispatch service that would:
- Recompute assignments continuously with a P95 < 800ms and P99 < 1.5s during peaks.
- Optimize jointly for customer ETA, courier utilization, and food freshness.
- Stay within +5% infra cost, produce zero double-assignment incidents, and roll out within one quarter.
## Actions (my specific contributions)
- Designed and led an event-driven ingestion pipeline: orders, courier pings, and store-readiness events emitted to Kafka, replacing DB polling.
- Introduced an in-memory geospatial index (H3 cells) backed by Redis for fast candidate retrieval, with a durable stream for replay.
- Split the monolithic matching endpoint into a stateless Go + gRPC service with concurrency control, idempotency keys, and backpressure-aware bounded queues; auto-scaled on Kubernetes (HPA on CPU + consumer lag).
- Partnered with data science to define success metrics, an A/B experiment, and shadow traffic for pre-rollout validation; added feature flags with automated rollback and guardrails.
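To make the "idempotency keys and backpressure-aware bounded queues" point concrete, here is a minimal sketch of the intake logic. It is written in Python for brevity rather than Go, and `BoundedIntake`, `max_pending`, and the status strings are hypothetical names chosen for illustration, not the service's actual API.

```python
import queue


class BoundedIntake:
    """Toy request intake: dedupes by idempotency key and sheds load when full."""

    def __init__(self, max_pending=2):
        # Bounded queue: once full, new work is rejected instead of piling up.
        self._pending = queue.Queue(maxsize=max_pending)
        # Simplification: a real service would expire old keys instead of
        # keeping them forever.
        self._seen_keys = set()

    def submit(self, idempotency_key, request):
        if idempotency_key in self._seen_keys:
            return "duplicate"  # retried request: do not enqueue twice
        try:
            self._pending.put_nowait((idempotency_key, request))
        except queue.Full:
            return "rejected"  # backpressure signal: caller should back off
        self._seen_keys.add(idempotency_key)
        return "accepted"
```

Rejecting at the edge keeps queueing delay bounded, which is what protects tail latency during a burst; the rejected key is not recorded, so a later retry can still be accepted.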
### Core matching algorithm
We model each compute cycle as a min-cost assignment/flow problem over the top-K most urgent orders:
- Decision variable x_ij = 1 if courier i takes order j.
- Objective: minimize sum(x_ij * cost_ij) subject to each order assigned at most once and courier capacity respected.
- cost_ij = w1 * ETA_ij + w2 * detour_ij + w3 * freshness_penalty_j + w4 * fairness(i).
- Solver: network simplex within a 50ms time budget; on timeout, fall back to a greedy heuristic with lookahead to bound tail latency.
Small numeric example (illustrative): 2 couriers (C1, C2), 3 orders. Costs (lower is better): C1 -> O1=4, O2=8, O3=6; C2 -> O1=5, O2=3, O3=7. With capacity 1 per courier and the two most urgent orders prioritized, the optimal assignment is C1->O1 (4) and C2->O2 (3), total cost 7. If the solver times out, greedy-by-lowest-cost yields the same answer here; batching ensures we pick the top-K urgent orders per cycle.
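The numeric example above can be checked with a short sketch. This is not the production solver (the text names network simplex); it brute-forces the tiny instance and compares it with the greedy fallback. The function names are illustrative, and the choice of O1 and O2 as the two most urgent orders is an assumption taken from the stated result.

```python
from itertools import permutations

# Costs from the numeric example (lower is better).
COSTS = {
    ("C1", "O1"): 4, ("C1", "O2"): 8, ("C1", "O3"): 6,
    ("C2", "O1"): 5, ("C2", "O2"): 3, ("C2", "O3"): 7,
}


def optimal_assignment(couriers, orders, costs):
    """Exhaustive min-cost one-to-one assignment; only viable for tiny inputs.

    Assumes there are at least as many orders as couriers.
    """
    best_total, best_pairs = None, None
    # Try every way of handing distinct orders to the couriers.
    for perm in permutations(orders, len(couriers)):
        pairs = list(zip(couriers, perm))
        total = sum(costs[p] for p in pairs)
        if best_total is None or total < best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


def greedy_assignment(couriers, orders, costs):
    """Timeout fallback: repeatedly take the cheapest remaining pair."""
    free_couriers, open_orders = set(couriers), set(orders)
    pairs = []
    for (courier, order), _ in sorted(costs.items(), key=lambda kv: kv[1]):
        if courier in free_couriers and order in open_orders:
            pairs.append((courier, order))
            free_couriers.discard(courier)
            open_orders.discard(order)
    return sum(costs[p] for p in pairs), sorted(pairs)


# Batch = the two most urgent orders (assumed here to be O1 and O2).
total, pairs = optimal_assignment(["C1", "C2"], ["O1", "O2"], COSTS)
greedy_total, greedy_pairs = greedy_assignment(["C1", "C2"], ["O1", "O2"], COSTS)
```

Both approaches return C1->O1 and C2->O2 at total cost 7 on this instance. Greedy is not optimal in general, which is why it serves only as the bounded-latency fallback.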
## Measuring success
- Primary: P95/P99 assignment latency; order conversion rate; cancellation rate; double-offer/timeout incident rate.
- Secondary: infra cost per 1k orders; match-quality proxy (average courier-to-store distance); courier idle time.
- Validation: 2 weeks of shadow-mode comparison, then a city-level switchback A/B (alternating treatment/control in 30-minute windows) to mitigate marketplace interference, with CUPED variance reduction and pre-registered stopping rules. Guardrails (cancellation rate, app-crash rate, courier churn) must show no regression.
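As a rough illustration of the CUPED step mentioned above: each unit's metric is adjusted by a pre-experiment covariate, which leaves the mean unchanged and shrinks the variance when the two are correlated. The sketch assumes one metric value and one pre-period covariate per switchback window; the function names and sample numbers are made up for illustration.

```python
def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    n = len(metric)
    mean_x = sum(covariate) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    # theta = cov(x, y) / var(x); fall back to no adjustment if x is constant.
    theta = cov_xy / var_x if var_x else 0.0
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Illustrative data: pre-period conversion (covariate) vs. in-experiment conversion.
pre_period = [10.0, 12.0, 9.0, 14.0, 11.0]
in_experiment = [11.0, 13.0, 9.0, 15.0, 12.0]
adjusted = cuped_adjust(in_experiment, pre_period)
```

In an interview, the point to land is why this matters: lower variance means the same effect can be detected with fewer switchback windows.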
## Technical deep-dive: event-driven updates vs. polling
- **Problem:** Polling supply locations every ~500ms caused poll skew, DB contention, and stale reads at peak.
- **Options:** (1) faster polling against read replicas; (2) change data capture (CDC) to a message bus feeding an in-memory index; (3) client-side streaming over bidirectional gRPC.
- **Decision:** CDC to Kafka + a Redis-backed H3 index; the matcher reads a local cache and falls back to the stream on a miss.
- **Trade-offs:** Streaming cuts staleness vs. polling; a durable log enables replay but requires idempotency and compaction; extra broker/memory cost is offset by ~40% lower DB read IO; added complexity in schema versioning, backpressure, and consumer-lag monitoring.
- **Key design details:** Idempotency via (entity_id, sequence_number) to discard out-of-order updates; TTLs + 5s heartbeats on availability keys to prevent ghost supply; a circuit breaker around Redis with fallback to a coarser H3 resolution.
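The sequence-number and TTL rules in the key design details can be sketched as follows. This is an in-memory stand-in for the Redis-backed index, not the real implementation; `SupplyIndex`, the cell labels, and the 15-second TTL are hypothetical (the text specifies only the 5s heartbeat interval).

```python
class SupplyIndex:
    """Toy in-memory stand-in for the Redis-backed availability index."""

    def __init__(self, ttl_seconds=15.0):
        # Hypothetical TTL: a few missed 5s heartbeats before a courier expires.
        self.ttl_seconds = ttl_seconds
        self._state = {}  # entity_id -> (sequence_number, cell, last_seen)

    def apply(self, entity_id, sequence_number, cell, now):
        """Apply an update unless it is a duplicate or older than what we hold."""
        current = self._state.get(entity_id)
        if current is not None and sequence_number <= current[0]:
            return False  # stale or replayed event: drop it
        self._state[entity_id] = (sequence_number, cell, now)
        return True

    def available(self, now):
        """Couriers seen within the TTL; expired entries are ghost supply."""
        return {
            entity_id: cell
            for entity_id, (_, cell, seen) in self._state.items()
            if now - seen <= self.ttl_seconds
        }
```

Because `apply` ignores any sequence number at or below the stored one, replaying a partition after an incident is safe: re-delivered events become no-ops.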
## Scaling and reliability
- **SLOs:** P95 of dispatch computations < 800ms and P99 < 1.5s; 99.9% availability during peaks; defined an error budget tied to rollout policy.
- **Backpressure:** Consumer-group max-lag alarms; when lag exceeds a threshold, degrade gracefully (broader-radius matching / coarser resolution) and emit alerts.
- **Failure modes & mitigation:** Kafka partition outage -> rebalance + replay from last committed offset; hotspot markets -> split shards at a finer H3 resolution with per-shard concurrency caps; map-provider degradation -> serve cached ETAs (60s TTL by road segment) with circuit breakers and stale-while-revalidate.
- **Safe deploys:** Canary in ~5-25% of markets, switchback experiments to limit network interference, and automatic rollback on SLO breach.
- **Observability:** Per-market dashboards (P50/P95/P99 latency, error-budget burn, match quality), nightly synthetic load tests, and chaos drills that simulate reorder/delay.
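The graceful-degradation rule from the backpressure bullet can be expressed as a small policy function. All thresholds and values here are hypothetical placeholders (the lag threshold, H3 resolutions 9 and 8, and the radii); the only fixed fact is that a lower H3 resolution number means coarser cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchPolicy:
    h3_resolution: int       # lower number = coarser cells
    search_radius_km: float
    alert: bool


def policy_for_lag(consumer_lag, lag_threshold=5000):
    """Pick matching parameters from the current consumer-group lag."""
    if consumer_lag > lag_threshold:
        # Degraded mode: cheaper, coarser lookups over a wider area, plus an alert.
        return MatchPolicy(h3_resolution=8, search_radius_km=5.0, alert=True)
    return MatchPolicy(h3_resolution=9, search_radius_km=3.0, alert=False)
```

Keeping the policy a pure function of observed lag makes the degraded path easy to unit-test and to exercise in chaos drills.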
## Risk/failure encountered
During a 25% canary we saw oscillation in candidate sets. Root cause: two consumers with misaligned partitions produced out-of-order sequences for ~0.7% of updates, creating stale positions and extra rejections. We detected it via a P99 SLO alert plus a spike in dedupe drops, and the experiment guardrail flagged a +0.3pp cancellation increase. Response: instant rollback via feature flag, then replay of affected partitions after fixing consumer-group assignment. Prevention: sticky partitioning by entity_id, contract tests for schema and ordering, and a chaos test that injects reorder/delay.
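The "sticky partitioning by entity_id" fix amounts to choosing the partition as a deterministic function of the key, so every update for one courier or order lands on the same partition and is consumed in order. A minimal sketch, where `partition_for` is a hypothetical helper and CRC32 is a stand-in for whatever hash the producer's partitioner actually uses:

```python
import zlib


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Stable partition choice: the same entity always maps to the same partition."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions


# Every update for courier-42 goes to one partition, preserving per-entity order.
partitions = {partition_for("courier-42", 12) for _ in range(5)}
```

Note that the mapping changes if `num_partitions` changes, so a partition-count increase needs its own migration plan.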
## Results
- P95 assignment latency: 2.2s -> 650ms; P99: ~3.6s -> 1.2s.
- Assignment-service P99 (full pipeline): ~800ms -> ~160ms.
- On-time delivery rate: +4.5pp (90.2% -> 94.7%); average delivery time -7.8% at peak (P95 -10.4%).
- Order conversion: +0.8pp (statistically significant, power >= 0.8); cancellations -12% relative.
- Courier idle time: -12.3%; peak throughput +8.6% orders served without added couriers.
- Infra cost: +3% (within budget); DB read IO -40%; matcher on-call pages -60%; zero double-assignments over 30 days post-GA.
## What I would do differently
- Run load/soak tests with realistic burst patterns earlier to catch the consumer-partitioning issue before canary.
- Define SLOs and error budgets before coding and tie rollouts to error-budget policy from day one.
- Adopt a managed feature store / schema registry and versioned protobufs upfront to eliminate offline/online skew and reduce cross-team coordination.
- Ship a simpler v1 (tuned greedy + batching) behind a flag and introduce min-cost flow incrementally to capture most of the value faster.
## If I don't recall a detail: rapid recall and validation plan
- **Artifacts first:** Search design docs/RFCs and PR descriptions for the decision and its trade-offs; pull dashboards (latency percentiles, consumer lag, error rates) and experiment summaries; review incident tickets and postmortems for timelines and numbers.
- **Cross-verify:** Re-run saved queries/notebooks to reproduce the A/B deltas with current data; use git blame and release tags to map changes to rollout dates.
- **People & alignment:** Confirm figures with the data scientist and SRE owner, sanity-check with the PM.
- **Guardrails before acting:** If memory and artifacts diverge, trust the dashboards and experiment logs, and run a small canary or shadow test before any decision that depends on the historical behavior.
---
# Why this works in an interview
It demonstrates ownership (scope, timeline), technical depth (architecture, algorithm, data model), data rigor (experiment design, guardrails, significance), operational maturity (SLOs, failure modes, safe rollout), and honest reflection. Every decision is tied to an explicit trade-off and a quantified result.
**Tip:** Swap in your own domain and project. Keep one or two "diagrams in words," include a small numeric example, quantify both user value (conversion, cancellations) and system health (P95/P99, error rate, cost), and land on results plus learnings. Be candid about one real failure and what changed in your practice because of it.
##### Explanation
Rubric: this is an open-ended behavioral + technical deep-dive scored on ownership/scope, technical depth and correctness of trade-offs, data-driven impact (measurable, validated results), operational maturity (scaling, reliability, safe rollout, failure handling), and honest self-reflection. Strong answers also show structured communication (STAR + deep-dive) and the ability to reconstruct/validate context rather than asserting unverified details. The example uses DoorDash's real domain (a high-scale, on-demand delivery marketplace) so the technical choices are plausible for the role.