Describe your most impactful recent project
Company: DoorDash
Role: Software Engineer
Category: Behavioral & Leadership
Difficulty: medium
Interview Round: Onsite
Walk me through your most impactful recent project. What problem were you solving, what was your role and specific contributions, and how did you measure success? Deep-dive into one technical decision you made, a risk or failure you encountered, and what you would do differently now. If there are details you don’t recall, describe how you would quickly rediscover the context and validate your memory before making decisions.
Quick Answer: This question evaluates leadership, ownership, communication, and technical decision-making by asking for a structured account of a candidate's most impactful recent project.
Solution
# How to Structure Your Answer
Use STAR+TDRR, i.e., STAR extended with a Technical deep-dive, Risk/failure, Reflection, and Recall plan:
- Situation: Context and impact surface area.
- Task: Your objective and constraints.
- Actions: Your contributions and key technical decisions.
- Results: Measurable outcomes tied to business/user value.
- Technical deep-dive: One nontrivial decision, with options and trade-offs.
- Risk/Failure: What went wrong, response, and prevention.
- Reflection: What you'd do differently.
- Recall Plan: How you’d quickly rediscover context if memory is fuzzy.
---
# Example Answer (Software Engineering in a real-time marketplace/logistics context)
## Situation
We observed rising assignment latency in the dispatch pipeline that matches supply to demand in real time. P95 time-to-first-offer grew from ~2.2s to ~2.8s during peak, correlating with a drop in order conversion and higher cancellations. This impacted fulfillment speed and user satisfaction.
## Task
Reduce assignment latency and stabilize tail performance without sacrificing matching quality or causing regressions in fairness or cost. Nonfunctional constraints:
- P95 latency target: < 800 ms; P99 < 1.5 s.
- No more than +5% infra cost.
- Zero double-assignment incidents.
- Rollout within one quarter.
## Actions (Your Specific Contributions)
- Led the design and implementation of a new event-driven location and availability ingestion pipeline.
- Introduced an in-memory geospatial index powered by H3 cells, backed by Redis for fast candidate retrieval and a durable stream for replay.
- Split the monolithic matching endpoint into a stateless gRPC service with concurrency control, idempotency keys, and backpressure-aware queues.
- Added feature flags and a progressive rollout plan with automated rollback and guardrails (see the guardrail sketch after this list).
- Partnered with data science to define success metrics and an A/B experiment, plus shadow traffic for pre-rollout validation.
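To make the rollout guardrails concrete, here is a minimal sketch of a staged-rollout check with automated rollback. The flag client and metric reader are hypothetical stand-ins, not a specific feature-flag product's API; the thresholds echo the constraints above.

```python
# Minimal sketch: guardrail evaluation driving a progressive rollout.
# flag_client and metrics are hypothetical interfaces, not a real vendor API.
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    threshold: float  # breach when the observed value exceeds this

    def breached(self, observed: float) -> bool:
        return observed > self.threshold


GUARDRAILS = [
    Guardrail("p99_latency_ms", 1500.0),           # P99 < 1.5 s constraint
    Guardrail("cancellation_rate_delta_pp", 0.2),  # cancellation increase vs. control
]


def advance_or_rollback(flag_client, metrics, flag_name: str, next_stage_pct: int) -> None:
    """Move the flag to the next rollout percentage, or roll back to 0% on any breach."""
    for g in GUARDRAILS:
        if g.breached(metrics.read(g.name)):          # metrics.read(...) is assumed
            flag_client.set_percentage(flag_name, 0)  # instant rollback
            return
    flag_client.set_percentage(flag_name, next_stage_pct)
```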
## Measuring Success
- Primary metrics:
- P95 and P99 assignment latency (ms).
- Order conversion rate and cancellation rate.
- Incident rate: double-offers, timeouts.
- Secondary:
- Infra cost per 1k orders.
- Matching quality proxy (e.g., average distance).
- Validation:
- Shadow mode comparisons for 2 weeks.
- A/B test with power analysis and CUPED variance reduction.
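CUPED often needs a one-line explanation in the interview: each unit's experiment metric is adjusted using a pre-experiment covariate, which lowers variance without changing the mean. A minimal numpy sketch, assuming a per-unit pre-period value of the same metric is available:

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: y_adj = y - theta * (x - mean(x)), theta = cov(x, y) / var(x).

    y: in-experiment metric per unit; x: the same metric measured pre-experiment.
    The adjusted metric keeps the same mean but has lower variance, so the A/B
    test needs less traffic (or time) to reach the same power.
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())
```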
Example metric definitions:
- Latency percentiles: P95 is the 95th percentile of the assignment latency distribution.
- Conversion rate uplift: ΔCR = CR_variant − CR_control.
- Confidence: two-sided test; for binary metrics, use a z-test or a sequential test with pre-registered stopping rules.
Small numeric example:
- Baseline P95 = 2200 ms; target < 800 ms.
- After rollout, P95 = 650 ms (−70%), P99 = 1.2 s.
- Conversion +0.8 percentage points (e.g., 25.0% → 25.8%), p < 0.05.
- Cancellations −12% relative.
- Infra cost +3% (within budget).
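To ground the significance claim, the conversion uplift can be checked with a two-proportion z-test. A minimal sketch; the sample sizes below are hypothetical and chosen only to illustrate the 25.0% → 25.8% example:

```python
import math

from scipy.stats import norm


def two_proportion_ztest(conv_c: int, n_c: int, conv_v: int, n_v: int):
    """Two-sided z-test for a difference in conversion rates, using a pooled standard error."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p_v - p_c, z, p_value


# Hypothetical sample sizes; only the 25.0% vs 25.8% rates come from the example above.
delta, z, p = two_proportion_ztest(conv_c=25_000, n_c=100_000, conv_v=25_800, n_v=100_000)
print(f"uplift = {delta:+.3%}, z = {z:.2f}, p = {p:.4f}")  # uplift = +0.800%, p well below 0.05
```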
## Technical Deep Dive: Event-Driven Location Updates vs. Polling
- Problem: The old system polled a database of supply locations every 500 ms. Poll skew and DB contention caused stale reads and spikes at peak.
- Options considered:
1) Faster polling with read replicas.
2) Change data capture (CDC) to a message bus with consumers updating an in-memory index.
3) Client-side streaming to the matcher over bidirectional gRPC.
- Decision: CDC to Kafka + Redis-backed H3 index for near-real-time updates; matcher consumes a local cache via Redis and falls back to the stream for misses.
- Trade-offs:
- Latency: Streams reduce staleness versus polling.
- Reliability: Durable log enables replay on failure; requires idempotency and compaction.
- Cost: Extra infra (brokers, memory), but lower DB load.
- Complexity: Needs schema versioning, backpressure handling, and consumer lag monitoring.
- Key design details:
- Idempotency via (entity_id, sequence_number) to discard out-of-order updates.
- Backpressure: bounded queues; when lag > threshold, degrade to broader-radius matching and emit alerts.
- TTLs for availability keys to prevent ghost supply; heartbeats every 5 s.
- Circuit breaker around Redis; fallback to a coarser H3 resolution if cache is warm but partial.
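A minimal sketch of the consumer-side pieces above: idempotent application keyed by (entity_id, sequence_number), H3-cell membership sets in Redis with TTL-expiring availability keys, and candidate retrieval over a ring of neighboring cells. It assumes the h3 v4 Python API and redis-py; the key layout, schema, resolution, and TTL values are illustrative, not the production design.

```python
import json

import h3
import redis

R = redis.Redis()
H3_RES = 8               # working resolution; a coarser one can serve as the degraded fallback
AVAILABILITY_TTL_S = 15  # longer than the 5 s heartbeat so missed beats expire the key


def apply_update(update: dict) -> None:
    """Apply one CDC event only if it is strictly newer than what was already seen."""
    entity_id, seq = update["entity_id"], update["sequence_number"]
    last_seen = R.get(f"seq:{entity_id}")
    if last_seen is not None and int(last_seen) >= seq:
        return  # late or duplicate delivery: drop it (and count it as a dedupe drop)
    cell = h3.latlng_to_cell(update["lat"], update["lng"], H3_RES)
    old_cell = R.get(f"cell:{entity_id}")
    pipe = R.pipeline()
    pipe.set(f"seq:{entity_id}", seq)
    if old_cell is not None and old_cell.decode() != cell:
        pipe.srem(f"supply:{old_cell.decode()}", entity_id)  # moved to a new cell
    pipe.set(f"cell:{entity_id}", cell)
    pipe.sadd(f"supply:{cell}", entity_id)
    # Availability key with TTL: if heartbeats stop, the entity drops out of matching (no ghost supply).
    pipe.set(f"avail:{entity_id}", json.dumps(update), ex=AVAILABILITY_TTL_S)
    pipe.execute()


def candidates_near(lat: float, lng: float, k: int = 1) -> set[str]:
    """Candidate retrieval: union of supply sets over the origin cell and k rings of neighbors."""
    origin = h3.latlng_to_cell(lat, lng, H3_RES)
    ids: set[str] = set()
    for cell in h3.grid_disk(origin, k):
        ids |= {m.decode() for m in R.smembers(f"supply:{cell}")}
    # Keep only entities whose availability key has not expired.
    return {eid for eid in ids if R.exists(f"avail:{eid}")}
```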
## Risk/Failure Encountered
- Incident: During a 25% canary, we saw oscillations in candidate sets. Root cause: two consumers with misaligned partitions caused out-of-order sequences for ~0.7% of updates; this created stale positions and increased rejections.
- Detection: SLO alert on P99 latency and a spike in dedupe drops. Experiment guardrail flagged a +0.3 pp cancellation increase.
- Response: Instant rollback via feature flag; replayed affected partitions after fixing the consumer group assignment; added sequence checks and metrics for late arrivals.
- Prevention: Enforced sticky partitioning by entity_id, added contract tests for schema and ordering, and a chaos test to simulate reorder/delay.
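The sticky-partitioning fix boils down to keying every CDC event by entity_id, so Kafka routes all updates for one entity to the same partition and a single consumer sees them in sequence order. A minimal sketch assuming confluent-kafka; the topic name and payload shape are placeholders:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def publish_location_update(update: dict) -> None:
    # Keying by entity_id means the default partitioner sends every update for
    # this entity to one partition, preserving per-entity ordering end to end.
    producer.produce(
        "supply-location-updates",  # placeholder topic name
        key=update["entity_id"].encode(),
        value=json.dumps(update).encode(),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```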
## Results
- P95 assignment latency: 2.2 s → 650 ms.
- P99: 3.6 s → 1.2 s.
- Order conversion: +0.8 pp (statistically significant; the experiment was powered at ≥ 0.8).
- Cancellations: −12% relative.
- Infra cost: +3%; DB read IO −40%.
- No double-assignments over 30 days post-GA; on-call pages for matcher dropped by 60%.
## What I Would Do Differently
- Run load and soak tests earlier with realistic burst patterns to catch consumer partitioning issues.
- Define SLO/error budgets before coding; tie rollouts to error-budget policy.
- Add synthetic traffic generators for location updates to validate dedupe and ordering at scale.
- Invest in a schema registry and versioned protobufs upfront to reduce cross-team coordination cost.
## If I Don’t Recall a Detail: Rapid Recall and Validation Plan
- Artifacts first:
- Search design docs/RFCs and PR descriptions for decisions and trade-offs.
- Check dashboards (latency percentiles, consumer lag, error rates) and experiment summaries.
- Review incident tickets and postmortems for timelines and metrics.
- Cross-verify:
- Re-run saved queries/notebooks; reproduce A/B deltas with current data.
- Compare code paths via git blame and tags that correspond to rollout dates.
- People and alignment:
- Confirm numbers with the data scientist and SRE owner; sanity-check with PM.
- Guardrails before action:
- If memory and artifacts diverge, trust the dashboards and experiment logs; add a small canary or shadow test before making decisions that depend on the historical behavior.
---
# Tips to Adapt This to Your Own Story
- Pick a project with clear business impact and measurable technical outcomes.
- Quantify both user value (conversion, cancellations, NPS) and system health (P95/P99, error rate, cost).
- Show end-to-end ownership: design → rollout → measurement → iteration.
- Make the deep dive truly technical: data model, consistency, concurrency, or systems trade-offs.
- Be candid about a failure and what changed in your practice as a result.