Discuss Projects and Planning Improvements
Company: DoorDash
Role: Software Engineer
Category: Behavioral & Leadership
Difficulty: medium
Interview Round: Technical Screen
Walk me through a past project you led end-to-end. Explain your current product’s business context, target users, and key challenges. Describe a cross‑quarter initiative you worked on—its goals, execution, results, and lessons learned. In hindsight, what could you have done better during the planning and design phases, and how would you change your approach next time?
Quick Answer: This question evaluates ownership, product understanding, end-to-end project leadership, cross-functional collaboration, and the ability to quantify impact within behavioral and leadership competencies for software engineers.
Solution
# How to Answer (Structure + Example)
A strong response uses a concise narrative with clear metrics and lessons. Use STAR+L: Situation (business context), Task (goals), Action (design/execution), Result (measured impact), Learnings (retro and next steps).
## 1) Business Context, Target Users, Key Challenges (Example)
- Business context: Two-sided, on-demand delivery marketplace. Orders come from consumers; merchants prepare items; couriers fulfill deliveries. The platform promises an ETA at checkout and manages dispatching.
- Target users: Consumers (reliable ETAs, on-time delivery), merchants (predictable prep time, throughput), couriers (fair, efficient assignments), internal ops (support, marketplace health).
- Key challenges:
  - Spiky demand (lunch/dinner peaks), long-tail geographies, weather/traffic variability.
  - Balancing three-sided incentives (on-time rate vs courier utilization vs merchant load).
  - Low-latency, high-availability services; correctness under partial data.
  - Training/serving skew for models; data drift from seasonality and promotions.
## 2) Past End-to-End Project (Model Answer)
Project: Real-time ETA and Dispatch Reliability
- Situation: Checkout ETA accuracy was inconsistent during peaks, driving cancellations and support tickets. Baseline p90 ETA error was ~9.5 minutes, late deliveries 18% at peak.
- Task (goals/OKRs):
  - Reduce p90 ETA error by 30%.
  - Reduce late deliveries by 20%.
  - Maintain or improve courier utilization (>= 72%).
  - Keep p95 service latency under 120 ms, 99.95% availability.
- Actions (design and trade-offs):
  1) Data & instrumentation
     - Standardized event schema for order creation, merchant ACK, courier accept/arrive/depart; introduced idempotent, typed events.
     - Built a streaming feature pipeline (e.g., Kafka/Kinesis + Flink/Spark) for real-time features: current traffic, merchant prep time estimates, courier proximity, weather.
     - Trade-off: Streaming adds infra cost and operational load; mitigated by sampling and tiered storage.
  2) ETA model & fallbacks
     - Gradient-boosted trees with monotonic constraints for stability; WAPE/MAE as primary metrics.
     - Backfilled historical features via an offline store; online feature store for serving.
     - Deterministic heuristic fallback for cold-start merchants/areas.
     - Trade-off: GBDT chosen over deep models for interpretability and faster iteration.
  3) Dispatch coordination
     - Adjusted dispatch to respect predicted prep time and courier travel time; added a batching heuristic for nearby orders.
     - Safety constraints: cap walking distance, max wait, courier fairness score.
  4) Service architecture
     - Dedicated ETA microservice with circuit breakers and cached predictions.
     - Canary + shadow mode; feature flags per region; rollback plan.
  5) Experimentation & guardrails
     - A/B test with guardrails: cancellations, support contacts, on-time rate, utilization, p95 latency, error budget.
     - Kill-switch if cancellations rise +1.5% over control for 2 hours or p95 latency exceeds 150 ms.
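The kill-switch guardrail can be sketched as a plain threshold check. The field names and rollup logic here are illustrative assumptions; a real system would read rolling metrics from a metrics store:

```python
from dataclasses import dataclass

@dataclass
class GuardrailSnapshot:
    """Rolling experiment metrics, treatment vs control (illustrative fields)."""
    cancel_delta_pct: float   # cancellation rate, treatment minus control, in pct points
    breach_duration_hr: float # how long the cancellation delta has persisted
    p95_latency_ms: float     # current p95 service latency

def should_kill_switch(s: GuardrailSnapshot,
                       max_cancel_delta: float = 1.5,
                       max_breach_hr: float = 2.0,
                       max_p95_ms: float = 150.0) -> bool:
    """Trip the kill switch if cancellations are +1.5% over control for
    2 hours, or p95 latency exceeds 150 ms (thresholds from the guardrails above)."""
    sustained_cancel_breach = (s.cancel_delta_pct >= max_cancel_delta
                               and s.breach_duration_hr >= max_breach_hr)
    latency_breach = s.p95_latency_ms > max_p95_ms
    return sustained_cancel_breach or latency_breach
```

Making the breach condition require sustained duration avoids tripping on a single noisy interval, while the latency breach fires immediately because it protects an SLO.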
- Results:
  - p90 ETA error improved 34% (9.5 → 6.3 min).
  - Late deliveries reduced 22% at peak (18% → 14%).
  - Consumer cancellations down 8.1%; checkout conversion up 1.2%.
  - Courier utilization held at 73% (no negative impact); p95 service latency 104 ms; availability 99.97%.
  - Net revenue +1.1% in treated regions; support tickets per order −9.4%.
- Lessons learned:
  - Training/serving skew matters: small feature lag caused large peak-time errors; fixed with better watermarking and feature freshness monitoring.
  - Cold-start merchants need explicit heuristics; adding category-based priors cut early errors.
  - Align with ops early: small dispatch rule changes can affect courier behavior; co-design policies with ops.
Small metric example (ETA error):
- MAE = avg(|y_true − y_pred|). If true durations are [30, 25, 40] min and predictions are [27, 20, 45]: absolute errors [3, 5, 5] → MAE = 13/3 ≈ 4.3 min.
- p90 error: sort absolute errors and take 90th percentile; robust to outliers.
- WAPE = sum(|actual − forecast|) / sum(actual). Good when trip durations vary widely.
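The three metrics above can be computed in a few lines; this is a minimal sketch using the nearest-rank percentile convention:

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def p90_abs_error(actual, predicted):
    """Nearest-rank 90th percentile of absolute errors (robust to outliers)."""
    errors = sorted(abs(a - p) for a, p in zip(actual, predicted))
    idx = math.ceil(0.9 * len(errors)) - 1
    return errors[idx]

def wape(actual, predicted):
    """Weighted absolute percentage error: total error over total actuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(actual)

# Worked example: true durations [30, 25, 40] min, predicted [27, 20, 45].
# Absolute errors are [3, 5, 5], so MAE = 13/3 ≈ 4.33 min, p90 = 5 min,
# and WAPE = 13/95 ≈ 0.137.
```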
## 3) Cross-Quarter Initiative (Goals → Execution → Results → Learnings)
Initiative: Peak-Hour Reliability (Q1 discovery/design, Q2 build/launch)
- Goals:
  - Q1: Map failure modes, finalize design doc and SLAs/SLOs, build shadow pipeline.
  - Q2: Roll out ETA v2 to 30% of regions; integrate dispatch constraints; achieve the OKRs above.
- Execution:
  - Q1: Ran root-cause analyses on late orders; built a shadow service comparing predictions to ground truth; conducted RFC reviews with product, data science, SRE, and ops.
  - Q2: Shipped v1 (ETA only) via canary; then v1.1 (dispatch tuning + batching); region-by-region rollout with holdouts.
- Results: Met or exceeded targets in 5/6 KPIs; a long-tail market with unreliable merchant signals lagged until we added stricter fallbacks.
- Lessons: Regional eligibility checks and staged rollouts saved us from widespread regressions; invest early in data quality SLAs with merchants.
## 4) In Hindsight: Planning & Design Improvements
- Design doc depth: Include explicit failure-mode and rollback matrix (by dependency: maps, merchant telemetry, feature store). Saves on-call time.
- Capacity planning: Model p99 QPS for peak events (weather spikes, holidays); we initially sized for average + 3σ and saw cache thrash under surge.
- Event contracts: Lock down schemas with versioning and consumer-driven contracts; an upstream nullable field caused silent skew.
- Metrics alignment: Agree early on north-star and guardrail metrics; a brief debate over MAE vs p90 led to inconsistent decisions.
- Shadow period: Extend shadow mode to capture seasonality; 1 week wasn’t enough to see payday and weekend effects.
- Privacy & governance: Flag PII early and route through data governance; avoid rework in logging and model features.
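The capacity-planning point can be illustrated with a toy calculation: on spiky traffic, "average + 3σ" sizing can land well below the true p99. The QPS samples below are invented for illustration:

```python
import math
import statistics

# Hypothetical per-minute QPS samples: steady baseline with two surge spikes.
qps = [100.0] * 98 + [10_000.0, 10_000.0]

mean = statistics.fmean(qps)
sigma = statistics.pstdev(qps)
avg_plus_3sigma = mean + 3 * sigma        # classic "average + 3σ" sizing

# Nearest-rank p99 of the observed samples.
idx = math.ceil(0.99 * len(qps)) - 1
p99 = sorted(qps)[idx]

# avg_plus_3sigma ≈ 4456 QPS, while p99 is 10,000 QPS: sizing to
# mean + 3σ leaves the surge unprovisioned on heavy-tailed traffic.
```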
## 5) What I’d Change Next Time (Concrete Changes)
- Earlier RFC with ops/legal/support to surface policy and compliance constraints.
- Two-phase rollout by risk: start with ETA-only in low-risk regions, then add dispatch coupling.
- Add real-time feature freshness monitors with auto-degradation to heuristics if freshness > threshold.
- Define SLOs and error budgets up front; gate rollouts on SLO compliance.
- Treat data quality as a first-class dependency with SLAs and alerting.
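The freshness monitor with auto-degradation can be sketched as a serving-time gate. The class, field names, and threshold are hypothetical, not a production design:

```python
import time
from typing import Callable, Optional

class EtaPredictor:
    """Serve model predictions only when real-time features are fresh;
    otherwise auto-degrade to a deterministic heuristic."""

    def __init__(self,
                 model_fn: Callable[[dict], float],
                 heuristic_fn: Callable[[dict], float],
                 max_staleness_s: float = 120.0):
        self.model_fn = model_fn
        self.heuristic_fn = heuristic_fn
        self.max_staleness_s = max_staleness_s

    def predict(self, order: dict, feature_ts: float,
                now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        if now - feature_ts > self.max_staleness_s:
            # Features too stale: fall back to the heuristic rather than
            # serve a model prediction built on lagging inputs.
            return self.heuristic_fn(order)
        return self.model_fn(order)
```

In practice the staleness signal would come from feature-store metadata and also emit an alert, so degradation is visible rather than silent.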
## 6) Short “Interview-Length” Version (60–90 seconds)
- I led an end-to-end initiative to improve checkout ETAs and dispatch reliability for an on-demand delivery marketplace. The goal was to cut p90 ETA error by 30% and late deliveries by 20% without hurting courier utilization or latency SLOs. I designed a real-time feature pipeline and a GBDT model with deterministic fallbacks, built an ETA microservice with canary/shadow rollout, and coordinated dispatch tuning with ops. We improved p90 error by 34%, reduced late orders by 22%, cut cancellations by 8%, and maintained p95 latency at 104 ms and 99.97% availability. In hindsight, I would have added deeper failure-mode analyses, stricter event contracts, and extended shadow mode to capture seasonality. Next time I’ll stage rollouts by risk, add feature freshness monitors with auto-fallbacks, and formalize SLOs and error budgets before launch.
## 7) Pitfalls and Guardrails to Mention
- Guardrails: cancellation rate, support contacts, utilization, p95 latency, availability, and rollbacks on threshold breaches.
- Pitfalls: training/serving skew, cold-start, data drift (weather, promotions), dispatch fairness, feature freshness, and peak-time capacity.
Use this structure to tailor your own project story; keep metrics concrete, trade-offs explicit, and the retrospective honest with actionable next steps.