Walk me through a complex, high-impact project in your domain. State the problem, constraints, stakeholders, your role, and success metrics. Dive deep into the technical design (architecture, data model, interfaces), key decisions and trade-offs, testing and rollout, and how you handled risks and failures. Use the STAR framework to illustrate ownership, bias for action, dive deep, and disagree-and-commit moments. Share measurable outcomes and what you would improve. Reserve a few thoughtful questions you would ask the interviewer about the team, roadmap, or culture.
Quick Answer: This question evaluates leadership, technical ownership, system design, and execution competencies by asking for a STAR-based deep dive into a complex project covering problem framing, architecture, data models, interfaces, testing, rollout, incident handling, and measurable outcomes.
Solution
# STAR Case Study: Real-Time Dispatch Optimization and ETA Revamp for a Two-Sided Delivery Marketplace
## Situation
A high-growth delivery marketplace faced increasing order volume during peak hours, leading to missed promised delivery times, courier idle time between trips, and higher cancellations. The legacy dispatch service matched couriers to orders with a simple greedy heuristic and relied on a batch ETA model updated daily. As city density and multi-order batching grew, the system’s decisions became suboptimal and brittle under load.
Pain points:
- 90th percentile delivery times exceeded promises in dense urban zones.
- Courier acceptance rates dropped during peak due to poor assignment quality and long pickups.
- ETA accuracy drifted during weather and special events, increasing customer support tickets.
## Task
Lead the design and rollout of a new real-time dispatch and ETA system that:
- Improves on-time delivery rate without adding courier supply.
- Preserves low end-to-end latency (p95 < 150 ms per decision path) under peak QPS.
- Supports dynamic multi-order batching with clear constraints (food freshness, merchant prep time, courier route).
- Degrades safely on partial failures and provides observability for ops.
Scope: I was the tech lead responsible for end-to-end architecture, cross-team alignment (data science, mobile, infra, ops), incremental delivery plan, and production readiness.
Success metrics (primary and guardrails):
- +3–5% absolute increase in on-time rate (OTR) citywide; maintain or reduce delivery cost per order.
- 8–12% reduction in 90th percentile delivery time in the top 5 dense zones.
- +5% courier acceptance rate during peak; neutral-to-positive courier earnings/hr.
- ETA p50 error < 3 minutes; p90 error < 7 minutes, with drift detection.
- System p95 latency < 150 ms for dispatch decisions; 99.99% (four-nines) availability for the decision path.
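As a rough illustration of how the ETA accuracy targets above could be checked, the sketch below computes p50 and p90 absolute error with a nearest-rank percentile. The function names and the nearest-rank method are assumptions for illustration; the source does not say how the percentiles were computed.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for an integer q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def eta_error_report(predicted_min, actual_min, p50_limit=3.0, p90_limit=7.0):
    """Compare predicted vs. actual delivery minutes against the stated targets.

    The default limits mirror the targets in the text (p50 < 3 min, p90 < 7 min).
    """
    if len(predicted_min) != len(actual_min):
        raise ValueError("inputs must have the same length")
    errors = [abs(p - a) for p, a in zip(predicted_min, actual_min)]
    p50 = percentile(errors, 50)
    p90 = percentile(errors, 90)
    return {
        "p50": p50,
        "p90": p90,
        "within_target": p50 < p50_limit and p90 < p90_limit,
    }
```

Running the same report on a rolling window (for example, hourly) is one simple way to feed a drift alert.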
## Action
I structured the solution into two coordinated tracks: a streaming ETA platform and a decision-service for dispatch/batching.
### Architecture
- Event backbone: Kafka (orders, merchant ready events, courier pings, assignment events). Partitions by city and geohash to localize state and reduce cross-zone chatter.
- Stream processing: Flink jobs compute real-time features (traffic, courier density, merchant prep patterns) and write to a feature store with freshness SLA < 10 s.
- ETA inference service: Low-latency, stateless gRPC service hosting an XGBoost model and a small neural reranker. Models are pre-loaded (warmed) into memory and quantized for CPU inference. p95 < 20 ms.
- Routing engine: Deterministic time-dependent shortest path algorithm with cached travel-time matrices by geohash and time-of-day; fall back to fastest-edge cache on misses.
- Dispatch/batching service: Decision microservice that keeps no durable local state. Each shard holds in-memory candidate state that can be rebuilt from Kafka, and makes assignment decisions every 1–2 seconds. Decisions are idempotent, and state updates are event-sourced via Kafka.
- Control plane: Flag-based policy framework (e.g., enable multi-merchant batching, cap detour minutes by cuisine type). Dynamic updates via config service.
- Observability: Tracing (OpenTelemetry), cardinality-controlled metrics, red/black dashboards per city shard, anomaly alerts on ETA drift and assignment loops.
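To make the "partition by city and geohash" idea concrete, here is a minimal sketch of deriving a message key and mapping it to a partition. The helper names, the geohash precision of 5, and the SHA-256 hash are assumptions for illustration; a real Kafka producer applies its own partitioner to the key, so only the key construction would carry over directly.

```python
import hashlib


def partition_key(city: str, geohash: str, precision: int = 5) -> str:
    """Build a key so that events in the same city cell share a partition.

    Truncating the geohash groups nearby points into one coarser cell.
    """
    return f"{city}:{geohash[:precision]}"


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Two courier pings in the same coarse cell land on the same partition.
key_a = partition_key("sf", "9q8yyk8yt")
key_b = partition_key("sf", "9q8yyzzzz")
same_partition = partition_for(key_a, 64) == partition_for(key_b, 64)
```

The trade-off is that dense cells can become hot partitions, so the precision would likely need tuning per city.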
### Data Model (key entities)
- Order: order_id, merchant_id, created_at, promise_time, cuisine, temp_sensitivity, location (geohash), prep_time_prediction, special_constraints.
- Courier: courier_id, current_location (geohash), capacity (max bag slots), current_load, shift_end_time, acceptance_score, vehicle_type.
- Merchant: merchant_id, location, prep_time_distribution, working_hours, batching_allowlist.
- CandidateBatch: batch_id, [order_ids], route, predicted_total_time, freshness_risk, score.
- Events: courier_ping (5 s), merchant_ready, assignment_committed, cancellation, handoff.
Indexes: city + geohash composite keys; time-bucketed Kafka topics for replay; TTL on hot state to bound memory.
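The entities above could be sketched as plain dataclasses. Field names follow the list in the text; the field types, the `free_slots` helper, and the choice of frozen vs. mutable classes are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Order:
    order_id: str
    merchant_id: str
    created_at: datetime
    promise_time: datetime
    cuisine: str
    temp_sensitivity: str          # assumed label such as "hot" or "ambient"
    location: str                  # geohash
    prep_time_prediction: float    # assumed unit: minutes
    special_constraints: tuple = ()


@dataclass
class Courier:
    courier_id: str
    current_location: str          # geohash
    capacity: int                  # max bag slots
    current_load: int
    shift_end_time: datetime
    acceptance_score: float
    vehicle_type: str

    def free_slots(self) -> int:
        """Hypothetical helper: slots still available for batching."""
        return max(0, self.capacity - self.current_load)


@dataclass
class CandidateBatch:
    batch_id: str
    order_ids: list
    route: list                    # assumed: ordered list of geohash waypoints
    predicted_total_time: float    # assumed unit: minutes
    freshness_risk: float
    score: float
```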
### Interfaces and Contracts
- ETA gRPC: GetETA(order_id | origin, destination, depart_time) -> {p50, p90, uncertainty, features_version}. SLO p95 < 25 ms.
- Routing REST/gRPC: GetRoute(orig, dest, waypoint[]) -> {distance, time, path_hash, confidence}.
- Dispatch gRPC: ProposeAssignments(shard_id) -> [AssignmentProposal]; CommitAssignment(proposal_id) -> Ack.
- Events: All decisions are published as immutable events; consumers update materialized views.
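The last contract, immutable events folded into materialized views, can be illustrated with a small consumer. This is a sketch under assumptions: the event dictionaries, the `event_id` field, and the order-to-courier view are hypothetical shapes, and only the `assignment_committed` and `cancellation` event types come from the text.

```python
def apply_event(view, seen_ids, event):
    """Fold one immutable event into an order -> courier view.

    Duplicate deliveries (same event_id) are ignored, so replaying a topic
    from an earlier offset leaves the view unchanged.
    """
    if event["event_id"] in seen_ids:
        return
    seen_ids.add(event["event_id"])
    if event["type"] == "assignment_committed":
        view[event["order_id"]] = event["courier_id"]
    elif event["type"] == "cancellation":
        view.pop(event["order_id"], None)


def rebuild_view(events):
    """Rebuild the view from scratch by replaying events in order."""
    view, seen_ids = {}, set()
    for event in events:
        apply_event(view, seen_ids, event)
    return view
```

Because the view is a pure function of the event log, a restarted consumer can recover by replaying the log rather than restoring local state.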
### Decision Logic (simplified)
- Candidate generation: For each unassigned order, find nearby couriers within N geohash rings; also generate augmentations to existing courier routes for batching.
- Scoring function (lower is better; minimized subject to the constraints below):
  Score = w1 * ETA_to_pickup + w2 * promise_violation_risk + w3 * courier_detour + w4 * freshness_risk - w5 * courier_utilization_gain
- Constraints: Max detour minutes by cuisine temperature, promise_time hard cap, courier capacity, merchant batching allowlist, fairness (rate-limit repeats to same courier).
- Efficient search: Beam search with pruning; stop when incremental score improvement < epsilon. Uses ETA service for marginal times and uncertainty for risk-aware scoring.
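The search step can be sketched as a generic beam search over pickup sequences for one courier. This is a simplified illustration, not the production algorithm: `cost_fn` and `is_feasible` are hypothetical callables standing in for the scoring function and the constraint checks, and the toy cost in the usage example is invented.

```python
def beam_search(order_ids, cost_fn, is_feasible, max_len, beam_width=3, epsilon=1e-9):
    """Grow pickup sequences one order at a time; lower cost is better.

    Keeps the `beam_width` cheapest feasible sequences per step (pruning)
    and stops once the best new sequence improves the incumbent by less
    than `epsilon`.
    """
    best_seq, best_cost = (), cost_fn(())
    beam = [best_seq]
    for _ in range(max_len):
        expansions = []
        for seq in beam:
            for oid in order_ids:
                if oid in seq:
                    continue
                candidate = seq + (oid,)
                if is_feasible(candidate):
                    expansions.append((cost_fn(candidate), candidate))
        if not expansions:
            break
        expansions.sort(key=lambda item: item[0])
        beam = [seq for _, seq in expansions[:beam_width]]
        top_cost, top_seq = expansions[0]
        if best_cost - top_cost < epsilon:
            break
        best_seq, best_cost = top_seq, top_cost
    return best_seq, best_cost


# Toy usage: orders on a line, courier starts at 0, each pickup is worth 4.
POSITIONS = {"A": 1, "B": 2, "C": 10}


def toy_cost(seq):
    travel, here = 0, 0
    for oid in seq:
        travel += abs(POSITIONS[oid] - here)
        here = POSITIONS[oid]
    return travel - 4 * len(seq)


result = beam_search(["A", "B", "C"], toy_cost, lambda seq: True, max_len=3, beam_width=2)
```

In this toy case the search settles on picking up A then B and declines the distant order C, because adding C raises the cost.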
### Key Decisions and Trade-offs
- Greedy+rerank vs. global optimization: We debated running a full min-cost flow solver on each tick. We chose a hybrid: greedy candidate generation plus an ML reranker, which gives near-optimal choices within a tight latency budget. Disagree-and-commit moment: the data science team preferred full global optimization; we agreed to instrument suboptimality gaps and revisit once p95 latency headroom increases.
- Streaming features vs. request-time joins: Request-time joins add latency and tail spikes. Opted for precomputed streaming features with freshness SLAs and versioned feature sets to simplify rollback.
- State management: Kept decision service stateless with event-sourced state to ease horizontal scaling and fast restarts; accepted slightly higher cross-call overhead.
- Model complexity vs. operability: Quantized XGBoost for baseline ETA and a small reranker for batching stability instead of a single large model, prioritizing debuggability and p95 latency.
### Testing Strategy
- Unit/integration: Contract tests for gRPC interfaces; golden test vectors for ETA and routing; chaos tests for missing features or stale caches.
- Offline replay/simulation: Replayed 4 weeks of Kafka events per city to compare legacy vs. new decisions. Measured on-time delivery, courier time-to-first-trip, idle time, and cost.
- Shadow traffic: Mirrored 20% of live decision requests to the new stack and logged proposed assignments without committing them; monitored delta metrics and latency.
- A/B and canary: Started with 1 small city (canary), then A/B in top 5 dense zones with 10% → 25% → 50% → 100% ramp, guardrails on cancellations, OTR, and courier acceptance.
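One way to implement the staged ramp is deterministic hash bucketing, so a unit that enters treatment at 10% stays in treatment at 25% and beyond. The sketch below is an assumption-laden illustration: the salt, the choice of bucketing unit (zone, order, or courier), and rolling back to 0% on a guardrail breach are hypothetical, and only the 10 → 25 → 50 → 100 steps come from the text.

```python
import hashlib

RAMP_STEPS = (10, 25, 50, 100)


def bucket(unit_id: str, salt: str = "dispatch-v2") -> int:
    """Stable bucket in [0, 100) for a unit such as a zone or order ID."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_treatment(unit_id: str, ramp_percent: int, salt: str = "dispatch-v2") -> bool:
    """A unit is treated when its bucket falls below the current ramp."""
    return bucket(unit_id, salt) < ramp_percent


def next_ramp(current_percent: int, guardrails_ok: bool) -> int:
    """Advance one step when guardrails hold; otherwise roll back to 0."""
    if not guardrails_ok:
        return 0
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return current_percent
```

Changing the salt between experiments avoids reusing the same treated population across unrelated tests.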
### Rollout and Guardrails
- Guardrails:
  - Auto-disable multi-order batching if p90 freshness_risk or complaints spike.
  - Circuit breaker to legacy dispatch if an assignment loop is detected or p95 decision latency > 200 ms for 5 minutes.
  - ETA drift monitor: Backtest against ground truth every hour; revert to the previous model if p90 error > threshold.
- Runbooks and on-call: Created playbooks for cache warm failures, feature staleness, and partial city outages. Synthetic probes per shard to validate canary before each ramp.
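The latency circuit breaker described above might look like the following sketch. The class name, the explicit `now_s` timestamps, and the latching behavior (staying on legacy until a manual reset) are assumptions; the 200 ms threshold and 5-minute window come from the text.

```python
class LatencyBreaker:
    """Trip to legacy dispatch after a sustained p95 latency breach."""

    def __init__(self, threshold_ms: float = 200.0, sustain_s: float = 300.0):
        self.threshold_ms = threshold_ms
        self.sustain_s = sustain_s
        self._breach_started = None  # time the current breach began, if any
        self.use_legacy = False

    def observe(self, now_s: float, p95_ms: float, loop_detected: bool = False) -> bool:
        """Record one reading; return True if traffic should go to legacy."""
        if loop_detected:
            # An assignment loop trips the breaker immediately.
            self.use_legacy = True
        elif p95_ms > self.threshold_ms:
            if self._breach_started is None:
                self._breach_started = now_s
            if now_s - self._breach_started >= self.sustain_s:
                self.use_legacy = True
        else:
            # A healthy reading ends the breach window.
            self._breach_started = None
        return self.use_legacy

    def reset(self) -> None:
        """Assumed manual reset once on-call confirms recovery."""
        self._breach_started = None
        self.use_legacy = False
```

Latching avoids flapping between the two dispatch paths while latency hovers around the threshold.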
### Risks and Failures (and how we handled them)
- Model drift during storms: ETA error spiked; triggered drift alert, auto-switched to a weather-aware fallback feature set, then rolled forward with a retrained model within 24 hours.
- Assignment loop bug: Rare race caused oscillating assignments for closely spaced orders. Mitigated with proposal leases and idempotent commit tokens; wrote a postmortem and added a deterministic tiebreaker.
- Mobile client compatibility: Older courier app versions didn’t display batched waypoints correctly. Hotpatched the API to downgrade responses for legacy clients; added a minimum app-version gate before enabling batching in a city.
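The assignment-loop mitigation (proposal leases plus idempotent commit tokens) can be sketched with an in-memory store. This is a simplified single-process illustration, not the production design: the class, its method names, and the 5-second lease are hypothetical, and the proposal ID doubles as the commit token here.

```python
class AssignmentStore:
    """In-memory sketch of proposal leases and idempotent commits."""

    def __init__(self, lease_s: float = 5.0):
        self.lease_s = lease_s
        self._leases = {}     # order_id -> (proposal_id, expires_at)
        self._committed = {}  # proposal_id -> order_id
        self._assigned = {}   # order_id -> proposal_id

    def propose(self, proposal_id: str, order_id: str, now_s: float) -> bool:
        """Take a short lease on an order so competing proposals back off."""
        if order_id in self._assigned:
            return False
        held = self._leases.get(order_id)
        if held and held[0] != proposal_id and held[1] > now_s:
            return False
        self._leases[order_id] = (proposal_id, now_s + self.lease_s)
        return True

    def commit(self, proposal_id: str, order_id: str, now_s: float) -> bool:
        """Commit once; a retry with the same token is a harmless no-op."""
        if self._committed.get(proposal_id) == order_id:
            return True
        held = self._leases.get(order_id)
        if (
            order_id in self._assigned
            or held is None
            or held[0] != proposal_id
            or held[1] <= now_s
        ):
            return False
        self._committed[proposal_id] = order_id
        self._assigned[order_id] = proposal_id
        del self._leases[order_id]
        return True
```

The lease prevents two proposals from oscillating over the same order, and the token check makes client retries safe after a timeout.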
### Ownership, Bias for Action, Dive Deep, Disagree-and-Commit
- Ownership: Coordinated cross-team milestones, defined SLAs, authored design docs, and drove incident postmortems to completion.
- Bias for action: Shipped an MVP to one pilot city in 6 weeks (ETA-only improvements) while building the full batching engine in parallel.
- Dive deep: Built a city-level simulator to quantify suboptimality of greedy vs. hybrid approaches; this data informed our staged rollout.
- Disagree-and-commit: After debating global optimization, we committed to the hybrid plan with explicit success criteria for revisiting it. Later experiments showed that global optimization closed an OTR gap of less than 1.5% at 3x the latency, which validated the initial choice.
## Result
After 8 weeks of staged rollout across 7 cities:
- On-time rate: +4.6% absolute improvement citywide; +7.8% in top 2 dense zones.
- Delivery time p90: 11.2% reduction during peak; p50 reduced by 6.4%.
- Courier acceptance rate: +5.1% peak improvement; courier earnings/hr up 2.3% with no increase in average trip distance.
- ETA accuracy: p50 error 2.6 minutes, p90 error 6.3 minutes; customer support tickets related to late ETAs down 14%.
- System reliability: Dispatch decision p95 latency 118 ms; 99.98% availability over the first month, slightly below the four-nines target.
- Financial: Net margin per order improved by 1.1% in target zones via better batching without harming customer NPS.
## What I Would Improve
- Global re-optimization window: Periodic (e.g., every 60 s) small-scale min-cost flow rebalancing for congested subgraphs to reduce tail risk.
- Personalized courier constraints: Incorporate individual preferences (e.g., avoid high-rise pickups) learned from accept/decline patterns to boost acceptance and satisfaction.
- Better uncertainty modeling: Calibrated predictive intervals for ETA under rare events to reduce over-confidence and improve risk-aware scoring.
- Self-serve experimentation: A DSL for ops to adjust policy weights safely with real-time guardrails.
## Small Example to Illustrate Scoring
Two orders (A, B) near each other; one courier C near A.
- If C picks A then B: predicted_total_time = 18 min; promise_violation_risk = low; freshness_risk = medium.
- If C picks B only (no batch): predicted_total_time = 11 min; zero freshness risk for B but A waits longer for someone else.
- The hybrid score balances the marginal gain of batching (utilization) against the added detour and freshness risk. It often favors the batch when the second pickup adds no more than 3–4 minutes of detour and both promises are safe.
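To show how the scoring formula would rank the two options, here is a small worked sketch. All weights and input values are invented for illustration and are not taken from the project; only the shape of the formula follows the text, and lower scores are better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    w1: float  # ETA to pickup
    w2: float  # promise violation risk
    w3: float  # courier detour
    w4: float  # freshness risk
    w5: float  # courier utilization gain (subtracted)


def score(eta_to_pickup, promise_violation_risk, courier_detour,
          freshness_risk, courier_utilization_gain, w):
    """Weighted cost of an assignment option; lower is better."""
    return (
        w.w1 * eta_to_pickup
        + w.w2 * promise_violation_risk
        + w.w3 * courier_detour
        + w.w4 * freshness_risk
        - w.w5 * courier_utilization_gain
    )


# Hypothetical weights and inputs.
W = Weights(w1=1.0, w2=10.0, w3=1.0, w4=5.0, w5=4.0)

# Option 1: C picks up A, then B (batched).
batched = score(3, 0.1, 4, 0.5, 2, W)

# Option 2: C picks up B only (no batch, no utilization gain).
single = score(5, 0.1, 0, 0.0, 0, W)

prefer_batch = batched < single
```

With these made-up numbers the batch wins because the utilization gain outweighs the 4-minute detour and the moderate freshness risk; a higher `w4` or a longer detour would flip the decision.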
## Questions for the Interviewer
- How do you balance global marketplace efficiency vs. local user experience when they are in tension? What decision rights do teams have over these trade-offs?
- What are the highest-priority reliability risks in your current dispatch/ETA stack, and how does the team handle incident learning and prevention?
- How tightly integrated are engineering, data science, and operations in shaping the roadmap? Can you share an example of a cross-functional success or challenge?
- What is your approach to experimentation at scale (A/B, geo-rollouts), and what guardrails matter most for go/no-go decisions?
- How do you support engineers in owning systems end-to-end (from design to on-call to postmortems) while ensuring sustainable workload and growth?