Predictive Modeling For Delivery And Marketplace Decisions

What's being tested

DoorDash Data Scientist interviews on this topic test whether you can turn messy marketplace prediction problems into decision-ready models, not just optimize an offline score. You are expected to define the target, prevent temporal leakage, choose evaluation metrics that map to business costs, and explain how model outputs would change dispatch, ranking, promotions, or customer communication. DoorDash cares because ETA accuracy, late-delivery risk, restaurant ranking, courier incentives, and uplift targeting directly affect conversion, on-time delivery, refund rate, Dasher utilization, and marketplace balance. The interviewer is probing for statistical judgment: how you would know the model is valid, whether it is calibrated, whether it causes better outcomes, and what guardrails prevent harm to customers, Dashers, or merchants.

Core knowledge

Target definition is the first modeling decision. For late-delivery risk, specify “delivered more than X minutes after the promised delivery window” rather than vague lateness. For ETA, distinguish actual delivery duration, quoted ETA error, and customer-perceived lateness; each leads to different labels and interventions.
Temporal validation is mandatory in delivery prediction. For example, train on weeks 1–6, validate on week 7, and test on week 8, preserving event time. Random splits leak future restaurant behavior, Dasher supply patterns, weather, batching logic, and marketplace conditions into training, which inflates offline scores.
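The split described here can be sketched in a few lines. This is a minimal illustration, assuming each order is a dict with a hypothetical `created_at` timestamp; the field name, the toy data, and the `temporal_split` helper are illustrative rather than part of any real pipeline.

```python
from datetime import datetime, timedelta

def temporal_split(orders, start, train_weeks=6, valid_weeks=1, test_weeks=1):
    """Split orders into train/valid/test by event time, never at random."""
    train_end = start + timedelta(weeks=train_weeks)
    valid_end = train_end + timedelta(weeks=valid_weeks)
    test_end = valid_end + timedelta(weeks=test_weeks)
    train, valid, test = [], [], []
    for order in orders:
        ts = order["created_at"]  # hypothetical field: when the order was placed
        if start <= ts < train_end:
            train.append(order)
        elif train_end <= ts < valid_end:
            valid.append(order)
        elif valid_end <= ts < test_end:
            test.append(order)
    return train, valid, test

# Toy data: one order per day for 8 weeks.
start = datetime(2024, 1, 1)
orders = [{"order_id": i, "created_at": start + timedelta(days=i)} for i in range(56)]
train, valid, test = temporal_split(orders, start)
```

Every training order precedes every validation order, which in turn precedes every test order, so nothing from the future reaches the model during fitting.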
Feature leakage often appears as “helpful” operational signals. Post-assignment pickup time, actual Dasher arrival, customer support contact, cancellation reason, or final route duration cannot be used for pre-dispatch prediction. A strong answer states feature availability at decision time: checkout, assignment, pickup, or en route.
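One way to make feature availability explicit is to tag every candidate feature with the earliest decision point at which it is known. The sketch below assumes hypothetical feature names and a simple four-stage order lifecycle; signals tagged `None` are treated here as known only after the outcome.

```python
# Decision points in the order they occur during a delivery.
STAGES = ["checkout", "assignment", "pickup", "en_route"]

# Hypothetical feature catalog: earliest stage at which each signal exists.
FEATURE_AVAILABLE_AT = {
    "merchant_prep_time_variance": "checkout",
    "zone_demand_supply_ratio": "checkout",
    "quoted_distance_km": "checkout",
    "dasher_distance_to_store_km": "assignment",
    "actual_dasher_arrival_time": "pickup",
    "actual_pickup_time": "pickup",
    "final_route_duration_min": None,  # assumed known only after delivery
    "support_contact_flag": None,      # assumed to be a post-outcome signal
}

def usable_features(decision_point):
    """Return features already available when scoring at `decision_point`."""
    cutoff = STAGES.index(decision_point)
    return sorted(
        name
        for name, stage in FEATURE_AVAILABLE_AT.items()
        if stage is not None and STAGES.index(stage) <= cutoff
    )

checkout_features = usable_features("checkout")
assignment_features = usable_features("assignment")
```

Scoring at checkout keeps only the three checkout-time signals, while scoring at assignment additionally allows Dasher proximity.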
Model class choice should match interpretability, nonlinearity, and actionability. Logistic regression gives an explainable baseline whose probabilities are often reasonably calibrated; XGBoost or LightGBM often perform well on tabular marketplace data with nonlinear interactions; quantile regression or gradient-boosted quantiles are useful for ETA uncertainty.
Calibration matters when predictions drive thresholds. A model with AUC = 0.85 can still overstate risk. Use reliability plots, calibration-in-the-large, expected calibration error, and bin-level checks: for orders scored 0.20 risk, about 20% should be late. Consider isotonic regression or Platt scaling.
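The bin-level check can be computed directly from scores and outcomes. The following is a minimal sketch with equal-width bins and toy data; the function names and the choice of 10 bins are assumptions for illustration.

```python
def calibration_table(probs, labels, n_bins=10):
    """Per-bin (count, mean predicted risk, observed late rate), empty bins skipped."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            obs_rate = sum(y for _, y in members) / len(members)
            rows.append((len(members), mean_pred, obs_rate))
    return rows

def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted average gap between predicted and observed rates."""
    n = len(probs)
    return sum(
        count / n * abs(mean_pred - obs_rate)
        for count, mean_pred, obs_rate in calibration_table(probs, labels, n_bins)
    )

def calibration_in_the_large(probs, labels):
    """Mean predicted risk minus overall observed rate (0 is ideal)."""
    return sum(probs) / len(probs) - sum(labels) / len(labels)

# Toy case 1: orders scored 0.20 are late 20% of the time.
ece_good = expected_calibration_error([0.2] * 10, [1, 1] + [0] * 8)
# Toy case 2: orders scored 0.20 are actually late 50% of the time.
ece_bad = expected_calibration_error([0.2] * 10, [1] * 5 + [0] * 5)
citl_bad = calibration_in_the_large([0.2] * 10, [1] * 5 + [0] * 5)
```

In the second toy case the model understates risk by 30 points, which ranking metrics such as AUC would not reveal.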
Ranking metrics depend on whether the model is used for ordering or absolute decisions. Use NDCG@K, MAP, pairwise accuracy, or top-decile lift for ranking; use AUC, PR-AUC, log loss, and calibration for probability decisions. For rare events, PR-AUC is often more informative than AUC.
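As a concrete instance of a ranking metric, NDCG@K can be written in a few lines. This sketch uses the linear-gain form (relevance divided by a log2 position discount); other variants use an exponential gain, and the relevance grades below are toy values.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of stores in the order the model ranked them.
perfect = ndcg_at_k([3, 2, 1], k=3)
reversed_order = ndcg_at_k([1, 2, 3], k=3)
```

The metric rewards only ordering, so a model can score well here while its probabilities remain poorly calibrated.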
Cost-sensitive decisioning converts predictions into actions. If the intervention costs $c$ and a late delivery carries an expected loss $L$, an intervention that fully prevents lateness is worth taking when $p(\text{late}) \times L > c$. With capacity constraints, sort by expected incremental value, not raw risk, because high-risk orders may not be fixable.
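The difference between sorting by raw risk and by incremental value shows up in a small example. The sketch below assumes a hypothetical `fixable_share` field, meaning the fraction of an order's late risk the intervention would remove; that quantity would itself have to be estimated, and all numbers are toy values.

```python
def choose_interventions(orders, late_loss, cost, capacity):
    """Pick up to `capacity` orders with the highest positive expected net value."""
    scored = []
    for order in orders:
        # Expected incremental value: avoided late probability times loss, minus cost.
        avoided = order["p_late"] * order["fixable_share"]  # hypothetical field
        value = avoided * late_loss - cost
        if value > 0:
            scored.append((value, order["order_id"]))
    scored.sort(reverse=True)
    return [order_id for _, order_id in scored[:capacity]]

orders = [
    {"order_id": "A", "p_late": 0.9, "fixable_share": 0.1},  # risky, mostly unfixable
    {"order_id": "B", "p_late": 0.5, "fixable_share": 0.8},  # moderate, very fixable
    {"order_id": "C", "p_late": 0.1, "fixable_share": 0.2},  # not worth the cost
]
top_one = choose_interventions(orders, late_loss=20.0, cost=1.0, capacity=1)
top_two = choose_interventions(orders, late_loss=20.0, cost=1.0, capacity=2)
```

With one intervention slot, order B is chosen over the higher-risk order A because the intervention changes B's outcome far more.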
Uplift modeling estimates treatment effect heterogeneity: $\tau(x)=E[Y(1)-Y(0)\mid X=x].$ For promotions, incentives, or proactive support, ranking by predicted outcome risk is not enough; prioritize users or orders with high incremental response. Evaluate using Qini curves, uplift@K, and randomized holdouts.
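An uplift@K check on a randomized holdout can be sketched as follows. It assumes each record is a tuple of (predicted uplift, treated flag, binary outcome) and that treatment was assigned at random; the records are toy data.

```python
def uplift_at_k(records, k_fraction):
    """Treated-minus-control outcome rate among the top k_fraction by predicted uplift.

    Only meaningful when treatment was randomized within the scored population.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    top = ranked[: max(1, int(len(ranked) * k_fraction))]
    treated = [y for _, t, y in top if t == 1]
    control = [y for _, t, y in top if t == 0]
    if not treated or not control:
        return None  # cannot compare without both arms in the slice
    return sum(treated) / len(treated) - sum(control) / len(control)

# (predicted_uplift, treated, converted)
records = [
    (0.9, 1, 1), (0.8, 0, 0), (0.7, 1, 1), (0.6, 0, 0),  # model expects high uplift
    (0.4, 1, 1), (0.3, 0, 1), (0.2, 1, 0), (0.1, 0, 0),  # model expects low uplift
]
top_half = uplift_at_k(records, 0.5)
overall = uplift_at_k(records, 1.0)
```

A useful uplift model shows a larger treated-versus-control gap in its top-ranked slice than in the population as a whole, as in this toy case.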
ETA modeling is both point prediction and uncertainty estimation. Optimize MAE or pinball loss for robust estimates, but also evaluate coverage: a predicted 80% interval should contain the actual delivery time about 80% of the time. Underpredicting ETA may increase conversion but harm trust.
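Pinball loss and interval coverage are both simple to compute by hand. The sketch below treats the 10th and 90th percentile predictions as the bounds of an 80% interval; the delivery times in minutes are toy values.

```python
def pinball_loss(actual, predicted, quantile):
    """Mean pinball (quantile) loss for predictions of the given quantile."""
    total = 0.0
    for y, q in zip(actual, predicted):
        diff = y - q
        # Under-prediction is weighted by quantile, over-prediction by 1 - quantile.
        total += quantile * diff if diff >= 0 else (quantile - 1) * diff
    return total / len(actual)

def interval_coverage(actual, lower, upper):
    """Share of actual values that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(actual, lower, upper) if lo <= y <= hi)
    return hits / len(actual)

actual_minutes = [30, 42, 25, 55, 38]
p10 = [25, 30, 20, 35, 30]
p90 = [40, 45, 35, 50, 45]

coverage = interval_coverage(actual_minutes, p10, p90)
loss_p90 = pinball_loss(actual_minutes, p90, quantile=0.9)
```

Here four of five deliveries land inside the interval, matching the nominal 80%, and the single under-predicted order dominates the 0.9-quantile loss.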
Offline-to-online gaps are common in marketplace ML. A ranking model may improve offline relevance but reduce merchant diversity, increase delivery distance, or overload popular stores. A DS should propose an A/B test with primary metrics, guardrails, segment cuts, and pre-specified launch criteria.
Segmentation analysis protects against averaged-out failures. Check performance by geography, time of day, cuisine, merchant prep-time volatility, weather, new vs. repeat customers, bike vs. car courier, and order size. A model that helps suburban car deliveries may hurt dense urban bike deliveries.
Monitoring from a DS lens means watching model quality and business outcomes, not designing pipelines. Track score distribution shifts, calibration drift, feature missingness rates, MAE, late rate, cancellation rate, refund rate, Dasher wait time, and merchant throttling by segment after launch.

Worked example

For the prompt “Build a late-delivery risk model,” a strong candidate would start by clarifying the decision point: “Are we scoring at checkout, Dasher assignment, pickup, or while en route?” They would define the label precisely, such as an order arriving more than 10 minutes after the promised ETA, and ask what action the score enables: dispatch adjustment, customer notification, refund prevention, or merchant intervention. The answer should then be organized around four pillars: target and data construction, feature design with availability constraints, validation and evaluation, and decision policy plus monitoring.

For features, they might include merchant historical prep-time variance, current demand-supply ratio in the zone, quoted distance, time of day, weather indicators, basket size, batching status if known at scoring time, and Dasher proximity if scoring after assignment. For validation, they should recommend a rolling time-based split and compare a simple baseline, such as historical late rate by merchant-zone-hour, against logistic regression and LightGBM. Evaluation should include PR-AUC if lateness is rare, calibration curves because scores drive actions, and business-weighted metrics such as expected preventable late minutes or cost per avoided late order.
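The simple baseline mentioned above, historical late rate by merchant-zone-hour, might look like the sketch below. The field names are hypothetical, and falling back to the global late rate for unseen keys is an assumption of this sketch rather than something the example prescribes.

```python
from collections import defaultdict

def fit_late_rate_baseline(train_orders):
    """Late rate per (merchant, zone, hour) plus a global rate for unseen keys."""
    counts = defaultdict(lambda: [0, 0])  # key -> [late orders, total orders]
    for order in train_orders:
        key = (order["merchant_id"], order["zone"], order["hour"])
        counts[key][0] += order["late"]
        counts[key][1] += 1
    total_late = sum(late for late, _ in counts.values())
    total = sum(n for _, n in counts.values())
    rates = {key: late / n for key, (late, n) in counts.items()}
    return rates, total_late / total

def predict_late_rate(order, rates, global_rate):
    """Look up the historical rate; fall back to the global rate if unseen."""
    key = (order["merchant_id"], order["zone"], order["hour"])
    return rates.get(key, global_rate)

train_orders = [
    {"merchant_id": "m1", "zone": "z1", "hour": 18, "late": 1},
    {"merchant_id": "m1", "zone": "z1", "hour": 18, "late": 1},
    {"merchant_id": "m1", "zone": "z1", "hour": 18, "late": 0},
    {"merchant_id": "m2", "zone": "z1", "hour": 12, "late": 0},
    {"merchant_id": "m2", "zone": "z1", "hour": 12, "late": 0},
]
rates, global_rate = fit_late_rate_baseline(train_orders)
seen = predict_late_rate({"merchant_id": "m1", "zone": "z1", "hour": 18}, rates, global_rate)
unseen = predict_late_rate({"merchant_id": "m3", "zone": "z2", "hour": 9}, rates, global_rate)
```

If logistic regression or LightGBM cannot beat this lookup on a time-based test set, the added complexity is hard to justify.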

One explicit tradeoff to flag is recall versus intervention cost: aggressively flagging orders may reduce late deliveries but overuse incentives, customer credits, or manual ops actions. A strong close would say: “If I had more time, I’d estimate which late orders are actually preventable, because a risk model alone may target unavoidable failures; I’d also evaluate fairness across neighborhoods, merchants, and courier modes.”

A second angle

For the prompt “Evaluate a new ranking model,” the same predictive-modeling discipline applies, but the unit of decision is an ordered list of stores or items rather than a single binary risk score. Offline metrics like NDCG@K or predicted conversion lift are useful, but DoorDash needs an experiment because ranking changes customer choice, merchant exposure, delivery distance, and kitchen congestion. The DS framing should include primary metrics such as conversion_rate or order_rate, guardrails like delivery time, cancellation rate, merchant concentration, and Dasher utilization, plus segment analysis for new users, dense markets, and cuisine categories. Unlike late-risk prediction, calibration may be less central than counterfactual bias: historical clicks reflect the old ranking policy, so logged position, impressions, and randomized exploration matter for credible evaluation.

Common pitfalls

Pitfall: Treating predictive accuracy as the business objective.

A tempting answer is “maximize AUC and launch if it improves.” That misses whether predictions are calibrated, actionable, and economically valuable. A better answer translates model scores into decisions, defines intervention thresholds, and evaluates incremental impact through experiments or randomized holdouts.

Pitfall: Ignoring time and decision-point leakage.

Many candidates list features like actual pickup time, final route duration, or support contacts because they correlate strongly with lateness. The interviewer wants you to say when the prediction is made and exclude signals unavailable at that moment. This is especially important in delivery systems where events unfold sequentially.

Pitfall: Giving an ML catalog instead of a marketplace answer.

Saying “I would try random forest, neural nets, and gradient boosting” is shallow unless tied to ETA, ranking, uplift, or dispatch decisions. Stronger communication names the baseline, explains why a model class fits the data, chooses metrics aligned to the action, and calls out marketplace side effects like merchant overload or Dasher wait time.

Connections

The interviewer may pivot from predictive modeling into experimentation, especially how to validate a model-driven policy online. They may also probe causal inference, uplift modeling, ranking evaluation, metric design, or calibration. For marketplace problems, expect follow-ups on heterogeneous effects, guardrail metrics, and why offline model gains may not translate into customer or Dasher outcomes.
