DoorDash Data Scientist Interview Prep Guide
Everything DoorDash actually asks Data Scientist candidates — concept walkthroughs, worked examples, and the real interview questions, drawn from candidate reports. Free to read.

Technical Screen
Data Manipulation (SQL/Python)
-
SQL — covered in depth under Onsite below.
-
Window Functions — covered in depth under Onsite below.

What's being tested
This tests analysis-grade data manipulation in pandas and SQL: cleaning messy inputs, joining heterogeneous tables, aggregating by time/customer/product segments, and ranking or deduplicating records correctly. Interviewers are probing whether you can produce trustworthy metric tables under realistic ambiguity: duplicate events, currency normalization, date granularity, missing values, and tie-breaking.
Patterns & templates
-
Groupby aggregation in `pandas`: `df.groupby(keys).agg(...)` for revenue, counts, kWh, or sales totals; validate row grain before aggregating.
-
Time bucketing with `pd.to_datetime`, `.dt.date`, `.dt.to_period("M")`, or SQL `DATE_TRUNC`; avoid mixing timestamps and dates accidentally.
-
Deduplication by business key using `drop_duplicates(subset=..., keep=...)` or `ROW_NUMBER() OVER (...)`; declare deterministic tie-breakers.
-
Join then normalize pattern: merge facts to lookup tables like exchange rates using `merge`; check many-to-one assumptions before computing converted metrics.
-
Ranking within groups via `rank`, `sort_values`, `cumcount`, or SQL `DENSE_RANK`; specify whether ties should share rank.
-
Conditional segmentation using `np.where`, `pd.cut`, `CASE WHEN`, and boolean masks for Prime/non-Prime, price buckets, or customer cohorts.
-
Streaming/counting basics for text-like inputs: use `collections.Counter` or plain `dict`; Unicode normalization with `unicodedata.normalize` and regex tokenization.
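The deduplicate, join, normalize, and aggregate patterns above can be chained into one small pipeline. The sketch below is illustrative only: the tables, column names (`loaded_at`, `to_usd`), and values are all hypothetical, and it assumes one exchange rate per currency.

```python
import pandas as pd

# Hypothetical order lines; order_id 2 was ingested twice.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_ts": pd.to_datetime([
        "2024-01-05 10:00", "2024-01-20 12:30",
        "2024-01-20 12:30", "2024-02-02 09:15",
    ]),
    "loaded_at": pd.to_datetime([
        "2024-01-05 10:01", "2024-01-20 12:31",
        "2024-01-20 12:45", "2024-02-02 09:16",
    ]),
    "currency": ["USD", "CAD", "CAD", "USD"],
    "amount": [20.0, 40.0, 40.0, 10.0],
})
# Hypothetical lookup table: one conversion rate per currency.
rates = pd.DataFrame({"currency": ["USD", "CAD"], "to_usd": [1.0, 0.75]})

# 1) Fix the grain: keep the latest load per order_id (deterministic tie-breaker).
deduped = (
    orders.sort_values(["order_id", "loaded_at"])
          .drop_duplicates(subset="order_id", keep="last")
)

# 2) Join then normalize; validate= raises if the lookup has duplicate keys.
merged = deduped.merge(rates, on="currency", how="left", validate="many_to_one")
merged["amount_usd"] = merged["amount"] * merged["to_usd"]

# 3) Bucket by month and aggregate.
merged["order_month"] = merged["order_ts"].dt.to_period("M")
monthly = (
    merged.groupby("order_month")
          .agg(orders=("order_id", "nunique"), revenue_usd=("amount_usd", "sum"))
          .reset_index()
)
print(monthly)
```

Passing `validate="many_to_one"` turns the many-to-one assumption into an explicit check, so a duplicated lookup key fails loudly rather than silently inflating revenue.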
Common pitfalls
Pitfall: Aggregating before fixing grain. If order lines are duplicated or salaries repeat by country/date, totals and ranks become silently wrong.
Pitfall: Treating date joins as exact timestamp joins. Exchange rates, sales days, and meter readings often require explicit date extraction or as-of logic.
Pitfall: Returning code without explaining assumptions. Say how you handle nulls, duplicates, ties, currencies, and timezone/date boundaries.
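The date-join pitfall can be made concrete with an as-of join. This is a minimal sketch with made-up data, assuming rates are published only on some days and that each order should use the most recent rate at or before its timestamp.

```python
import pandas as pd

# Hypothetical data: no rate is published on 2024-03-03.
orders = pd.DataFrame({
    "order_ts": pd.to_datetime(
        ["2024-03-01 18:30", "2024-03-03 09:00", "2024-03-04 23:59"]
    ),
    "amount_eur": [10.0, 20.0, 30.0],
})
rates = pd.DataFrame({
    "rate_date": pd.to_datetime(["2024-03-01", "2024-03-04"]),
    "eur_to_usd": [1.10, 1.20],
})

# merge_asof requires both frames to be sorted by their join keys.
orders = orders.sort_values("order_ts")
rates = rates.sort_values("rate_date")

# For each order, take the latest rate at or before the order timestamp.
priced = pd.merge_asof(
    orders, rates,
    left_on="order_ts", right_on="rate_date",
    direction="backward",
)
priced["amount_usd"] = priced["amount_eur"] * priced["eur_to_usd"]
print(priced)
```

An exact join on the calendar date would leave the 2024-03-03 order without a rate, which is the kind of silent gap worth calling out when stating assumptions.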
Analytics & Experimentation
-
A/B Testing — covered in depth under Onsite below.
-
Causal Inference — covered in depth under Onsite below.
-
Marketplace Metric Frameworks — covered in depth under Onsite below.
-
Operational Delivery Quality Diagnostics — covered in depth under Onsite below.
-
Marketplace Interference And Switchback Experiments — covered in depth under Onsite below.
-
Unit Economics And Pricing Analytics — covered in depth under Onsite below.
Onsite
Data Manipulation (SQL/Python)

SQL
What's being tested
These questions test SQL analytical data manipulation for marketplace metrics: aggregating orders, joining customer/restaurant/event tables, grouping by time, ranking entities, and computing rates or percentiles. Interviewers are probing whether you can translate ambiguous DoorDash business definitions into correct SQL or Python logic with clean handling of timestamps, nulls, duplicates, and windowed metrics.
Patterns & templates
-
Time-based aggregation: use `DATE_TRUNC('month', created_at)` or `EXTRACT` for monthly metrics; confirm timezone and order-status filters upfront.
-
Conditional aggregation: compute rates with `SUM(CASE WHEN condition THEN 1 ELSE 0 END) * 1.0 / COUNT(*)`; guard against zero denominators.
-
Entity ranking: use `ROW_NUMBER`, `RANK`, or `DENSE_RANK` with `PARTITION BY` for top customers, restaurants, or months; define tie-breaking explicitly.
-
Windowed rolling metrics: use `AVG`, `SUM`, or `COUNT` over `ROWS BETWEEN` or date-range windows; distinguish row-count windows from calendar windows.
-
Percentiles and quartiles: use `PERCENTILE_CONT`, `NTILE(4)`, or `APPROX_PERCENTILE`; know whether the question needs exact thresholds or segmentation buckets.
-
Event sequencing: use `LAG`, `LEAD`, and `ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts)` to identify request flows and latest valid states.
-
Python equivalent: map SQL patterns to `pandas` (`groupby`, `agg`, `merge`, `rank`, `rolling`, `quantile`, and `shift`); watch memory for large tables.
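As a rough pandas counterpart to the time-based and conditional aggregation patterns above, the sketch below computes a monthly late rate. The table, the column names, and the 10-minute lateness threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical order table; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "created_at": pd.to_datetime(
        ["2024-01-03", "2024-01-15", "2024-01-28", "2024-02-02", "2024-02-10"]
    ),
    "status": ["delivered", "delivered", "cancelled", "delivered", "delivered"],
    "minutes_late": [0, 12, np.nan, 15, 25],
})
LATE_THRESHOLD_MIN = 10  # assumed SLA; confirm the real definition with the interviewer

# Status filter first, then month bucket (similar to DATE_TRUNC('month', created_at)).
delivered = orders[orders["status"] == "delivered"].copy()
delivered["month"] = delivered["created_at"].dt.to_period("M")
# Boolean flag plays the role of CASE WHEN ... THEN 1 ELSE 0 END.
delivered["is_late"] = delivered["minutes_late"] > LATE_THRESHOLD_MIN

monthly = delivered.groupby("month").agg(
    delivered_orders=("order_id", "nunique"),
    late_orders=("is_late", "sum"),
)
# Guard against zero denominators, in the spirit of NULLIF(COUNT(*), 0) in SQL.
monthly["late_rate"] = monthly["late_orders"] / monthly["delivered_orders"].replace(0, np.nan)
print(monthly)
```

Counting `nunique` order IDs rather than rows keeps the denominator correct even if the input were later joined to an item-level or event-level table.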
Common pitfalls
Pitfall: Counting rows instead of distinct orders or customers can inflate metrics after joins, especially with order-event or item-level tables.
Pitfall: Treating “late,” “cold,” or “completed” as obvious definitions without asking for SLA thresholds, cancellation handling, and timestamp source.
Pitfall: Filtering on right-table columns in `WHERE` after a `LEFT JOIN` can silently turn it into an inner join and drop customers or restaurants with zero activity; put the condition in the `ON` clause instead.
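The `LEFT JOIN` pitfall is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module with two tiny hypothetical tables and compares a filter placed in `WHERE` against the same filter placed in `ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES
        (10, 1, 'delivered'), (11, 1, 'cancelled'), (12, 2, 'cancelled');
""")

# Filter in WHERE: rows with a NULL or non-matching status are removed after
# the join, so customers 2 and 3 disappear from the result.
where_filter = conn.execute("""
    SELECT c.customer_id, COUNT(o.order_id) AS delivered_orders
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.status = 'delivered'
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# Filter in ON: customers without a delivered order survive with a count of 0.
on_filter = conn.execute("""
    SELECT c.customer_id, COUNT(o.order_id) AS delivered_orders
    FROM customers c
    LEFT JOIN orders o
      ON o.customer_id = c.customer_id AND o.status = 'delivered'
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(where_filter)  # [(1, 1)]
print(on_filter)     # [(1, 1), (2, 0), (3, 0)]
```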

Window Functions
What's being tested
These problems test window-function analytics over customer, order, driver, restaurant, and city-level event data. You need to partition entities, order events in time, compare current rows to prior rows, rank within groups, and combine row-level windows with aggregations without losing analytical meaning.
Patterns & templates
-
Prior-event comparison: use `LAG(value) OVER (PARTITION BY customer_id ORDER BY order_ts)` to compute spend changes, reorder gaps, or previous restaurant/city.
-
Next-event comparison: use `LEAD(status) OVER (...)` when analyzing driver request sequences, acceptance funnels, or what happened after a dispatch attempt.
-
Ranking within groups: use `ROW_NUMBER`, `RANK`, or `DENSE_RANK` with `PARTITION BY city_id/restaurant_id`; choose based on tie behavior.
-
Top-N per segment: wrap window output in a CTE, then filter `WHERE rn <= N`; avoid filtering window aliases in the same `SELECT`.
-
Running totals and rolling metrics: use `SUM(amount) OVER (PARTITION BY user_id ORDER BY order_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` for cumulative spend.
-
Percentile segmentation: use `NTILE(4)`, `PERCENT_RANK`, or `CUME_DIST` for quartiles; validate whether equal-sized buckets or true percentile thresholds are intended.
-
Aggregate then window: first group to `restaurant`/`day`/`customer` level, then apply windows; mixing raw `order` rows with grouped metrics often duplicates revenue.
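For candidates answering in Python, the window patterns above map onto grouped operations in pandas. The sketch below is one possible translation on a made-up order history; the column names are assumptions.

```python
import pandas as pd

# Hypothetical order history; names are illustrative.
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 1, 1, 2, 2],
    "order_ts": pd.to_datetime([
        "2024-05-01 12:00", "2024-05-03 19:00", "2024-05-10 13:00",
        "2024-05-02 18:00", "2024-05-02 20:00",
    ]),
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0],
})

# Stable ordering: timestamp first, order_id as the tie-breaker.
orders = orders.sort_values(["customer_id", "order_ts", "order_id"]).reset_index(drop=True)
by_customer = orders.groupby("customer_id")

# LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_ts)
orders["prev_amount"] = by_customer["amount"].shift(1)
# Reorder gap in days since the customer's previous order.
orders["days_since_prev"] = by_customer["order_ts"].diff().dt.total_seconds() / 86400
# ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts, order_id)
orders["rn"] = by_customer.cumcount() + 1
# SUM(amount) OVER (... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
orders["running_spend"] = by_customer["amount"].cumsum()

# Top-N per segment: filter on rn in a separate step, like an outer query over a CTE.
first_two = orders[orders["rn"] <= 2]
print(orders)
```

Sorting once up front and reusing the grouped object keeps every derived column aligned to the same ordering, which mirrors writing a single `ORDER BY` inside the window definition.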
Common pitfalls
Pitfall: Using `RANK` when the interviewer expects exactly one row per group; use `ROW_NUMBER` with a deterministic tie-breaker like `order_id`.
Pitfall: Forgetting that window functions run after `WHERE`, so filtering dates or statuses too early can remove rows needed for prior-event comparisons.
Pitfall: Ordering only by date when multiple orders occur on the same day; include timestamp and stable IDs to make results reproducible.
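The tie-handling pitfall is easiest to see side by side. The sketch below uses hypothetical restaurant revenue with a deliberate tie and compares pandas analogues of `RANK`, `DENSE_RANK`, and `ROW_NUMBER`.

```python
import pandas as pd

# Hypothetical restaurant revenue by city, with a tie in city "A".
rev = pd.DataFrame({
    "city_id": ["A", "A", "A", "B", "B"],
    "restaurant_id": [11, 12, 13, 21, 22],
    "revenue": [500.0, 500.0, 300.0, 400.0, 250.0],
})
grouped_rev = rev.groupby("city_id")["revenue"]

# RANK() OVER (PARTITION BY city_id ORDER BY revenue DESC): ties share a rank, gaps follow.
rev["rnk"] = grouped_rev.rank(method="min", ascending=False).astype(int)
# DENSE_RANK(): ties share a rank and no gaps follow.
rev["dense_rnk"] = grouped_rev.rank(method="dense", ascending=False).astype(int)

# ROW_NUMBER() ... ORDER BY revenue DESC, restaurant_id ASC: one row per position.
rev = rev.sort_values(
    ["city_id", "revenue", "restaurant_id"], ascending=[True, False, True]
)
rev["rn"] = rev.groupby("city_id").cumcount() + 1

top_by_rank = rev[rev["rnk"] == 1]  # two rows for city A because of the tie
top_by_rn = rev[rev["rn"] == 1]     # exactly one row per city
print(rev)
```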
Analytics & Experimentation

A/B Testing
What's being tested
Interviewers are probing whether you can design, diagnose, and interpret online experiments for ranking, feed, video, shopping, and homepage surfaces where user behavior is noisy and product decisions are high-stakes. For Pinterest, a Data Scientist must connect statistical rigor to user value: a feed-ranking change may increase CTR while hurting long-term saves, creator diversity, or shopping intent. You are expected to define metrics, choose the right randomization unit, estimate power, detect experiment validity issues, and explain causal impact clearly. You may also be asked what to do when clean randomization is unavailable, which tests your ability to reason about causal identification rather than just run a t-test.
Core knowledge
-
Randomized controlled trials estimate causal effects by making treatment independent of potential outcomes: $(Y(0), Y(1)) \perp T$. For Pinterest surfaces, randomize at the `user_id` level when exposure persists across sessions; pageview-level randomization can create contamination if users see mixed feed experiences.
-
Primary metrics should map to the product hypothesis. For a feed-ranking algorithm, candidates might use `engaged_sessions_per_user`, `saves_per_user`, `closeups_per_user`, or `repins_per_user`; for video pins, `video_starts`, `watch_time`, completion rate, and downstream saves may matter more than raw impressions.
-
Guardrail metrics protect against local wins that harm the ecosystem. Common Pinterest-style guardrails include `session_length`, `hide_rate`, `report_rate`, `creator_distribution`, `search_return_rate`, `notification_unsub_rate`, latency-sensitive engagement, and shopping funnel health such as `product_clicks` or `checkout_intent`.
-
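Hypothesis framing should be explicit: null $H_0: \mu_T = \mu_C$ versus alternative $H_1: \mu_T \neq \mu_C$ or $H_1: \mu_T > \mu_C$. Define the minimal detectable effect before launch, e.g., “detect a 0.5% relative lift in saves per user at 80% power and $\alpha = 0.05$.”
-
Power and sample size depend on variance, baseline rate, significance level, and MDE. For a mean metric, approximate per-arm sample size is $n \approx \frac{2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$, where $\sigma^2$ is the metric variance, $1-\beta$ is the power, and $\delta$ is the absolute MDE. Smaller MDEs require quadratically more users.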
-
Ratio metrics like `CTR = clicks / impressions` require care because numerator and denominator are both affected by ranking. Prefer user-level aggregation, delta method, bootstrap, or cluster-robust standard errors rather than treating every impression as independent.
-
CUPED can reduce variance by adjusting for pre-experiment behavior: $Y_{\text{cuped}} = Y - \theta (X - \bar{X})$, where $X$ is a pre-period covariate and $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$. It works well for stable user-level metrics like historical saves or prior engagement.
-
Sample ratio mismatch is a validity red flag: if traffic is intended 50/50 but observed assignment is 49/51 with large $n$, run a chi-square test. Causes may include targeting filters, logging gaps, eligibility rules, or assignment bugs; do not interpret treatment effects until the mismatch is explained.
-
Multiple testing inflates false positives when slicing by many metrics or segments. Pre-register one primary metric, treat secondary metrics as diagnostic, and use methods such as Bonferroni correction, Benjamini-Hochberg FDR, or hierarchical decision rules when many comparisons drive launch decisions.
-
Heterogeneous treatment effects matter for Pinterest because new users, power users, creators, shoppers, and video-heavy users may respond differently. Segment analysis should be hypothesis-driven; avoid “segment fishing” unless clearly labeled exploratory and validated in a follow-up experiment.
-
Novelty and learning effects can distort short-term results. A new shopping module may initially attract curiosity clicks but not durable purchase intent; a ranking change may need days for users and models to adapt. Consider ramp duration, burn-in windows, and cohort-based readouts.
-
Non-randomized causal methods require identification assumptions. Without a control group, use difference-in-differences, synthetic control, interrupted time series, propensity score weighting, or matched controls, but clearly state assumptions like parallel trends, no concurrent shocks, and stable treatment exposure.
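Two of the calculations in the list above, per-arm sample size for a mean metric and the sample ratio mismatch check, fit in a few lines of standard-library Python. This is a sketch under stated assumptions: a two-sided test, equal arm sizes, the normal approximation, and illustrative inputs (`sigma`, `delta`, and the user counts are made up).

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def per_arm_sample_size(sigma: float, delta: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Illustrative inputs: sigma = 4 saves/user, absolute MDE = 0.1 saves/user.
n = per_arm_sample_size(sigma=4.0, delta=0.1)
n_half_mde = per_arm_sample_size(sigma=4.0, delta=0.05)  # halving the MDE roughly quadruples n

# A 49/51 split on one million users is far too lopsided to be chance.
p_srm = srm_p_value(n_control=490_000, n_treatment=510_000)
print(n, n_half_mde, p_srm)
```

The closed-form p-value avoids a SciPy dependency but only holds for a two-arm split; with more arms a general chi-square routine would be needed.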
Worked example
For “Evaluate New Feed-Ranking Algorithm with A/B Testing”, a strong candidate would start by clarifying the product goal: is the new ranker optimizing engagement, long-term retention, content relevance, shopping intent, or creator ecosystem quality? They would ask about eligible users, exposure surface, rollout constraints, and whether the model changes only ranking or also candidate generation. The answer can be organized around four pillars: experiment design, metric framework, statistical plan, and diagnostics.
For design, randomize at the user_id level to avoid within-user contamination and keep treatment sticky across sessions. For metrics, choose one primary metric such as saves_per_user or engaged_sessions_per_user, plus guardrails like hide_rate, report_rate, session_length, and content diversity. For the statistical plan, define the MDE, estimate sample size using historical variance, and specify the inference method, likely user-level difference in means with CUPED if pre-period engagement is predictive. For diagnostics, check sample ratio mismatch, event logging sanity, exposure rates, pre-period balance, and whether treatment actually changed ranking outputs such as average rank position or content mix.
One tradeoff to flag is that CTR may improve if the ranker promotes clickbait-like pins, while saves or long-term return behavior may worsen; therefore, launch criteria should not rely on a single shallow engagement metric. A polished close would be: “If I had more time, I’d inspect heterogeneous effects for new versus retained users and run a longer holdout or follow-up to measure durability.”
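The CUPED step in the statistical plan above can be sketched on simulated data. Everything here is an assumption made for illustration: the data-generating process, the true lift of 0.2, and the use of a single pre-period covariate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated users: pre-period saves predict in-experiment saves (assumed relationship).
pre = rng.gamma(shape=2.0, scale=5.0, size=n)        # pre-experiment covariate X
treated = rng.integers(0, 2, size=n).astype(bool)    # random 50/50 assignment
y = 0.8 * pre + rng.normal(0.0, 3.0, size=n) + 0.2 * treated  # true lift = 0.2

# theta = Cov(X, Y) / Var(X), estimated on the pooled sample.
theta = np.cov(pre, y)[0, 1] / np.var(pre, ddof=1)
y_cuped = y - theta * (pre - pre.mean())


def diff_and_se(metric):
    """Difference in means and its unpooled standard error."""
    t, c = metric[treated], metric[~treated]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)
    return diff, se


raw_diff, raw_se = diff_and_se(y)
cuped_diff, cuped_se = diff_and_se(y_cuped)
print(f"raw:   {raw_diff:.3f} (SE {raw_se:.3f})")
print(f"cuped: {cuped_diff:.3f} (SE {cuped_se:.3f})")
```

Because the covariate is measured before assignment, the adjustment leaves the expected treatment effect unchanged and only shrinks the standard error; how much it shrinks depends on how predictive the pre-period metric is.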
A second angle
For “Recover causal effect without a control group”, the same experimentation mindset applies, but the central issue shifts from randomization to identification. Instead of saying “just compare before and after,” a strong candidate would ask whether there is a plausible untreated comparison: unaffected geographies, ineligible users, similar surfaces, or historical time periods. If no direct control exists, they might propose interrupted time series with seasonality controls, synthetic control built from comparable cohorts, or difference-in-differences if a credible comparison group can be found. The answer should emphasize assumptions and validation: pre-trend checks, placebo intervention dates, negative-control metrics, and sensitivity to concurrent launches. The conclusion should be probabilistic rather than overconfident, because observational estimates are usually less defensible than a clean randomized test.
Common pitfalls
Pitfall: Treating impressions as independent observations.
A tempting but wrong answer is to run a two-proportion z-test on all pin impressions and declare significance from millions of rows. The better approach is to aggregate to the randomization unit, usually user_id, because repeated impressions from the same user are correlated and the treatment is assigned at the user level.
Pitfall: Optimizing for the metric that moved instead of the metric that matters.
Candidates often say “CTR increased, so launch,” without asking whether the change harmed saves, long-term retention, shopping conversion quality, or negative feedback. A stronger answer separates the primary decision metric, secondary diagnostics, and guardrails, then explains how conflicting movements would be adjudicated.
Pitfall: Hand-waving causal inference when no control exists.
For no-control scenarios, “use regression” is not enough. Interviewers want to hear the identification assumption, why it might be plausible, how you would test it with pre-trends or placebo checks, and what uncertainty remains after the analysis.
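The first pitfall above, treating impressions as independent, can be quantified on simulated data. The sketch below compares a naive impression-level standard error for CTR with a delta-method standard error that treats users as the independent units; the distributions and parameters are assumptions chosen only to create between-user variation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users = 5_000

# Simulated users with heterogeneous click propensity and impression counts.
impressions = rng.poisson(lam=20, size=n_users) + 1
user_ctr = rng.beta(a=2.0, b=18.0, size=n_users)  # each user has their own CTR
clicks = rng.binomial(impressions, user_ctr)

ctr = clicks.sum() / impressions.sum()

# Naive SE: pretends every impression is an independent Bernoulli trial.
naive_se = np.sqrt(ctr * (1 - ctr) / impressions.sum())

# Delta-method SE for a ratio of user-level means.
mean_c, mean_i = clicks.mean(), impressions.mean()
var_c, var_i = clicks.var(ddof=1), impressions.var(ddof=1)
cov_ci = np.cov(clicks, impressions)[0, 1]
delta_var = (
    var_c / mean_i**2
    - 2 * mean_c * cov_ci / mean_i**3
    + mean_c**2 * var_i / mean_i**4
) / n_users
delta_se = np.sqrt(delta_var)

print(f"CTR={ctr:.4f}  naive SE={naive_se:.5f}  delta-method SE={delta_se:.5f}")
```

In this simulation the naive standard error comes out noticeably smaller than the user-level one, which is how an impression-level z-test ends up overstating significance.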
Connections
This topic often pivots into metric design, causal inference, ranking evaluation, product analytics, and sequential testing. Be ready to discuss how offline recommender metrics like NDCG, calibration, or relevance labels relate imperfectly to online A/B outcomes, and how to diagnose metric movements by cohort, funnel stage, and content type.
Further reading
-
Trustworthy Online Controlled Experiments — Kohavi, Tang, and Xu — Practical reference for experiment design, diagnostics, variance reduction, and organizational decision-making.
-
Causal Inference: The Mixtape — Scott Cunningham — Clear treatment of difference-in-differences, synthetic control, matching, and observational causal assumptions.
-
Controlled Experiments on the Web: Survey and Practical Guide — Kohavi et al. — Seminal paper on web experimentation pitfalls including SRM, metrics, and online testing practice.

Causal Inference
What's being tested
DoorDash causal inference interviews test whether you can move from “a metric changed” to “what caused it, how would we prove it, and what decision should we make?” The interviewer is probing for marketplace experimentation, counterfactual reasoning, metric design, and structured diagnosis across consumers, merchants, and Dashers. DoorDash cares because small operational changes—pay models, batching, merchant selection, grocery substitution flows, ETA promises—can shift multiple sides of the marketplace at once. A strong Data Scientist distinguishes correlation from causation, chooses the right unit of analysis, anticipates interference, and translates evidence into a launch or rollback recommendation.
Core knowledge
-
Causal estimand comes before method. State whether you need the average treatment effect, $\mathrm{ATE} = E[Y(1) - Y(0)]$, the treatment effect on the treated, a local effect, or a segment-specific effect. For DoorDash, effects often differ by market, daypart, merchant type, weather, and order distance.
-
Randomized controlled trials are the gold standard when feasible. Define treatment, control, randomization unit, exposure window, primary metric, guardrails, and decision rule upfront. Common units include consumer, merchant, Dasher, delivery, store, or geo-market; the wrong unit can create contamination.
-
Interference is central in marketplaces. One Dasher’s incentive treatment can affect untreated consumers through supply reallocation, batching, and ETA changes, violating SUTVA. If spillovers are likely, prefer geo-level randomization, switchback designs, or explicit network/market-level analysis over naive user-level randomization.
-
Switchback experiments work well for operational levers that affect a market in real time, such as Dasher pay, dispatch logic, or batching. Randomize market-time blocks, for example city-hour or zone-daypart, and analyze with clustered standard errors. Watch for carryover effects if incentives alter Dasher positioning after the treatment block.
-
Difference-in-differences estimates causal effects when randomization is unavailable: $\hat{\tau}_{\text{DiD}} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}})$, where $T$ and $C$ denote the treated and comparison groups. Its key assumption is parallel trends. Validate with pre-period trend plots, placebo tests, and matched controls before trusting the estimate.
-
Regression adjustment improves precision but does not magically remove bias. A model like $Y_i = \beta_0 + \tau T_i + \gamma^\top X_i + \varepsilon_i$ is credible when treatment assignment is as-good-as-random conditional on covariates. Include pre-treatment variables only; controlling for post-treatment mediators like actual delivery time can block part of the causal effect.
-
Instrumental variables can help when treatment exposure is endogenous, but the bar is high. A valid instrument must affect the outcome only through treatment and be correlated with treatment. In delivery settings, examples are rare; randomized eligibility or phased rollout timing can sometimes serve, while raw distance or demand level usually fails exclusion restrictions.
-
Metric hierarchy should separate primary outcomes, guardrails, and diagnostics. For late deliveries, primary metrics might be `% deliveries > promised ETA + 10 min` and `consumer reorder rate`; guardrails include Dasher earnings per active hour, merchant cancellations, refunds, and contribution profit. Diagnostics include pickup wait, travel time, batching rate, and prep-time error.
-
Unit of analysis must match the causal question. For cold food complaints, order-level outcomes are natural, but consumer-level repeat behavior needs consumer aggregation. For merchant variety, treatment may be market-level because adding merchants changes the choice set for many consumers simultaneously.
-
Power and minimum detectable effect should be discussed quantitatively. For a binary metric with rate $p$ and sample size $n$, approximate standard error is $\sqrt{p(1-p)/n}$. Rare events like cold-food complaints or grocery out-of-stock substitutions may need larger samples, longer tests, CUPED-style variance reduction, or a more frequent proxy metric.
-
Heterogeneous treatment effects matter operationally but should be planned. Segment by market density, distance band, merchant category, new versus existing consumers, and daypart. Avoid fishing across dozens of cuts without correction; use pre-specified segments, hierarchical modeling, or false discovery rate controls when many comparisons are explored.
-
Causal diagnosis is not only “run an experiment.” Start with metric validation, decomposition, temporal and geographic localization, cohort cuts, and funnel breakdowns. Then generate hypotheses and choose evidence: experiment if controllable, quasi-experiment if not, observational analysis for prioritization, and deep dives into logs or support labels as signal sources.
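The difference-in-differences estimate and its pre-trend check from the list above can be sketched in pandas. The weekly late-delivery rates below are entirely made up, and the launch is assumed to happen at week 5.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly late-delivery rates (%); weeks 1-4 are pre-launch, 5-6 post-launch.
df = pd.DataFrame({
    "week": list(range(1, 7)) * 2,
    "group": ["treated"] * 6 + ["control"] * 6,
    "late_rate": [12.0, 12.2, 11.8, 12.0, 9.1, 8.9,     # treated markets
                  11.0, 11.2, 10.8, 11.0, 10.6, 10.4],   # control markets
})
df["period"] = np.where(df["week"] >= 5, "post", "pre")

# 2x2 table of group-by-period means, then the double difference.
cell = df.groupby(["group", "period"])["late_rate"].mean().unstack("period")
treated_change = cell.loc["treated", "post"] - cell.loc["treated", "pre"]
control_change = cell.loc["control", "post"] - cell.loc["control", "pre"]
did = treated_change - control_change

# Pre-trend check: the treated-minus-control gap should be roughly flat before launch.
pre = df[df["period"] == "pre"].pivot(index="week", columns="group", values="late_rate")
gap = pre["treated"] - pre["control"]
pre_trend_slope = np.polyfit(gap.index.to_numpy(dtype=float), gap.to_numpy(), deg=1)[0]

print(f"DiD estimate: {did:.2f} points; pre-period gap slope: {pre_trend_slope:.3f}")
```

A flat pre-period gap does not prove parallel trends after launch, but a clearly sloped gap is a strong reason to distrust the estimate.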
Worked example
For “Diagnose and experiment to reduce late deliveries”, a strong candidate would start by clarifying the definition of “late”: late relative to quoted ETA, promised delivery window, or actual food-ready time, and whether the business cares about order-level lateness, consumer experience, or merchant SLA. They would state assumptions such as “I’ll treat late delivery rate as the primary metric, with cancellation, refund rate, Dasher earnings, and reorder rate as guardrails.” The answer should be organized around four pillars: validate the metric and instrumentation, decompose the delivery lifecycle, identify causal hypotheses, and design experiments or quasi-experiments.
The diagnostic decomposition might split total delivery time into merchant prep time, Dasher assignment latency, travel-to-merchant, pickup wait, travel-to-consumer, and batching delay. The candidate should compare affected versus unaffected markets, dayparts, merchant types, weather conditions, and distance bands to isolate where the spike originates. If the leading hypothesis is inaccurate prep-time estimates, an experiment could adjust quoted prep-time buffers or merchant-facing prep prompts at the merchant or market level. If the hypothesis is Dasher supply shortage, a switchback test of incentives by zone-hour may be more appropriate than consumer-level randomization. A key tradeoff is that geo or switchback tests reduce interference but require more careful power calculations and clustered analysis. The close should be decision-oriented: “If the test reduces late rate without hurting Dasher earnings, merchant wait, or contribution margin, I’d recommend rollout; if I had more time, I’d estimate heterogeneous effects by market density and build a monitoring dashboard for recurrence.”
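A zone-hour switchback like the one proposed above can be sketched as random assignment of blocks followed by block-level analysis. This is a simplified simulation: the zones, the effect size of -1.5 points, and the noise level are assumptions, and it ignores carryover between adjacent blocks and correlation within zones, both of which a real analysis would need to address.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical design: 3 zones x 168 one-hour blocks (7 days), randomized independently.
zones = ["zone_a", "zone_b", "zone_c"]
blocks = pd.DataFrame(
    [(zone, hour) for zone in zones for hour in range(7 * 24)],
    columns=["zone", "hour_block"],
)
blocks["treated"] = rng.integers(0, 2, size=len(blocks)).astype(bool)

# Simulated block-level late rate (%) with an assumed -1.5 point treatment effect.
noise = rng.normal(0.0, 2.0, size=len(blocks))
blocks["late_rate"] = 12.0 + noise - 1.5 * blocks["treated"]

# Analyze at the randomization unit (zone-hour block), not per delivery.
t = blocks.loc[blocks["treated"], "late_rate"]
c = blocks.loc[~blocks["treated"], "late_rate"]
effect = t.mean() - c.mean()
se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
print(f"estimated effect: {effect:.2f} points (SE {se:.2f})")
```

The effective sample size here is the 504 blocks rather than the deliveries inside them, which is why switchback tests need more careful power calculations than user-level tests.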
A second angle
For “Define and Measure Merchant Variety's Impact on Consumers”, the same causal toolkit applies, but the treatment is less like a single operational toggle and more like a change to the consumer choice set. The first task is defining merchant variety: count of available merchants, cuisine/category diversity, price-band coverage, distance-adjusted availability, or entropy of impressions/orders. A naive correlation between higher variety and higher order frequency is confounded because dense, high-demand markets naturally attract more merchants. Strong designs include geo-level expansion tests, phased merchant onboarding, matched market difference-in-differences, or synthetic controls for markets receiving new categories. The causal outcome should include consumer conversion and retention, but also merchant cannibalization, Dasher efficiency, and marketplace liquidity.
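One of the variety definitions mentioned above, entropy of orders across merchants, is straightforward to compute. The sketch below uses Shannon entropy in nats on hypothetical merchant IDs; whether entropy is the right definition is a product judgment, not something the code settles.

```python
from collections import Counter
from math import log


def order_entropy(merchant_ids):
    """Shannon entropy (in nats) of the order share across merchants."""
    counts = Counter(merchant_ids)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


# Hypothetical markets: same number of orders, very different concentration.
concentrated = order_entropy(["m1"] * 9 + ["m2"])          # 90% of orders go to one merchant
spread_out = order_entropy(["m1", "m2", "m3", "m4"] * 5)   # orders spread evenly over four

print(f"concentrated={concentrated:.3f}  spread_out={spread_out:.3f}")
```

Unlike a raw merchant count, entropy rises only when orders are actually spread across merchants, so it separates nominal selection from variety that consumers use.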
Common pitfalls
Pitfall: Treating correlation as causation.
A tempting answer is “markets with more merchants have higher retention, so variety improves retention.” That ignores demand density, income, urbanicity, marketing, and pre-existing consumer intent. A better answer states the counterfactual problem, proposes randomization or a quasi-experiment, and uses observational cuts only to prioritize hypotheses.
Pitfall: Optimizing one side of the marketplace in isolation.
For a Dasher compensation change, focusing only on delivery speed or acceptance rate is incomplete. Higher pay could improve fulfillment while hurting contribution margin, consumer fees, or merchant throughput; lower pay could preserve margin while creating supply shortages. DoorDash interviewers expect a balanced metric suite across consumers, merchants, Dashers, and business health.
Pitfall: Giving an experiment design without discussing interference and unit choice.
User-level A/B testing is often wrong for dispatch, pricing, selection, and operational levers because treated and control users share Dashers, merchants, and market capacity. Call out spillovers explicitly and choose market-level, store-level, cluster-level, or switchback randomization when marketplace equilibrium effects are likely.
Connections
Interviewers may pivot from causal inference into metric design, A/B testing and power analysis, root-cause analysis, marketplace dynamics, or ML model evaluation for ETA, ranking, and dispatch models. Be ready to discuss guardrail metrics, segmentation, novelty effects, multiple comparisons, and how offline model quality connects to online marketplace outcomes.
Further reading
-
Mostly Harmless Econometrics — rigorous intuition for regression, IV, difference-in-differences, and causal identification.
-
Trustworthy Online Controlled Experiments — practical experimentation guidance from large-scale product settings.
-
Causal Inference: The Mixtape — accessible treatment of matching, synthetic controls, DiD, and modern causal designs.

Marketplace Metric Frameworks
What's being tested
DoorDash is probing whether you can build a marketplace measurement framework for a three-sided system: consumers, Dashers, and merchants. Strong answers connect business goals like completed_orders, gross_order_value, and contribution_profit to operational realities like Dasher supply, delivery reliability, conversion funnels, and merchant availability. The interviewer is not looking for a generic metric list; they are testing whether you can define the right primary metric, choose meaningful guardrails, segment intelligently, and design an experiment or quasi-experiment that supports causal claims. DoorDash cares because marketplace changes often help one side while hurting another, so a Data Scientist must quantify tradeoffs rather than optimize a single surface metric.
Core knowledge
-
North Star metrics should reflect marketplace value creation, not just activity. For DoorDash, likely candidates include `completed_orders`, `order_volume`, `gross_order_value`, `delivery_success_rate`, or `contribution_profit`, depending on the problem. Always clarify whether the goal is growth, reliability, profitability, or liquidity.
-
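Three-sided marketplace decomposition is essential. A change in `completed_orders` can be decomposed into consumer demand, merchant supply, and Dasher capacity, for example $\text{completed orders} = \text{consumer sessions} \times \text{conversion rate} \times \text{fulfillment success rate}$. Fulfillment success itself depends on acceptance, pickup, delivery, cancellation, and lateness.
-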
Funnel metrics should separate intent from execution. For app install or ordering flows, track `landing_page_views`, `click_to_app_store`, `install_start`, `install_complete`, `first_open`, `signup_complete`, `first_order_started`, and `first_order_completed`. A DS should define denominators precisely because a 5% lift in `install_rate` may not translate to `first_order_rate`.
-
Guardrail metrics protect marketplace health. Typical guardrails include `cancellation_rate`, `lateness_rate`, `refund_rate`, `support_contact_rate`, `Dasher_utilization`, `merchant_order_error_rate`, `average_delivery_time`, and `consumer_NPS`. A launch that increases demand but worsens `delivery_success_rate` may be a bad tradeoff.
-
Segmentation is not optional in local marketplaces. Analyze by `market`, `submarket`, cuisine, time of day, weekday/weekend, new versus returning consumers, Dasher vehicle type, merchant density, and weather or event periods. DoorDash effects are often heterogeneous: Los Angeles lunch may behave differently from suburban dinner.
-
Metric hierarchy helps communicate clearly. Use a tree: business outcome → driver metrics → diagnostic metrics. For example, `completed_orders` → `checkout_conversion`, `available_merchants`, `average_delivery_fee`, `ETA`, `Dasher_supply_hours` → `assignment_rate`, `acceptance_rate`, `pickup_wait_time`, `merchant_prep_time`.
-
Experiment unit choice determines validity. Consumer-facing UI changes may randomize by user or session; Dasher incentives may require market-level or Dasher-level randomization; new-market rollouts often need geo-level designs. Watch for interference, where treatment changes supply-demand balance for untreated users in the same market.
-
Causal inference methods matter when randomized tests are impractical. Use difference-in-differences when treated and control markets have parallel pre-trends, synthetic control for one or few treated geographies, and interrupted time series for broad launches. Always inspect pre-period trends and seasonality before claiming causality.
-
Power and minimum detectable effect should be discussed at the metric level. A binary metric like conversion, with rate $p$ and sample size $n$, has approximate standard error $\sqrt{p(1-p)/n}$; rare events like fraud or severe lateness need much larger samples. For geo experiments, effective sample size is the number of markets, not the number of users.
-
Ratio metrics require care. Metrics like `delivery_success_rate = successful_deliveries / accepted_orders` can move because the numerator improves or because the denominator composition changes. Use numerator-denominator decomposition and consider delta-method, bootstrap, or cluster-robust standard errors.
-
Marketplace tradeoffs should be made explicit. Lower delivery fees may increase `conversion_rate` but reduce `contribution_margin`; higher Dasher incentives may improve `assignment_rate` but hurt profitability. A strong DS frames success as a constrained optimization: maximize primary metric subject to guardrails staying within pre-defined thresholds.
-
Anomaly diagnosis should start broad, then narrow. Validate whether the drop is real, identify affected segments, decompose the metric, generate hypotheses, and test them with historical comparisons, cohorts, and counterfactuals. Do not jump to one cause like “fewer Dashers” without checking demand, merchant availability, app changes, pricing, and external shocks.
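The power point in the list above, that rare events need much larger samples, can be checked numerically. The sketch assumes a two-sided, two-arm test with equal arm sizes and the normal approximation; the baseline rates and the 50,000 users per arm are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def binary_mde(p: float, n_per_arm: int,
               alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a two-arm test on a rate with baseline p."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * p * (1 - p) / n_per_arm)


# Illustrative baselines: a 30% conversion rate versus a 1% severe-lateness rate.
mde_conversion = binary_mde(0.30, n_per_arm=50_000)
mde_rare = binary_mde(0.01, n_per_arm=50_000)

for p, mde in ((0.30, mde_conversion), (0.01, mde_rare)):
    print(f"baseline={p:.2%}  abs MDE={mde:.4%}  relative MDE={mde / p:.1%}")
```

At the same sample size the rare metric has a smaller absolute MDE but a much larger relative one, which is why a more frequent proxy metric is sometimes preferred.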
Worked example
For “Diagnose completed orders drop in Los Angeles”, a strong candidate would start by clarifying: “Is this a sudden or sustained drop, relative to what baseline, and is it isolated to LA or visible in comparable markets?” They would define the target metric precisely, such as completed_orders by order date, and ask whether the drop is absolute volume, seasonally adjusted volume, or conversion-adjusted volume. The answer should be organized around four pillars: first, validate the data and scope; second, decompose completed_orders into demand, conversion, and fulfillment; third, segment by geography, time, customer cohort, merchant type, and Dasher supply; fourth, test hypotheses against controls or historical baselines.
A good decomposition might compare traffic, cart_start_rate, checkout_conversion, payment_success_rate, assignment_rate, cancellation_rate, and delivery_success_rate. If LA traffic is stable but checkout conversion is down, investigate fees, ETA, app changes, or restaurant availability; if conversion is stable but fulfillment is down, inspect Dasher supply, acceptance, weather, and merchant prep time. A key tradeoff to flag is speed versus rigor: for an urgent operational issue, you would first produce a directional decomposition within hours, then follow with causal validation using matched markets or pre/post analysis. The candidate should avoid claiming causality from correlation and instead say what evidence would distinguish competing hypotheses. A strong close: “If I had more time, I’d build a market-level counterfactual using similar cities and quantify how much each driver explains of the LA shortfall.”
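Quantifying how much each driver explains of a shortfall is simplest when the metric is written as a product, because log changes then add up exactly. The sketch below assumes a three-factor multiplicative decomposition and uses made-up numbers; neither is taken from real DoorDash data.

```python
import math

# Hypothetical weekly LA figures; every number here is invented for illustration.
baseline = {"sessions": 1_000_000, "checkout_conversion": 0.10, "fulfillment_success": 0.95}
current = {"sessions": 1_000_000, "checkout_conversion": 0.09, "fulfillment_success": 0.90}


def completed_orders(m):
    """Assumed identity: sessions x checkout conversion x fulfillment success."""
    return m["sessions"] * m["checkout_conversion"] * m["fulfillment_success"]


# Log change per driver; these sum to the log change in completed orders.
log_change = {k: math.log(current[k] / baseline[k]) for k in baseline}
total_log_change = math.log(completed_orders(current) / completed_orders(baseline))
share = {k: v / total_log_change for k, v in log_change.items()}

for driver, s in share.items():
    print(f"{driver}: {s:.1%} of the drop")
```

In this example roughly two thirds of the drop is attributed to checkout conversion and one third to fulfillment, which would point the investigation toward fees, ETA, or app changes before Dasher supply.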
A second angle
For “Boost App Installs: Analyze and Experiment with Conversion Funnel”, the same framework applies, but the object is a consumer acquisition funnel rather than an operating-marketplace anomaly. The primary metric might be first_order_completed_per_landing_page_visitor, not simply app_install_rate, because installs only matter if they produce marketplace demand. The experiment could randomize mobile web visitors into different app-install prompts, deep links, incentives, or landing-page designs, with guardrails like bounce_rate, web_order_conversion, and cost_per_incremental_first_order. The main constraint is attribution: users may switch devices, install later, or already have the app, so the DS must define eligible cohorts carefully. Unlike LA diagnostics, this is more amenable to user-level A/B testing, though interference can still occur if incremental orders affect Dasher capacity in constrained markets.
Common pitfalls
Pitfall: Optimizing one metric while ignoring marketplace balance.
A tempting answer is “success is more orders” or “success is more installs.” That is incomplete at DoorDash because demand growth can worsen ETA, lateness_rate, or cancellation_rate if Dasher supply and merchant capacity do not scale. A better answer names a primary metric and 3–5 guardrails with explicit thresholds or decision rules.
Pitfall: Treating every analysis as a simple A/B test.
Many marketplace programs, especially new-market launches or Dasher supply changes, cannot be cleanly randomized at the user level. If treatment changes local liquidity, untreated users may be affected too, violating SUTVA. Strong candidates discuss geo experiments, difference-in-differences, synthetic controls, or staged rollouts when randomization is constrained.
Pitfall: Listing metrics without a causal story.
Interviewers are not impressed by a long dashboard inventory. They want to see how metrics connect: if completed_orders drops, what driver changed, what hypotheses explain that driver, and what evidence would confirm or reject each hypothesis? Use a metric tree and narrate the investigation path.
Connections
Interviewers may pivot from marketplace metrics into experiment design, especially unit of randomization, power, and guardrail interpretation. They may also move into causal inference, funnel analytics, cohort analysis, or ranking/recommendation evaluation if the marketplace change involves search results, ETA ranking, promotions, or personalized incentives.
Further reading
-
Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu — Practical reference for A/B testing, guardrails, ratio metrics, and experimentation pitfalls.
-
Mostly Harmless Econometrics by Angrist and Pischke — Strong foundation for difference-in-differences, instrumental variables, and causal reasoning in observational settings.
-
Causal Inference: The Mixtape by Scott Cunningham — Accessible treatment of modern causal methods, including synthetic control and event studies useful for geo-market analysis.
Practice questions

What's being tested
DoorDash is probing whether you can turn an operational symptom like cold food, late deliveries, or Dasher wait time into a rigorous diagnostic analytics and experimentation plan. A strong Data Scientist should define the right metric, decompose the end-to-end delivery funnel, isolate likely causes using observational data, and propose credible tests that distinguish correlation from causation. DoorDash cares because small changes in prep time, dispatch timing, routing, batching, or merchant handoff can materially affect customer retention, Dasher earnings, merchant trust, and marketplace efficiency. The interviewer is looking for structured thinking, statistical judgment, and an ability to balance marketplace tradeoffs rather than jumping to one explanation.
Core knowledge
-
End-to-end order lifecycle decomposition is the backbone: order created → merchant accepts → food prep → Dasher assigned → Dasher arrives at merchant → pickup → travel to customer → dropoff. Most delivery-quality metrics should be decomposed into stage-level durations before proposing interventions.
-
Primary metrics should map directly to customer harm, e.g. late_delivery_rate, cold_food_complaint_rate, avg_delivery_lateness, or p90_delivery_time. Use a rate denominator carefully: complaints per delivered order can be biased by customer propensity to complain, while app-observed lateness is broader but less directly tied to sentiment.
-
Guardrail metrics prevent local optimization. For late deliveries, guardrails might include dasher_wait_time, merchant_cancellation_rate, refund_rate, dasher_earnings_per_active_hour, batch_acceptance_rate, and customer_rating. Reducing lateness by over-dispatching Dashers may improve customer ETAs while harming Dasher utilization.
-
Metric decomposition should separate mean effects from tail effects. DoorDash quality issues often live in the right tail, so track p50, p75, p90, p95, and p99, not just averages. For lateness, define and analyze both the incidence of lateness, $P(\text{late})$, and conditional severity, $E[\text{lateness} \mid \text{late}]$.
-
Segmentation should be hypothesis-driven: merchant type, cuisine, prep-time volatility, geography, market density, time of day, weather, order size, distance, batching status, Dasher supply-demand ratio, and new versus repeat customers. Avoid slicing into hundreds of cohorts without multiplicity control or a prior rationale.
-
Counterfactual thinking matters because marketplace metrics are confounded. Rain can increase both delivery times and complaint rates; busy merchants may have longer prep times and more complex orders. Use matching, regression adjustment, difference-in-differences, or fixed effects to estimate whether a suspected driver actually caused the spike.
-
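Difference-in-differences is useful when a change affects some markets, merchants, or time windows but not others. The basic estimand is $\hat{\beta}_{\text{DiD}} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}})$, where $T$ is the treated group and $C$ is the comparison group. The key assumption is parallel trends, which you should validate visually and with pre-period placebo checks.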
-
Regression modeling can quantify drivers but should not be oversold as causal. A model like cold_complaint ~ delivery_time + wait_time + distance + weather + cuisine + market_fixed_effects + hour_fixed_effects helps prioritize hypotheses. Interpret coefficients cautiously when features are downstream of each other.
-
Experiment design should match the intervention level. If changing dispatch timing, randomize at order or merchant-market-time level depending on interference. If Dashers can receive both treatment and control orders in the same shift, spillovers can contaminate estimates; cluster randomization may be cleaner but lower-powered.
-
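Power analysis should be discussed for rare events like cold-food complaints. For a binary metric, approximate sample size per arm scales as $n \approx 16\,p(1-p)/\delta^2$ for 80% power and 5% two-sided alpha under common assumptions, where $p$ is the baseline rate and $\delta$ is the absolute difference to detect. If baseline complaint rate is 2%, detecting a 5% relative lift means $\delta = 0.001$, which requires on the order of 300,000 orders per arm.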
-
Instrumentation quality is fair game from an analysis lens: know which signals you would query, such as GPS pings, merchant tablet timestamps, ETA quotes, Dasher arrival events, customer complaint labels, refund reasons, weather, and support tickets. Do not design pipelines; instead, state how you would validate timestamp consistency, missingness, and label bias.
-
Marketplace interference is a central edge case. Treating one set of orders can affect Dasher availability, merchant congestion, or batching opportunities for untreated orders. For high-interference interventions, prefer market-level switchbacks, geo experiments, or time-block randomization over naive user-level A/B tests.
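To make the power point in the list above concrete, here is a small sketch comparing a common rule of thumb, $n \approx 16\,p(1-p)/\delta^2$, with a standard two-proportion z-test approximation. The function names are illustrative, and the inputs (2% baseline, 5% relative lift) are the ones used in the text. It assumes independent orders, so it understates the requirement when randomization is clustered.

```python
from statistics import NormalDist


def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p_treat = p_base * (1 + rel_lift)
    delta = p_treat - p_base  # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * variance / delta ** 2


def n_rule_of_thumb(p_base, rel_lift):
    """Rule of thumb for 80% power and 5% two-sided alpha: 16 * p * (1 - p) / delta^2."""
    delta = p_base * rel_lift
    return 16 * p_base * (1 - p_base) / delta ** 2


exact = n_per_arm(0.02, 0.05)
approx = n_rule_of_thumb(0.02, 0.05)
```

Both approaches land a little above 300,000 orders per arm for these inputs, which is why rare-event metrics often need long tests, larger effects, or a more sensitive proxy metric.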
Worked example
For Diagnose cold-food spike and design experiments, a strong first 30 seconds would clarify: “Is the spike in complaint rate, refund rate, low ratings mentioning temperature, or all of them? Is it localized by market, merchant, cuisine, weather, or time? Did any product, dispatch, batching, or ETA policy change recently?” I would define the primary metric as cold_food_complaint_rate = cold_food_complaints / delivered_orders, with supporting metrics like delivery_duration_after_pickup, dasher_wait_time, merchant_prep_time, batch_rate, distance, refund_rate, and customer_rating.
The answer should be organized around four pillars: validate the metric, localize the spike, form causal hypotheses, and design interventions. First, I would check whether the complaint-label mix changed, because a support taxonomy or in-app prompt change could create an artificial spike. Second, I would decompose the lifecycle to see whether cold complaints correlate more with long post-pickup travel, long Dasher wait after food is ready, excessive batching, weather, or certain cuisines with low heat retention. Third, I would compare affected versus unaffected cohorts using pre/post trends, merchant fixed effects, weather controls, and distance controls.
A specific tradeoff to flag: reducing cold food by assigning Dashers earlier may reduce post-ready wait, but it can increase dasher_wait_time at merchants and lower Dasher earnings per active hour. For experiments, I might test a revised dispatch rule for high-risk orders, such as hot-food cuisines during cold weather and long-distance trips, randomizing at merchant-market-time blocks to reduce interference. The primary success metric would be cold complaints, while guardrails include late delivery, Dasher wait, cancellation, batching efficiency, and customer reorder rate. I would close by saying that, with more time, I would build a risk model for cold-food probability and evaluate whether targeted interventions outperform blanket dispatch changes.
A second angle
For Investigate Causes of Increased Driver Wait Time, the same diagnostic playbook applies, but the customer-facing symptom is less direct and the unit economics tradeoff is more explicit. I would define dasher_wait_time as time from Dasher arrival at merchant to pickup, then segment by merchant, cuisine, order size, prep-time estimate accuracy, acceptance time, and peak-hour congestion. The causal hypotheses shift toward merchant prep delays, early dispatch, inaccurate ready-time predictions, and pickup process friction. The experiment might test later dispatch or improved prep-time prediction, but guardrails become late_delivery_rate, customer_wait_time, and dasher_utilization. Unlike cold-food complaints, this metric is observed for nearly every order with valid GPS and pickup timestamps, so power is usually easier, but timestamp accuracy and arrival geofence definitions become more important.
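As a sketch of the first analysis step described above, the snippet below computes dasher_wait_time from arrival and pickup timestamps, drops rows that fail basic validation, and reports percentiles by segment. The field names (merchant_type, arrived_at, picked_up_at) and all data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import quantiles


def wait_minutes(row):
    """Minutes from Dasher arrival to pickup, or None if the row is unusable."""
    arrived, picked_up = row["arrived_at"], row["picked_up_at"]
    if arrived is None or picked_up is None:
        return None  # missing timestamp
    minutes = (picked_up - arrived).total_seconds() / 60
    return minutes if minutes >= 0 else None  # negative wait: bad event order


def wait_percentiles(rows, segment_key, pcts=(50, 90)):
    """Per-segment wait percentiles plus a count of rows dropped in validation."""
    waits, dropped = defaultdict(list), 0
    for row in rows:
        minutes = wait_minutes(row)
        if minutes is None:
            dropped += 1
        else:
            waits[row[segment_key]].append(minutes)
    summary = {}
    for segment, values in waits.items():
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        summary[segment] = {f"p{p}": cuts[p - 1] for p in pcts}
    return summary, dropped


t0 = datetime(2024, 1, 1, 18, 0)  # made-up arrival time
rows = [
    {"merchant_type": "fast_food", "arrived_at": t0,
     "picked_up_at": t0 + timedelta(minutes=m)}
    for m in (2, 4, 6, 8, 10)
]
rows.append({"merchant_type": "fast_food", "arrived_at": t0, "picked_up_at": None})
rows.append({"merchant_type": "fast_food", "arrived_at": t0,
             "picked_up_at": t0 - timedelta(minutes=3)})

summary, dropped = wait_percentiles(rows, "merchant_type")
```

Reporting the dropped-row count alongside the percentiles keeps timestamp quality visible, since a change in geofence or event logging can move the metric without any real change in merchant behavior. Each segment needs at least two valid rows for the percentile call to be meaningful.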
Common pitfalls
Pitfall: Jumping straight to “bad weather caused it” without a counterfactual.
Weather is a plausible driver of late or cold deliveries, but it often coincides with demand spikes, Dasher supply shortages, and merchant congestion. A better answer compares similar markets or time windows with and without weather shocks, controls for demand and supply, and checks whether the spike remains after decomposition.
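One way to build that counterfactual is a simple difference-in-differences between a market hit by a weather shock and a comparable market that was not. The sketch below uses made-up daily late-delivery rates and assumes parallel trends between the two markets, which would need to be checked on pre-period data.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(treated post - pre) minus (control post - pre), using group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up daily late_delivery_rate values for illustration only.
rainy_pre = [0.10, 0.11, 0.09]
rainy_post = [0.16, 0.15, 0.17]
dry_pre = [0.08, 0.09, 0.07]
dry_post = [0.10, 0.11, 0.09]

did = diff_in_diff(rainy_pre, rainy_post, dry_pre, dry_post)

# Placebo check: apply the same estimator to two earlier windows in which
# neither market had a weather shock; the estimate should be near zero.
placebo = diff_in_diff([0.10, 0.10], [0.10, 0.11], [0.08, 0.08], [0.08, 0.09])
```

Here the rainy market's late rate rises by 6 points while the dry market's rises by 2, so only about 4 points are attributable to the shock under the stated assumption; the remaining 2 points reflect something common to both markets.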
Pitfall: Defining one metric and ignoring marketplace guardrails.
A tempting answer is “optimize delivery time” or “reduce complaints” in isolation. DoorDash problems are multi-sided: an intervention can help customers while hurting Dashers or merchants, so strong candidates explicitly define primary, secondary, and guardrail metrics before proposing experiments.
Pitfall: Over-fragmenting the analysis into endless cuts.
Listing every possible segment sounds thorough but can signal lack of prioritization. Start with the order lifecycle, identify where the metric moved most, then segment within that stage using hypotheses; if many comparisons are run, mention false discovery risk and validation on a holdout period.
Connections
Interviewers may pivot from this topic into A/B testing, causal inference, marketplace experimentation, or ETA model evaluation. Be ready to discuss switchback experiments, interference, power for low-base-rate metrics, and how offline diagnostics translate into online decision-making.
Further reading
-
Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu — practical foundation for experiment design, guardrails, novelty effects, and decision quality.
-
Causal Inference: The Mixtape by Scott Cunningham — accessible treatment of difference-in-differences, fixed effects, and causal identification.
-
Mostly Harmless Econometrics by Angrist and Pischke — deeper reference for regression, natural experiments, and credible empirical strategy.
Practice questions

What's being tested
DoorDash is probing whether you can design causal experiments in a marketplace where one user’s treatment can affect another user’s outcome. Classic user-level A/B tests often fail because supply, demand, prices, batching, dispatch, and delivery times are coupled within a local market. A strong Data Scientist answer shows that you can choose the right randomization unit, control interference, define marketplace-wide metrics, and analyze results with appropriate fixed effects, clustered uncertainty, and operational guardrails. Interviewers are looking for practical judgment: not “run an A/B test,” but “what experiment is valid for this marketplace mechanism, and what tradeoffs does it create?”
Core knowledge
-
Marketplace interference occurs when treatment changes the shared environment: available Dashers, delivery ETAs, restaurant prep queues, batching opportunities, or consumer conversion. If treated users consume scarce supply, control users may see worse ETA, causing a biased estimate of treatment impact.
-
SUTVA — the Stable Unit Treatment Value Assumption — is often violated in delivery marketplaces. The assumption says each unit’s outcome depends only on its own treatment. For DoorDash, that can fail because orders, Dashers, merchants, and consumers interact through shared local supply-demand balance.
-
Switchback experiments randomize treatment by geography-time cells, such as market_id × hour or zone × daypart, instead of by user. The treatment switches on and off over time within the same market, reducing contamination when marketplace state is shared locally.
-
Randomization unit choice should match the interference radius. For dispatch, batching, fees, or delivery promises, prefer zone × time or market × time. For consumer UI copy with minimal supply effects, user-level randomization may be acceptable. For merchant prep changes, merchant-level or geo-level designs may be cleaner.
-
Primary metrics should reflect the full marketplace objective, not one side only. Common DoorDash metrics include order_volume, conversion_rate, gross_order_value, contribution_profit, delivery_time, ETA_accuracy, Dasher_utilization, Dasher_earnings, cancellation_rate, refund_rate, and merchant acceptance or prep latency.
-
Guardrail metrics catch harms hidden by the primary metric. For example, order batching may improve Dasher_utilization and profit but worsen delivery_time, cold food complaints, NPS, or reorder rate. A fee increase may improve per-order margin but reduce conversion or long-run retention.
-
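Difference-in-differences with fixed effects is a common analysis frame for switchbacks: $Y_{gt} = \alpha_g + \gamma_t + \beta\,\text{Treat}_{gt} + \varepsilon_{gt}$, where $g$ is geography, $t$ is time, $\alpha_g$ controls persistent market differences, $\gamma_t$ controls common time shocks, and $\beta$ is the treatment effect.
-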
Clustered standard errors are usually required because observations within a market-time cell are correlated. If treatment is assigned at zone × hour, uncertainty should be clustered at the randomization level or higher. Treating millions of orders as independent can massively overstate significance.
-
Power calculations should use the effective sample size of randomized cells, not raw orders. A test with 50 markets over 14 days and 4 dayparts has roughly $50 \times 14 \times 4 = 2{,}800$ assignment cells before accounting for autocorrelation, imbalance, and clustering.
-
Stratification and blocking improve precision. Randomize within market, day of week, and daypart so treatment and control both cover lunch, dinner, weekdays, weekends, high-demand zones, and low-demand zones. This is especially important when demand is highly seasonal or weather-sensitive.
-
Carryover effects can bias switchbacks if treatment changes future states. For example, a bad delivery experience in a treated period may reduce future orders in a control period, or Dasher positioning from one hour may affect the next hour. Use washout periods or coarser time blocks when carryover is plausible.
-
Noncompliance and partial exposure should be measured explicitly. If a batching algorithm is “on” but only affects 20% of eligible orders, analyze both intent-to-treat and exposure-adjusted effects carefully. ITT preserves randomization; treatment-on-treated estimates need stronger assumptions.
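The blocking idea in the list above can be sketched as a small assignment routine. This is one possible scheme under stated assumptions, not a prescribed design: it randomizes dayparts within each zone-day so that every zone-day is split evenly between arms, and it ignores washout periods and carryover entirely. The zone and daypart names are hypothetical.

```python
import random
from itertools import product


def assign_switchback(zones, days, dayparts, seed=0):
    """Assign each (zone, day, daypart) cell to an arm, balanced within zone-day."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    assignment = {}
    for zone, day in product(zones, days):
        arms = ["treatment", "control"] * (len(dayparts) // 2)
        if len(dayparts) % 2:  # odd number of dayparts: randomize the extra cell
            arms.append(rng.choice(["treatment", "control"]))
        rng.shuffle(arms)
        for daypart, arm in zip(dayparts, arms):
            assignment[(zone, day, daypart)] = arm
    return assignment


zones = [f"zone_{i}" for i in range(50)]
days = list(range(14))
dayparts = ["breakfast", "lunch", "dinner", "late_night"]

assignment = assign_switchback(zones, days, dayparts, seed=7)
n_cells = len(assignment)
```

With 50 zones, 14 days, and 4 dayparts this yields 2,800 assignment cells, matching the arithmetic in the power bullet. Balancing within zone-day guarantees that every zone and every day contributes equally to both arms; balance across dayparts holds only in expectation, so a stricter design might also block on day of week and daypart.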
Worked example
For Design and analyze a switchback experiment, a strong candidate would start by clarifying the intervention, the marketplace surface it affects, and the interference radius: “Is this changing dispatch, pricing, batching, or consumer experience, and does it affect shared Dasher supply within a zone?” They would state that a user-level A/B test is likely invalid if the treatment changes supply allocation or delivery timing, because treated and control orders compete for the same Dashers.
The answer should be organized around four pillars: randomization design, metric framework, statistical analysis, and operational risks. For design, propose randomizing zone × time block, such as zone-by-2-hour windows, stratified by market, day of week, and daypart. For metrics, define a primary marketplace metric like contribution_profit_per_order or orders_per_consumer_session, plus guardrails such as delivery_time, cancellation_rate, Dasher_earnings, and merchant_lateness.
For analysis, aggregate outcomes to the assignment cell and estimate a regression with zone fixed effects and time fixed effects, using standard errors clustered by zone or assignment cell. A specific tradeoff to flag is time-block length: shorter blocks create more randomized units and power, but increase carryover risk because Dasher locations and batching queues persist across adjacent periods. Longer blocks reduce carryover but create fewer independent observations and may be more exposed to demand shocks.
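A stripped-down version of the cell-level analysis looks like the sketch below. It aggregates order-level outcomes to assignment cells and compares arm means with a standard error computed across cells. It deliberately omits the zone and time fixed effects described above, so treat it as a first-pass estimate rather than the full regression. Cell IDs and delivery times are made up.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def cell_means(orders):
    """Collapse (cell_id, arm, outcome) order rows to one mean per cell."""
    totals = defaultdict(lambda: [0.0, 0])
    arm_of = {}
    for cell_id, arm, outcome in orders:
        totals[cell_id][0] += outcome
        totals[cell_id][1] += 1
        arm_of[cell_id] = arm
    return {cell: (arm_of[cell], s / n) for cell, (s, n) in totals.items()}


def cell_level_effect(orders):
    """Difference in cell means (treatment - control) and its standard error."""
    by_arm = defaultdict(list)
    for arm, cell_mean in cell_means(orders).values():
        by_arm[arm].append(cell_mean)
    treated, control = by_arm["treatment"], by_arm["control"]
    effect = mean(treated) - mean(control)
    # The unit of analysis is the cell, so variance is taken across cells.
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return effect, se


# Made-up delivery times in minutes for four zone-by-block cells.
orders = [
    ("z1-b1", "treatment", 30), ("z1-b1", "treatment", 32),
    ("z2-b1", "treatment", 29),
    ("z1-b2", "control", 34),
    ("z2-b2", "control", 35), ("z2-b2", "control", 37),
]
effect, se = cell_level_effect(orders)
```

Six orders collapse to four cells, and the uncertainty reflects four observations rather than six. The same logic is what prevents millions of order rows from producing an overconfident result.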
A strong close would say: “If I had more time, I’d run pre-experiment simulations using historical order streams to estimate power, check balance, select block length, and stress-test sensitivity to clustering and carryover.”
A second angle
For Evaluate Impact of $1 Fee on Fast-Food Profitability, the same interference logic applies, but the treatment is a pricing change rather than an operational algorithm. A naive consumer-level A/B test may appear reasonable, but if the fee reduces demand in some areas, it can free Dasher supply, improve ETAs, and indirectly affect control consumers nearby. The randomization unit could be consumer-level if the fee is small and supply effects are negligible, but market-time randomization is safer if demand shifts are expected to alter delivery speed or Dasher allocation.
The metric frame also changes: the primary metric should be something like profit_per_visitor or contribution_profit_per_session, not just fee revenue per order. The candidate should explicitly decompose impact into conversion loss, order mix changes, average basket size, delivery cost, refund/cancel rates, and repeat behavior.
Common pitfalls
Pitfall: Saying “randomize users 50/50 and compare average order value” for dispatch, batching, or supply-sensitive changes.
That answer ignores interference. A better response explains why treated and control users share Dashers and restaurants, then proposes geo-time randomization or a switchback design that aligns assignment with the level where marketplace state is shared.
Pitfall: Optimizing one side of the marketplace while ignoring the others.
For example, adding bicycle Dashers might improve supply density and reduce cost in dense zones, but could worsen long-distance delivery times, Dasher earnings mix, or merchant pickup congestion. Strong answers define consumer, Dasher, merchant, and business metrics, then name the intended primary metric and guardrails.
Pitfall: Treating order-level rows as independent observations.
If 200,000 orders come from 500 randomized zone-hour cells, the experiment has closer to hundreds of independent units than hundreds of thousands. The right move is to analyze at the assignment-cell level or use fixed effects with clustered standard errors, then discuss power in terms of clusters and intraclass correlation.
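The gap between raw rows and independent information can be approximated with the standard design-effect formula, $1 + (m - 1)\rho$, where $m$ is the average cluster size and $\rho$ is the intraclass correlation. The sketch below applies it to the 200,000-order, 500-cell example; the ICC values are hypothetical and would need to be estimated from historical data, and equal cell sizes are assumed.

```python
def effective_sample_size(n_orders, n_cells, icc):
    """Effective number of independent observations under equal-sized clusters."""
    avg_cell_size = n_orders / n_cells
    design_effect = 1 + (avg_cell_size - 1) * icc
    return n_orders / design_effect


# Hypothetical ICC values spanning no correlation to perfect correlation.
scenarios = {
    icc: effective_sample_size(200_000, 500, icc)
    for icc in (0.0, 0.05, 0.5, 1.0)
}
```

At an ICC of 0 the orders really are independent; at an ICC of 1 the experiment has exactly 500 effective units. Where a real test lands between those bounds depends on how strongly outcomes within a zone-hour move together, which is why power should be discussed in terms of clusters and intraclass correlation.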
Connections
Interviewers may pivot from this topic into causal inference, especially difference-in-differences, synthetic controls, CUPED, and heterogeneous treatment effects. They may also ask about marketplace metric design, pricing experimentation, ranking and dispatch evaluation, or diagnosing a launch where conversion_rate, ETA, and profit move in opposite directions.
Further reading
-
Kohavi, Tang, and Xu, Trustworthy Online Controlled Experiments — Practical treatment of online experiments, guardrails, variance reduction, and common validity threats.
-
Blake and Coey, “Why Marketplace Experimentation Is Harder Than It Seems” — Seminal discussion of interference and marketplace experiment design, especially for two-sided platforms.
-
Angrist and Pischke, Mostly Harmless Econometrics — Strong foundation for fixed effects, clustered standard errors, difference-in-differences, and causal interpretation.
Practice questions
What's being tested
DoorDash is probing whether a Data Scientist can evaluate pricing, incentive, and fulfillment changes through the lens of unit economics, causal inference, and marketplace experimentation. Strong answers connect customer behavior, merchant pricing, Dasher supply, and contribution margin without drifting into engineering or purely strategic hand-waving. Interviewers are looking for metric judgment: what is the right success metric, what are the guardrails, how would you estimate impact credibly, and how would you handle interference across a three-sided marketplace? The best candidates can reason from both experimental evidence and observational data, quantify uncertainty, and explain tradeoffs between growth, profitability, and marketplace health.
Core knowledge
-
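Unit economics means measuring profit per transaction, user, merchant, geography, or cohort. A basic delivery-order margin frame is: $\text{contribution margin per order} = \text{revenue per order} - \text{variable cost per order}$. Clarify whether fixed costs are excluded.
-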
Pricing changes affect both revenue and demand, so never evaluate a fee only by incremental dollars per order. Estimate the net change in contribution profit while accounting for changes in conversion_rate, order_frequency, AOV, retention, refund_rate, and Dasher_utilization.
-
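Elasticity is central for fee and menu-price questions. Own-price elasticity is often approximated as: $\varepsilon = \dfrac{\%\Delta Q}{\%\Delta P}$. For example, an elasticity of $-0.6$ means a 1% price increase reduces quantity demanded by about 0.6%. Segment by cuisine, income proxy, geography, new versus existing users, and time of day.
-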
Marketplace interference is a major experimental risk. Treating consumers may change Dasher availability, delivery times, merchant batching, or marketplace liquidity for control users. For strong interference, prefer geo-randomization, switchback experiments, market-level randomization, or cluster-robust inference over naive user-level A/B testing.
-
Randomization unit should match the mechanism. A consumer fee can often randomize by user, but Dasher programs or fulfillment modes may require randomizing by market, zone, or time block. Merchant price inflation analyses are usually observational, requiring matched item panels and careful sampling rather than simple randomized experiments.
-
Primary metrics should reflect the business decision. For a fee test, the likely primary metric is contribution_profit_per_visitor or contribution_profit_per_order, not just revenue_per_order. Guardrails include order_volume, conversion_rate, delivery_time, refund_rate, merchant_cancellations, Dasher_acceptance_rate, and customer retention.
-
Intent-to-treat estimates are usually the default for experiments: compare all users assigned to treatment versus control, regardless of exposure. If only some users see the fee, report both ITT and treatment-on-treated using exposure or instrumental-variable logic, but do not let post-treatment filtering bias the result.
-
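Sample size and power matter because profit metrics are noisy. For a two-sample mean comparison, a rough per-arm sample size is: $n \approx \dfrac{2\sigma^2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$, where $\sigma^2$ is the outcome variance, $\delta$ is the minimum detectable difference, $\alpha$ is the significance level, and $1-\beta$ is the power. Use historical variance for profit_per_user, not just order-level variance, and account for clustering if randomizing by market.
-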
Heterogeneous treatment effects are often the story in pricing. A fee may improve profit in dense urban markets but reduce conversion in suburban areas; bike couriers may reduce cost downtown but hurt ETA in low-density zones. Pre-specify key cuts to avoid cherry-picking: geography, customer tenure, merchant vertical, basket size, and supply-demand balance.
-
Observational price inflation studies need matched merchant-item panels. Compare DoorDash menu prices versus in-store prices for the same item, merchant, geography, and time period. Watch for item substitutions, menu reclassification, taxes/fees excluded from menu price, and survivorship bias if discontinued items vanish from the panel.
-
Forecasting inflation gaps should separate signal from sampling noise. Use weighted averages by order volume or merchant mix, report confidence intervals via bootstrap or cluster-robust standard errors, and consider simple baselines before complex models: seasonal naive, ARIMA, exponential smoothing, or hierarchical time-series models by category and market.
-
Decision quality requires translating estimates into launch logic. A statistically significant lift in revenue_per_order may still be a bad launch if long-term retention falls or if Dasher supply worsens. Conversely, a small average effect may be compelling if concentrated in large, profitable segments with clean guardrails.
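The elasticity bullet above reduces to simple arithmetic. The sketch below computes a point elasticity from a before/after price and quantity pair and shows the implied revenue change. All numbers are invented, and in practice the quantity change should come from a randomized test rather than a raw before/after comparison.

```python
def own_price_elasticity(p0, p1, q0, q1):
    """Percent change in quantity divided by percent change in price."""
    pct_change_q = (q1 - q0) / q0
    pct_change_p = (p1 - p0) / p0
    return pct_change_q / pct_change_p


# Made-up example: a 10% price increase is followed by a 6% drop in orders.
price_before, price_after = 10.0, 11.0
orders_before, orders_after = 1000, 940

elasticity = own_price_elasticity(price_before, price_after, orders_before, orders_after)
revenue_before = price_before * orders_before
revenue_after = price_after * orders_after
```

With an elasticity between 0 and -1, demand is inelastic and revenue rises after the increase, but that says nothing yet about contribution profit, retention, or Dasher utilization, which is why the guardrails in the list still apply.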
Worked example
For “Evaluate Impact of $1 Fee on Fast-Food Profitability,” start by clarifying whether the $1 fee applies to all fast-food orders, a subset of users, specific markets, or certain basket sizes, and whether the goal is short-term contribution profit or long-term customer value. In the first 30 seconds, state that you would evaluate both incremental margin and demand response, because the fee mechanically increases revenue per completed order but may reduce conversion or frequency. Organize the answer around four pillars: experiment design, metric framework, interference/segmentation, and decision criteria.
For design, propose a randomized experiment at the consumer or market level depending on expected spillovers; if fast-food demand affects local Dasher availability, geo or zone-level randomization is safer. For metrics, use contribution_profit_per_eligible_user as the primary metric, with secondary diagnostics like orders_per_user, conversion_rate, AOV, fee_revenue, Dasher_pay_per_order, and refund_rate. For guardrails, include delivery_time, customer_retention, merchant_cancellations, and customer support contacts. Explicitly flag the tradeoff between user-level randomization, which is more powered, and market-level randomization, which better protects against marketplace interference. Close by saying that, with more time, you would estimate heterogeneous effects by market density, basket size, customer tenure, and competitor sensitivity before recommending a broad launch or targeted rollout.
A second angle
For “Assess Success Criteria for Bike-Courier Delivery Launch,” the same unit-economics logic applies, but the intervention changes fulfillment cost and reliability rather than consumer price. The primary question becomes whether bike couriers improve contribution_profit_per_order by lowering Dasher pay, parking delays, or batching inefficiency without worsening ETA, cancellation, or courier supply health. Randomization may need to happen by zone and time window because delivery supply is shared, making user-level treatment less credible. The analysis should also segment by density, distance, weather, order size, and merchant type because bikes may work well for short downtown trips but fail for long suburban routes. Instead of elasticity, the key mechanism is operational cost-to-serve and service quality impact.
Common pitfalls
Pitfall: Treating revenue lift as profit lift.
A tempting answer is “a $1 fee adds $1 per order, so profitability improves.” That ignores fewer orders, different basket composition, higher refunds, retention effects, and potential Dasher or merchant cost changes. A stronger answer anchors on contribution_profit_per_eligible_user and decomposes the result into price, volume, and cost components.
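A quick break-even calculation shows why revenue lift and profit lift diverge. The sketch below assumes profit per eligible user is orders per user times contribution per order, that the fee passes straight through to contribution, and that nothing else changes. The baseline figures are hypothetical.

```python
def profit_per_user(orders_per_user, contribution_per_order):
    """Contribution profit per eligible user under a simple price-times-volume model."""
    return orders_per_user * contribution_per_order


def break_even_volume_drop(contribution_per_order, fee):
    """Largest fractional drop in orders per user that leaves profit unchanged."""
    return fee / (contribution_per_order + fee)


# Hypothetical baseline: 2 orders per user at $3.00 contribution per order.
baseline_orders, baseline_contribution, fee = 2.0, 3.00, 1.00

baseline_profit = profit_per_user(baseline_orders, baseline_contribution)
threshold = break_even_volume_drop(baseline_contribution, fee)

# If the fee causes a 30% drop in orders, profit per user falls despite the fee.
orders_with_fee = baseline_orders * (1 - 0.30)
profit_with_fee = profit_per_user(orders_with_fee, baseline_contribution + fee)
```

Under these assumptions the fee pays off only if order volume falls by less than 25%. Refunds, basket mix, and retention effects would shift that threshold, which is the decomposition the stronger answer calls for.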
Pitfall: Ignoring marketplace interference.
A naive A/B test can look clean statistically while violating the stable unit treatment value assumption. If treated customers reduce demand, control customers may see faster delivery or better Dasher availability, contaminating the estimate. Call out interference early and justify the randomization unit rather than defaulting to user-level assignment.
Pitfall: Overfitting the narrative with too many segments.
Segmentation is valuable, but listing twenty cuts without a plan sounds like fishing. Pre-specify the few segments tied to mechanism: density for bike couriers, basket size for fees, cuisine or merchant type for price inflation, and tenure for retention sensitivity. Then use exploratory cuts only to generate follow-up hypotheses.
Connections
Interviewers may pivot from here into experimental design, especially cluster randomization, switchbacks, power analysis, and sequential monitoring. They may also ask about causal inference for non-randomized launches, time-series forecasting for inflation gaps, or metric design for marketplace health and profitability.
Further reading
-
Trustworthy Online Controlled Experiments — Kohavi, Tang, and Xu’s practical guide to experiment design, metrics, guardrails, and pitfalls.
-
Causal Inference: The Mixtape — accessible treatment of difference-in-differences, matching, instrumental variables, and causal identification.
-
Mostly Harmless Econometrics — rigorous foundation for applied causal inference in observational business settings.
Practice questions
Behavioral & Leadership
What's being tested
Interviewers are probing your ability to influence cross-functional stakeholders and to choose the right analytical work under resource and time constraints. They want to see measurable prioritization: you must translate technical trade-offs (statistical confidence, instrumentation, implementation cost) into business-facing choices. At DoorDash this matters because data science work touches multi-sided marketplace metrics (consumer experience, Dasher earnings, platform reliability) and wrong prioritization creates downstream operational or trust costs.
Core knowledge
-
Stakeholder mapping: list stakeholders by interest (how outcome affects them) and influence (decision authority). Use a 2×2 grid to prioritize whom to convince first and what evidence each needs.
-
Prioritization frameworks: know RICE (Reach × Impact × Confidence / Effort) and ROI / Cost of Delay (CoD) as quantitative lenses to rank tasks when resources are limited.
-
Opportunity sizing: estimate expected value as probability of success × impact if successful, and use back-of-envelope numbers to compare projects quickly.
-
Experiment vs. observational tradeoff: default to A/B testing for causal claims; when infeasible, use difference-in-differences or instrumental variables and be explicit about identifying assumptions.
-
Statistical confidence: know statistical power, Type I/II errors, and how to compute sample size for a detectable effect $\delta$ via $n \approx 2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \delta^2$ per arm.
-
Metric design: prefer leading indicators and platform-level metrics (DAU, conversion rate, take-rate) with guardrail metrics (cost per order, Dasher earnings); define numerator/denominator and one primary metric.
-
Quick triage process: 1) define primary metric 2) estimate impact and confidence 3) estimate effort/data needs 4) recommend next step (pilot, experiment, monitor).
-
Implementation cost view: distinguish analysis-only (reports, cohort slices), feature flag experiments, and product changes requiring engineering; quantify engineering weeks where possible.
-
Risk and rollback plan: always propose concrete monitoring (alert thresholds, p99 latency, side-effects on earnings) and a rollback criterion tied to primary/guardrail metrics.
-
Communication mode: convert technical uncertainty to business terms (e.g., “50% chance of improving GMV by $X/month”) and specify what decision a stakeholder must make.
Tip: use a two-slide summary for executives (one slide for impact and decision, a second for assumptions and risks) to accelerate buy-in.
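The RICE lens from the list above is easy to turn into a ranking. The sketch below scores a few hypothetical projects as Reach × Impact × Confidence / Effort; the project names, scales, and numbers are invented purely to show the mechanics.

```python
def rice_score(reach, impact, confidence, effort):
    """RICE = Reach x Impact x Confidence / Effort."""
    return reach * impact * confidence / effort


# Hypothetical backlog: reach in users per month, impact on a relative scale,
# confidence as a probability, effort in analyst-weeks.
projects = {
    "eta_model_refresh": dict(reach=50_000, impact=1.0, confidence=0.5, effort=8),
    "checkout_funnel_audit": dict(reach=20_000, impact=0.5, confidence=0.8, effort=1),
    "promo_targeting": dict(reach=80_000, impact=2.0, confidence=0.3, effort=12),
}

scores = {name: rice_score(**inputs) for name, inputs in projects.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy backlog the small, high-confidence audit ranks first because its effort is so low. The score is only a conversation starter: the inputs are estimates, so it helps to show stakeholders how the ranking changes when confidence or effort moves.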
Worked example — Convince Stakeholders: Prioritize Data Science Projects Effectively
Frame: in the first 30 seconds ask which business metric leadership cares about (e.g., overall GMV, order completion rate, Dasher earnings), timeline (weeks vs months), and what constraints (no-engineering, A/B test window) exist. Skeleton: (1) quantify expected impact via opportunity sizing, (2) state confidence — can we causally estimate it? (3) estimate required effort and time-to-decision, and (4) propose a concrete next step (pilot, full experiment, or instrumentation). For example, produce a short table showing: project A EV = 80k/mo, confidence 80%, effort 1 analyst-week. Explicit tradeoff: prioritize lower-EV but high-confidence, low-effort projects early to build momentum, while reserving high-EV/low-confidence items for experiments. Flag a design decision: if an RCT is infeasible due to marketplace spillovers, propose a staggered roll-out and synthetic control as an alternative, noting identification limits. Close with a monitoring and rollback plan and a list of what you’d do with more time: deeper causal diagnostics, better instrument, or full marketplace simulation.
A second angle — Handle conflict and time-pressured decision
Under time pressure, synthesize evidence quickly: gather the available point estimates, their variance, and the primary guardrails, then produce a one-page recommendation with a clear binary ask. Present two options with expected outcomes and worst-case scenarios. If conflict arises between stakeholders (e.g., Product wants speed, Operations wants safety), propose a phased approach: an immediate safe default (a low-risk change plus monitoring), a parallel experiment for the higher-risk option, and pre-agreed stop criteria. Emphasize listening: surface the single constraint that would make each stakeholder change their preference (e.g., a minimum acceptable increase in completion rate). Finally, commit to a short post-decision evaluation window and to transparent metrics reporting.
Common pitfalls
Pitfall: Choosing projects by novelty rather than measurable impact.
Many candidates list “interesting ML” or “cool model” as top priorities. Better: rank by expected business value and time-to-decision; an unvalidated model with high complexity often yields lower near-term impact.
Pitfall: Hiding uncertainty in point estimates.
Presenting a single number without a confidence interval or sensitivity analysis misleads stakeholders. Show ranges, probability statements, and the assumptions driving those ranges so decisions reflect uncertainty.
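A simple way to show a range instead of a single number is a low/base/high sensitivity analysis over the inputs of the opportunity-sizing formula. The sketch below assumes a hypothetical formula (affected orders times relative lift times value per incremental order) and hypothetical input ranges; a real analysis would substitute its own drivers.

```python
from itertools import product


def monthly_impact(orders_per_month, relative_lift, value_per_order):
    """Opportunity size = affected orders x lift x value per incremental order."""
    return orders_per_month * relative_lift * value_per_order


# Hypothetical (low, base, high) assumptions for each input.
assumptions = {
    "orders_per_month": (800_000, 1_000_000, 1_200_000),
    "relative_lift": (0.005, 0.010, 0.020),
    "value_per_order": (3.0, 4.0, 5.0),
}

base = monthly_impact(*(v[1] for v in assumptions.values()))
scenarios = [monthly_impact(*combo) for combo in product(*assumptions.values())]
low, high = min(scenarios), max(scenarios)
print(f"Base case: ${base:,.0f}/month (range ${low:,.0f} to ${high:,.0f})")

# One-at-a-time sensitivity: vary a single input, hold the others at base.
for i, name in enumerate(assumptions):
    outcomes = []
    for value in (assumptions[name][0], assumptions[name][2]):
        inputs = [v[1] for v in assumptions.values()]
        inputs[i] = value
        outcomes.append(monthly_impact(*inputs))
    print(f"{name}: ${min(outcomes):,.0f} to ${max(outcomes):,.0f}")
```

The one-at-a-time loop shows which assumption drives most of the spread, which is usually the assumption worth validating first.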
Pitfall: Ignoring guardrails and multi-sided effects.
Recommending changes that optimize consumer conversion but reduce Dasher earnings or increase cancellations misses marketplace balance. Always include guardrail metrics and a rollback threshold.
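A rollback threshold can be written down before launch as an explicit check. The following is a minimal sketch under stated assumptions: the `Guardrail` class, the metric names, the thresholds, and the observed values are all hypothetical, and the comparison uses raw point estimates, whereas a production check would also account for noise (for example, by comparing a confidence bound to the threshold).

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str
    max_relative_harm: float      # e.g. 0.02 = roll back if harmed by more than 2%
    higher_is_better: bool = True


def is_breached(guardrail, control, treatment):
    """True if treatment moves the metric in the harmful direction past the threshold."""
    relative_change = (treatment - control) / control
    harm = -relative_change if guardrail.higher_is_better else relative_change
    return harm > guardrail.max_relative_harm


guardrails = [
    Guardrail("dasher_earnings_per_hour", 0.02),
    Guardrail("cancellation_rate", 0.10, higher_is_better=False),
]

# Hypothetical observed (control, treatment) values per metric.
observed = {
    "dasher_earnings_per_hour": (20.00, 19.40),  # about a 3% drop
    "cancellation_rate": (0.0300, 0.0305),       # about a 1.7% relative increase
}

violations = [g.metric for g in guardrails if is_breached(g, *observed[g.metric])]
decision = "ROLL BACK" if violations else "CONTINUE"
print(decision, violations)
```

Here the earnings guardrail is breached while the cancellation guardrail is not, so the pre-agreed rule calls for a rollback even if consumer conversion improved.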
Connections
Interviewers may pivot to adjacent topics like experiment design (power calculations, sequential testing), metric instrumentation (event taxonomy and reliability), or model evaluation (offline metrics vs online A/B outcomes). Be prepared to show how prioritization links to concrete experiment or modeling plans.
Further reading
-
RICE: A simple prioritization framework — practical, quick method to score projects by reach, impact, confidence, and effort.
-
Trustworthy Online Controlled Experiments — Kohavi et al. (book/paper) — deep treatment of experiment design, diagnostics, and practical pitfalls in large-scale online testing.
Practice questions