Marketplace Metric Frameworks

What's being tested

DoorDash is probing whether you can build a marketplace measurement framework for a three-sided system: consumers, Dashers, and merchants. Strong answers connect business goals like completed_orders, gross_order_value, and contribution_profit to operational realities like Dasher supply, delivery reliability, conversion funnels, and merchant availability. The interviewer is not looking for a generic metric list; they are testing whether you can define the right primary metric, choose meaningful guardrails, segment intelligently, and design an experiment or quasi-experiment that supports causal claims. DoorDash cares because marketplace changes often help one side while hurting another, so a Data Scientist must quantify tradeoffs rather than optimize a single surface metric.

Core knowledge

North Star metrics should reflect marketplace value creation, not just activity. For DoorDash, likely candidates include completed_orders, order_volume, gross_order_value, delivery_success_rate, or contribution_profit, depending on the problem. Always clarify whether the goal is growth, reliability, profitability, or liquidity.
Three-sided marketplace decomposition is essential. A change in completed_orders can be traced to consumer demand, merchant supply, or Dasher capacity by writing the metric as a funnel product:
$\text{Completed Orders} = \text{Visitors} \times \text{Conversion Rate} \times \text{Checkout Success} \times \text{Fulfillment Success}$
The early terms mostly reflect consumer demand and merchant selection, while fulfillment success itself depends on acceptance, pickup, delivery, cancellation, and lateness.
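Because the decomposition is multiplicative, the log change in completed orders splits additively across the factors, which gives a quick way to attribute a drop. The sketch below assumes hypothetical before/after values and factor names; it illustrates the arithmetic and is not a DoorDash method.

```python
import math

def funnel_orders(visitors, conversion, checkout_success, fulfillment_success):
    """Completed orders as the product of the four funnel factors."""
    return visitors * conversion * checkout_success * fulfillment_success

def log_contributions(before, after):
    """Log change of each factor; these sum to the log change in completed orders."""
    return {name: math.log(after[name] / before[name]) for name in before}

# Hypothetical values for illustration only.
before = {"visitors": 100_000, "conversion": 0.20,
          "checkout_success": 0.95, "fulfillment_success": 0.96}
after = {"visitors": 100_000, "conversion": 0.18,
         "checkout_success": 0.95, "fulfillment_success": 0.93}

contrib = log_contributions(before, after)
total = math.log(funnel_orders(**after) / funnel_orders(**before))

# Share of the total log change explained by each factor.
shares = {name: value / total for name, value in contrib.items()} if total else {}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of the change")
```

In this made-up case visitors and checkout success are flat, so the whole drop is split between conversion and fulfillment, with conversion carrying the larger share.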
Funnel metrics should separate intent from execution. For app install or ordering flows, track landing_page_views, click_to_app_store, install_start, install_complete, first_open, signup_complete, first_order_started, and first_order_completed. A DS should define denominators precisely because a 5% lift in install_rate may not translate into any lift in first_order_rate.
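One way to keep denominators explicit is to report each step twice: relative to the previous step and relative to the top of the funnel. The sketch below uses the event names from the text with hypothetical counts, and shows a case where the install rate rises while first orders per visitor stay flat.

```python
import math

FUNNEL_STEPS = [
    "landing_page_views", "click_to_app_store", "install_start",
    "install_complete", "first_open", "signup_complete",
    "first_order_started", "first_order_completed",
]

def funnel_rates(counts):
    """Per-step rate (vs. previous step) and cumulative rate (vs. top of funnel)."""
    top = counts[FUNNEL_STEPS[0]]
    return {
        step: {
            "step_rate": counts[step] / counts[prev],
            "rate_vs_top": counts[step] / top,
        }
        for prev, step in zip(FUNNEL_STEPS, FUNNEL_STEPS[1:])
    }

# Hypothetical counts: treatment lifts installs by 5% but not first orders.
control = dict(zip(FUNNEL_STEPS, [10_000, 3_000, 2_400, 2_000, 1_800, 1_200, 700, 500]))
treatment = dict(zip(FUNNEL_STEPS, [10_000, 3_100, 2_500, 2_100, 1_850, 1_210, 700, 500]))

control_rates = funnel_rates(control)
treatment_rates = funnel_rates(treatment)

for step in ("install_complete", "first_order_completed"):
    print(step, control_rates[step]["rate_vs_top"], treatment_rates[step]["rate_vs_top"])
```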
Guardrail metrics protect marketplace health. Typical guardrails include cancellation_rate, lateness_rate, refund_rate, support_contact_rate, Dasher_utilization, merchant_order_error_rate, average_delivery_time, and consumer_NPS. A launch that increases demand but worsens delivery_success_rate may be a bad tradeoff.
Segmentation is not optional in local marketplaces. Analyze by market, submarket, cuisine, time of day, weekday/weekend, new versus returning consumers, Dasher vehicle type, merchant density, and weather or event periods. DoorDash effects are often heterogeneous: Los Angeles lunch may behave differently from suburban dinner.
Metric hierarchy helps communicate clearly. Use a tree: business outcome → driver metrics → diagnostic metrics. For example, completed_orders (business outcome) → checkout_conversion, available_merchants, average_delivery_fee, ETA, and Dasher_supply_hours (driver metrics) → assignment_rate, acceptance_rate, pickup_wait_time, and merchant_prep_time (diagnostic metrics).
Experiment unit choice determines validity. Consumer-facing UI changes may randomize by user or session; Dasher incentives may require market-level or Dasher-level randomization; new-market rollouts often need geo-level designs. Watch for interference, where treatment changes supply-demand balance for untreated users in the same market.
Causal inference methods matter when randomized tests are impractical. Use difference-in-differences when treated and control markets have parallel pre-trends, synthetic control for one or few treated geographies, and interrupted time series for broad launches. Always inspect pre-period trends and seasonality before claiming causality.
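The simplest two-group, two-period difference-in-differences estimate can be sketched as follows. The daily completed-order series are hypothetical, and the crude pre-trend check is only a first look; a real analysis would plot the full pre-period and model seasonality.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(Change in treated mean) minus (change in control mean)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

def pre_trend_gap(treated_pre, control_pre):
    """Crude check: difference in first-to-last pre-period movement between groups."""
    return (treated_pre[-1] - treated_pre[0]) - (control_pre[-1] - control_pre[0])

# Hypothetical daily completed orders for one treated and one control market.
treated_pre = [1000.0, 1020.0, 980.0, 1000.0]
treated_post = [1100.0, 1120.0, 1080.0, 1100.0]
control_pre = [800.0, 810.0, 790.0, 800.0]
control_post = [840.0, 850.0, 830.0, 840.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
gap = pre_trend_gap(treated_pre, control_pre)
print(f"DiD estimate: {effect:+.1f} orders/day, pre-trend gap: {gap:+.1f}")
```

The estimate only has a causal reading if the control market would have tracked the treated market without the launch, which is why the pre-period comparison comes first.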
Power and minimum detectable effect should be discussed at the metric level. A binary metric like conversion has approximate standard error $\sqrt{p(1-p)/n}$ for baseline rate $p$ and sample size $n$; rare events like fraud or severe lateness need much larger samples to detect the same relative change. For geo experiments, the effective sample size is closer to the number of markets than to the number of users.
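To make the sample-size point concrete, here is a minimal sketch of the standard normal-approximation formula for a two-sided, two-proportion test with equal arms. The baseline rates and the 5% relative lift are assumptions for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect a relative lift in a binary metric."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm Bernoulli variances
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Same 5% relative lift, common vs. rare event (hypothetical baselines).
n_conversion = sample_size_per_arm(baseline_rate=0.20, relative_lift=0.05)
n_rare_event = sample_size_per_arm(baseline_rate=0.01, relative_lift=0.05)
print(n_conversion, n_rare_event)
```

The rare-event case needs a far larger sample for the same relative lift because the absolute difference shrinks faster than the variance does.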
Ratio metrics require care. Metrics like delivery_success_rate = successful_deliveries / accepted_orders can move because the numerator improves or because the denominator composition changes. Use numerator-denominator decomposition and consider delta-method, bootstrap, or cluster-robust standard errors.
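A delta-method standard error for a ratio of means can be sketched by linearizing the ratio around its estimate. The example assumes the independent units are clusters (for instance markets or Dashers), each contributing a numerator and a denominator; the data and function name are hypothetical.

```python
import math
from statistics import mean

def ratio_with_delta_se(numerators, denominators):
    """Ratio of means and its delta-method standard error across independent units."""
    n = len(numerators)
    mean_den = mean(denominators)
    ratio = mean(numerators) / mean_den
    # Linearized residuals x_i - R * y_i have mean zero by construction.
    residuals = [x - ratio * y for x, y in zip(numerators, denominators)]
    var_residual = sum(r ** 2 for r in residuals) / (n - 1)
    se = math.sqrt(var_residual / n) / mean_den
    return ratio, se

# Hypothetical per-market counts: successful deliveries and accepted orders.
successful_deliveries = [90, 150, 48, 240]
accepted_orders = [100, 200, 50, 300]

rate, se = ratio_with_delta_se(successful_deliveries, accepted_orders)
print(f"delivery_success_rate = {rate:.3f} (SE {se:.3f})")
```

Treating each order as independent would understate the uncertainty here, since orders within a market share the same supply conditions.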
Marketplace tradeoffs should be made explicit. Lower delivery fees may increase conversion_rate but reduce contribution_margin; higher Dasher incentives may improve assignment_rate but hurt profitability. A strong DS frames success as a constrained optimization: maximize primary metric subject to guardrails staying within pre-defined thresholds.
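The constrained framing can be written down as a pre-registered decision rule: guardrails are checked first, and the primary metric only matters if none is breached. This is a minimal sketch; the thresholds, metric deltas, and the `launch_decision` helper are hypothetical, and every guardrail here is assumed to be a "lower is better" rate.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    max_allowed_increase: float  # absolute increase tolerated (lower is better)

def launch_decision(primary_lift, primary_significant, guardrail_deltas, guardrails):
    """Return (decision, breached_guardrails) under a pre-defined rule."""
    breached = [g.name for g in guardrails
                if guardrail_deltas.get(g.name, 0.0) > g.max_allowed_increase]
    if breached:
        return "do_not_ship", breached
    if primary_lift > 0 and primary_significant:
        return "ship", []
    return "iterate", []

# Hypothetical thresholds agreed before the experiment starts.
guardrails = [
    Guardrail("cancellation_rate", 0.002),
    Guardrail("lateness_rate", 0.005),
    Guardrail("refund_rate", 0.001),
]

# Hypothetical treatment-minus-control deltas.
deltas = {"cancellation_rate": 0.001, "lateness_rate": 0.008, "refund_rate": 0.000}

decision, breached = launch_decision(0.03, True, deltas, guardrails)
print(decision, breached)
```

Fixing the thresholds before launch keeps the tradeoff explicit and prevents guardrails from being renegotiated after the primary metric looks good.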
Anomaly diagnosis should start broad, then narrow. Validate whether the drop is real, identify affected segments, decompose the metric, generate hypotheses, and test them with historical comparisons, cohorts, and counterfactuals. Do not jump to one cause like “fewer Dashers” without checking demand, merchant availability, app changes, pricing, and external shocks.

Worked example

For “Diagnose completed orders drop in Los Angeles”, a strong candidate would start by clarifying: “Is this a sudden or sustained drop, relative to what baseline, and is it isolated to LA or visible in comparable markets?” They would define the target metric precisely, such as completed_orders by order date, and ask whether the drop is absolute volume, seasonally adjusted volume, or conversion-adjusted volume. The answer should be organized around four pillars: first, validate the data and scope; second, decompose completed_orders into demand, conversion, and fulfillment; third, segment by geography, time, customer cohort, merchant type, and Dasher supply; fourth, test hypotheses against controls or historical baselines.

A good decomposition might compare traffic, cart_start_rate, checkout_conversion, payment_success_rate, assignment_rate, cancellation_rate, and delivery_success_rate. If LA traffic is stable but checkout conversion is down, investigate fees, ETA, app changes, or restaurant availability; if conversion is stable but fulfillment is down, inspect Dasher supply, acceptance, weather, and merchant prep time. A key tradeoff to flag is speed versus rigor: for an urgent operational issue, you would first produce a directional decomposition within hours, then follow with causal validation using matched markets or pre/post analysis. The candidate should avoid claiming causality from correlation and instead say what evidence would distinguish competing hypotheses. A strong close: “If I had more time, I’d build a market-level counterfactual using similar cities and quantify how much each driver explains of the LA shortfall.”

A second angle

For “Boost App Installs: Analyze and Experiment with Conversion Funnel”, the same framework applies, but the object is a consumer acquisition funnel rather than an operating-marketplace anomaly. The primary metric might be first_order_completed_per_landing_page_visitor, not simply app_install_rate, because installs only matter if they produce marketplace demand. The experiment could randomize mobile web visitors into different app-install prompts, deep links, incentives, or landing-page designs, with guardrails like bounce_rate, web_order_conversion, and cost_per_incremental_first_order. The main constraint is attribution: users may switch devices, install later, or already have the app, so the DS must define eligible cohorts carefully. Unlike LA diagnostics, this is more amenable to user-level A/B testing, though interference can still occur if incremental orders affect Dasher capacity in constrained markets.

Common pitfalls

Pitfall: Optimizing one metric while ignoring marketplace balance.

A tempting answer is “success is more orders” or “success is more installs.” That is incomplete at DoorDash because demand growth can worsen ETA, lateness_rate, or cancellation_rate if Dasher supply and merchant capacity do not scale. A better answer names a primary metric and 3–5 guardrails with explicit thresholds or decision rules.

Pitfall: Treating every analysis as a simple A/B test.

Many marketplace programs, especially new-market launches or Dasher supply changes, cannot be cleanly randomized at the user level. If treatment changes local liquidity, untreated users may be affected too, violating the stable unit treatment value assumption (SUTVA) that one unit's outcome does not depend on another unit's treatment. Strong candidates discuss geo experiments, difference-in-differences, synthetic controls, or staged rollouts when randomization is constrained.

Pitfall: Listing metrics without a causal story.

Interviewers are not impressed by a long dashboard inventory. They want to see how metrics connect: if completed_orders drops, what driver changed, what hypotheses explain that driver, and what evidence would confirm or reject each hypothesis? Use a metric tree and narrate the investigation path.

Connections

Interviewers may pivot from marketplace metrics into experiment design, especially unit of randomization, power, and guardrail interpretation. They may also move into causal inference, funnel analytics, cohort analysis, or ranking/recommendation evaluation if the marketplace change involves search results, ETA ranking, promotions, or personalized incentives.
