Marketplace Interference And Switchback Experiments

What's being tested

DoorDash is probing whether you can design causal experiments in a marketplace where one user’s treatment can affect another user’s outcome. Classic user-level A/B tests often fail because supply, demand, prices, batching, dispatch, and delivery times are coupled within a local market. A strong Data Scientist answer shows that you can choose the right randomization unit, control interference, define marketplace-wide metrics, and analyze results with appropriate fixed effects, clustered uncertainty, and operational guardrails. Interviewers are looking for practical judgment: not “run an A/B test,” but “what experiment is valid for this marketplace mechanism, and what tradeoffs does it create?”

Core knowledge

Marketplace interference occurs when treatment changes the shared environment: available Dashers, delivery ETAs, restaurant prep queues, batching opportunities, or consumer conversion. If treated users consume scarce supply, control users may see worse ETA, causing a biased estimate of treatment impact.
SUTVA — the Stable Unit Treatment Value Assumption — is often violated in delivery marketplaces. The assumption says each unit’s outcome depends only on its own treatment. For DoorDash, that can fail because orders, Dashers, merchants, and consumers interact through shared local supply-demand balance.
Switchback experiments randomize treatment by geography-time cells, such as market_id × hour or zone × daypart, instead of by user. The treatment switches on and off over time within the same market, reducing contamination when marketplace state is shared locally.
Randomization unit choice should match the interference radius. For dispatch, batching, fees, or delivery promises, prefer zone × time or market × time. For consumer UI copy with minimal supply effects, user-level randomization may be acceptable. For merchant prep changes, merchant-level or geo-level designs may be cleaner.
Primary metrics should reflect the full marketplace objective, not one side only. Common DoorDash metrics include order_volume, conversion_rate, gross_order_value, contribution_profit, delivery_time, ETA_accuracy, Dasher_utilization, Dasher_earnings, cancellation_rate, refund_rate, and merchant acceptance or prep latency.
Guardrail metrics catch harms hidden by the primary metric. For example, order batching may improve Dasher_utilization and profit but worsen delivery_time, cold food complaints, NPS, or reorder rate. A fee increase may improve per-order margin but reduce conversion or long-run retention.
Difference-in-differences with fixed effects is a common analysis frame for switchbacks:
$Y_{g,t} = \alpha_g + \gamma_t + \beta T_{g,t} + \epsilon_{g,t}$
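where $g$ is geography, $t$ is time, $T_{g,t}$ is the treatment indicator for that geography-time cell, $\beta$ is the treatment effect of interest, $\alpha_g$ controls persistent market differences, and $\gamma_t$ controls common time shocks.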
Clustered standard errors are usually required because observations within a market-time cell are correlated. If treatment is assigned at zone × hour, uncertainty should be clustered at the randomization level or higher. Treating millions of orders as independent can massively overstate significance.
Power calculations should use the effective sample size of randomized cells, not raw orders. A test with 50 markets over 14 days and 4 dayparts has roughly $50 \times 14 \times 4 = 2{,}800$ assignment cells before accounting for autocorrelation, imbalance, and clustering.
Stratification and blocking improve precision. Randomize within market, day of week, and daypart so treatment and control both cover lunch, dinner, weekdays, weekends, high-demand zones, and low-demand zones. This is especially important when demand is highly seasonal or weather-sensitive.
Carryover effects can bias switchbacks if treatment changes future states. For example, a bad delivery experience in a treated period may reduce future orders in a control period, or Dasher positioning from one hour may affect the next hour. Use washout periods or coarser time blocks when carryover is plausible.
Noncompliance and partial exposure should be measured explicitly. If a batching algorithm is “on” but only affects 20% of eligible orders, analyze both intent-to-treat and exposure-adjusted effects carefully. ITT preserves randomization; treatment-on-treated estimates need stronger assumptions.
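The gap between intent-to-treat and treatment-on-treated can be sketched with a Wald-style ratio. This is a minimal illustration, not a prescribed method: the function name, the field names, and the numbers are hypothetical. It assumes that assignment changes the outcome only by changing exposure, which is the stronger assumption the exposure-adjusted estimate relies on.

```python
def itt_and_tot(cells):
    """cells: list of dicts with 'assigned' (0/1), 'exposure_rate', 'outcome'.

    Returns (itt, exposure_gap, tot). All field names are hypothetical.
    """
    treated = [c for c in cells if c["assigned"] == 1]
    control = [c for c in cells if c["assigned"] == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control cell")

    def avg(rows, key):
        return sum(r[key] for r in rows) / len(rows)

    # ITT: compare cells by assignment only, which preserves randomization.
    itt = avg(treated, "outcome") - avg(control, "outcome")
    # How much more exposure treated cells actually received.
    exposure_gap = avg(treated, "exposure_rate") - avg(control, "exposure_rate")
    if exposure_gap == 0:
        raise ValueError("assignment did not change exposure")
    # Wald ratio: valid only if assignment affects outcomes through exposure alone.
    tot = itt / exposure_gap
    return itt, exposure_gap, tot


# Made-up cell-level data: delivery_time in minutes, 20% of orders batched when "on".
example_cells = [
    {"assigned": 1, "exposure_rate": 0.2, "outcome": 30.0},
    {"assigned": 1, "exposure_rate": 0.2, "outcome": 31.0},
    {"assigned": 0, "exposure_rate": 0.0, "outcome": 32.0},
    {"assigned": 0, "exposure_rate": 0.0, "outcome": 33.0},
]
itt, exposure_gap, tot = itt_and_tot(example_cells)
print(itt, exposure_gap, tot)
```

With a 20% exposure rate, the exposure-adjusted effect is five times the ITT effect, which is why the two numbers should always be reported together with the exposure rate.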

Worked example

For the question “Design and analyze a switchback experiment,” a strong candidate would start by clarifying the intervention, the marketplace surface it affects, and the interference radius: “Is this changing dispatch, pricing, batching, or consumer experience, and does it affect shared Dasher supply within a zone?” They would state that a user-level A/B test is likely invalid if the treatment changes supply allocation or delivery timing, because treated and control orders compete for the same Dashers.

The answer should be organized around four pillars: randomization design, metric framework, statistical analysis, and operational risks. For design, propose randomizing zone × time block, such as zone-by-2-hour windows, stratified by market, day of week, and daypart. For metrics, define a primary marketplace metric like contribution_profit_per_order or orders_per_consumer_session, plus guardrails such as delivery_time, cancellation_rate, Dasher_earnings, and merchant_lateness.
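The stratified zone × time assignment described above could be sketched as follows. This is one possible scheme under stated assumptions, not a description of any production system: the function name, market and zone identifiers, dates, and daypart labels are all hypothetical, and each stratum is simply split in half at random.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def assign_switchback(zones_by_market, dates, dayparts, seed=0):
    """Return {(zone, date, daypart): 0 or 1}, balanced within each
    market x day-of-week x daypart stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for market, zones in zones_by_market.items():
        for zone in zones:
            for d in dates:
                for part in dayparts:
                    strata[(market, d.weekday(), part)].append((zone, d, part))

    assignment = {}
    for key in sorted(strata):          # fixed order keeps the seed reproducible
        cells = strata[key]
        rng.shuffle(cells)
        half = len(cells) // 2          # odd strata get one extra control cell
        for i, cell in enumerate(cells):
            assignment[cell] = 1 if i < half else 0
    return assignment


# Hypothetical setup: 2 markets, 2 zones each, 14 days, 4 dayparts.
zones_by_market = {"market_a": ["a1", "a2"], "market_b": ["b1", "b2"]}
dates = [date(2024, 1, 1) + timedelta(days=i) for i in range(14)]
dayparts = ["breakfast", "lunch", "dinner", "late_night"]

assignment = assign_switchback(zones_by_market, dates, dayparts, seed=7)
print(len(assignment), "cells,", sum(assignment.values()), "treated")
```

Blocking this way guarantees that treatment and control both cover every market, weekday, and daypart, so a rainy Friday dinner rush cannot land entirely in one arm.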

For analysis, aggregate outcomes to the assignment cell and estimate a regression with zone fixed effects and time fixed effects, using standard errors clustered by zone or assignment cell. A specific tradeoff to flag is time-block length: shorter blocks create more randomized units and power, but increase carryover risk because Dasher locations and batching queues persist across adjacent periods. Longer blocks reduce carryover but create fewer independent observations and may be more exposed to demand shocks.
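A minimal sketch of that analysis on cell-level data, using only the standard library. It assumes a balanced panel (every zone observed in every time block), in which case subtracting zone means and block means is equivalent to including zone and time fixed effects. The standard error is a zone-clustered sandwich estimate with no small-sample correction. The function name and the example data are hypothetical.

```python
from math import sqrt


def twfe_effect(y, t):
    """y, t: dicts keyed by (zone, block) on a balanced panel.

    Returns (beta, se_clustered_by_zone).
    """
    zones = sorted({z for z, _ in y})
    blocks = sorted({b for _, b in y})

    def demean(x):
        # Two-way within transformation: remove zone and block means.
        grand = sum(x.values()) / len(x)
        zm = {z: sum(x[(z, b)] for b in blocks) / len(blocks) for z in zones}
        bm = {b: sum(x[(z, b)] for z in zones) / len(zones) for b in blocks}
        return {(z, b): x[(z, b)] - zm[z] - bm[b] + grand
                for z in zones for b in blocks}

    y_d, t_d = demean(y), demean(t)
    sxx = sum(v * v for v in t_d.values())
    if sxx == 0:
        raise ValueError("treatment has no variation within zones and blocks")
    beta = sum(t_d[k] * y_d[k] for k in t_d) / sxx
    resid = {k: y_d[k] - beta * t_d[k] for k in t_d}
    # Sandwich "meat": squared sum of score contributions within each zone.
    meat = sum(sum(t_d[(z, b)] * resid[(z, b)] for b in blocks) ** 2
               for z in zones)
    return beta, sqrt(meat) / sxx


# Hypothetical panel: 3 zones x 4 two-hour blocks with a planted effect of 2.5.
pattern = {"z1": [1, 0, 1, 0], "z2": [0, 1, 0, 1], "z3": [1, 1, 0, 0]}
treatment = {(z, b): pattern[z][b] for z in pattern for b in range(4)}
zone_effect = {"z1": 10.0, "z2": 12.0, "z3": 9.0}
block_effect = [0.0, 1.5, -0.5, 2.0]
outcome = {(z, b): zone_effect[z] + block_effect[b] + 2.5 * treatment[(z, b)]
           for (z, b) in treatment}

beta, se = twfe_effect(outcome, treatment)
print(beta, se)  # beta is approximately 2.5 by construction; no noise was added
```

With only a handful of zones, clustered standard errors are themselves unreliable, so a real analysis would want many more clusters or a randomization-based test.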

A strong close would say: “If I had more time, I’d run pre-experiment simulations using historical order streams to estimate power, check balance, select block length, and stress-test sensitivity to clustering and carryover.”
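One simple version of such a simulation is a placebo (A/A) re-randomization: assign fake treatment labels to historical cells many times, record the spurious differences, and convert their spread into a minimum detectable effect. The sketch below makes several simplifying assumptions: the function name is hypothetical, the "historical" outcomes are synthetic stand-ins, and plain shuffling ignores stratification, autocorrelation, and carryover, which a realistic replay would need to include.

```python
import random
from statistics import NormalDist, pstdev


def placebo_mde(historical_outcomes, n_sims=2000, alpha=0.05, power=0.8, seed=0):
    """Return (placebo_se, mde) from random half/half splits of past cells."""
    rng = random.Random(seed)
    half = len(historical_outcomes) // 2
    diffs = []
    for _ in range(n_sims):
        shuffled = historical_outcomes[:]
        rng.shuffle(shuffled)
        fake_treated, fake_control = shuffled[:half], shuffled[half:]
        diffs.append(sum(fake_treated) / len(fake_treated)
                     - sum(fake_control) / len(fake_control))
    se = pstdev(diffs)  # spread of estimates when the true effect is zero
    # Standard normal-approximation multiplier for a two-sided test.
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return se, z * se


# Synthetic stand-in for 200 historical cell-level delivery times (minutes).
_gen = random.Random(1)
historical = [_gen.gauss(35.0, 5.0) for _ in range(200)]

se, mde = placebo_mde(historical, n_sims=500, seed=3)
print(f"placebo SE: {se:.3f} min, approx. MDE at 80% power: {mde:.3f} min")
```

Repeating this for several candidate block lengths shows how much power is gained from shorter blocks, which can then be weighed against the carryover risk.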

A second angle

For “Evaluate Impact of $1 Fee on Fast-Food Profitability,” the same interference logic applies, but the treatment is a pricing change rather than an operational algorithm. A naive consumer-level A/B test may appear reasonable, but if the fee reduces demand in some areas, it can free Dasher supply, improve ETAs, and indirectly affect control consumers nearby. The randomization unit could be consumer-level if the fee is small and supply effects are negligible, but market-time randomization is safer if demand shifts are expected to alter delivery speed or Dasher allocation.

The metric frame also changes: the primary metric should be something like profit_per_visitor or contribution_profit_per_session, not just fee revenue per order. The candidate should explicitly decompose impact into conversion loss, order mix changes, average basket size, delivery cost, refund/cancel rates, and repeat behavior.
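A two-factor version of that decomposition can be written down directly. The sketch below treats profit per session as conversion rate times contribution profit per order and splits the change into a conversion term, a per-order margin term, and an interaction term. The function name and all numbers are hypothetical, and a fuller breakdown would split the margin term further into basket size, delivery cost, and refund effects.

```python
def decompose_profit_per_session(conv_c, margin_c, conv_t, margin_t):
    """Split the change in (conversion x profit per order) into three parts.

    conv_*: orders per session; margin_*: contribution profit per order.
    Subscript c is control, t is treatment.
    """
    d_conv = conv_t - conv_c
    d_margin = margin_t - margin_c
    return {
        "total_change": conv_t * margin_t - conv_c * margin_c,
        "conversion_effect": d_conv * margin_c,   # fewer orders at the old margin
        "margin_effect": conv_c * d_margin,       # old order volume at the new margin
        "interaction": d_conv * d_margin,         # joint change in both factors
    }


# Made-up numbers: the fee lifts per-order profit but costs one point of conversion.
parts = decompose_profit_per_session(
    conv_c=0.20, margin_c=3.00, conv_t=0.19, margin_t=3.80
)
for name, value in parts.items():
    print(f"{name}: {value:+.3f}")
```

The three components sum to the total change by construction, so the candidate can state which force dominates rather than quoting only the net number.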

Common pitfalls

Pitfall: Saying “randomize users 50/50 and compare average order value” for dispatch, batching, or supply-sensitive changes.

That answer ignores interference. A better response explains why treated and control users share Dashers and restaurants, then proposes geo-time randomization or a switchback design that aligns assignment with the level where marketplace state is shared.

Pitfall: Optimizing one side of the marketplace while ignoring the others.

For example, adding bicycle Dashers might improve supply density and reduce cost in dense zones, but could worsen long-distance delivery times, Dasher earnings mix, or merchant pickup congestion. Strong answers define consumer, Dasher, merchant, and business metrics, then name the intended primary metric and guardrails.

Pitfall: Treating order-level rows as independent observations.

If 200,000 orders come from 500 randomized zone-hour cells, the experiment has closer to hundreds of independent units than hundreds of thousands. The right move is to analyze at the assignment-cell level or use fixed effects with clustered standard errors, then discuss power in terms of clusters and intraclass correlation.
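The arithmetic behind that claim can be sketched with the standard design-effect formula, $n_{\text{eff}} = n / (1 + (m - 1)\rho)$, where $m$ is the average number of orders per cell and $\rho$ is the intraclass correlation. The formula assumes roughly equal cell sizes, the function name is hypothetical, and the intermediate $\rho$ values in the loop are illustrative rather than measured.

```python
def effective_sample_size(n_orders, n_cells, icc):
    """Effective number of independent observations under clustering."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    orders_per_cell = n_orders / n_cells
    design_effect = 1 + (orders_per_cell - 1) * icc
    return n_orders / design_effect


# 200,000 orders in 500 zone-hour cells, for a range of assumed correlations.
for icc in (0.0, 0.05, 0.5, 1.0):
    n_eff = effective_sample_size(200_000, 500, icc)
    print(f"icc={icc:.2f} -> effective n ~ {n_eff:,.0f}")
```

At $\rho = 0$ the orders really are independent and all 200,000 count; at $\rho = 1$ every order in a cell carries the same information and only the 500 cells count. Real marketplaces sit in between, which is why the ICC has to be estimated from historical data before quoting power.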

Connections

Interviewers may pivot from this topic into causal inference, especially difference-in-differences, synthetic controls, CUPED, and heterogeneous treatment effects. They may also ask about marketplace metric design, pricing experimentation, ranking and dispatch evaluation, or diagnosing a launch where conversion_rate, ETA, and profit move in opposite directions.
