Switchback Experiments And Marketplace Interference

What's being tested

Interviewers are probing whether you can design and analyze causal experiments in two-sided marketplaces where one user’s treatment can affect another user’s outcome. At Uber, rider-facing, driver-facing, pricing, dispatch, and ETA changes often violate the standard no-interference assumption because drivers, riders, and trips compete for shared local supply. A strong Data Scientist must identify the right randomization unit, define a defensible estimand, choose metrics that reflect marketplace equilibrium, and explain how spillovers, carryover, noncompliance, and variance affect inference. The bar is not memorizing “use a switchback”; it is knowing when user-level A/B testing fails and how to recover credible causal evidence.

Core knowledge

SUTVA, the stable unit treatment value assumption, fails when one unit’s outcome depends on others’ treatment assignment. In rideshare, treating some riders with lower displayed ETA or some drivers with a supply push can change matching, prices, wait times, and availability for untreated users in the same market.
Interference should be handled by choosing units that contain the spillover. Common choices are market-level randomization, geo-cluster randomization, or switchback randomization by city-time block. User-level randomization is risky when supply, demand, dispatch, or pricing decisions interact through shared marketplace state.
Switchback experiments randomize treatment over time within a market, for example city-hour blocks that are each randomly assigned to control or treatment. They are useful when market-level randomization has too few cities, but require careful handling of time trends, autocorrelation, seasonality, and carryover between adjacent blocks.
Block length should exceed the system’s equilibration and carryover window. If a dispatch or pricing change affects driver positioning for 20–30 minutes, a 15-minute block is contaminated for its entire duration by the previous block’s treatment; a 60–120 minute block with guard periods may be more credible. Longer blocks reduce carryover bias but yield fewer blocks, and therefore less power.
Guard periods are observations excluded near treatment switches to reduce contamination. If treatment changes at 10:00 and matching dynamics stabilize by 10:10, exclude the first 10 minutes of each block from analysis. This sacrifices precision for lower bias, which is usually worth it when carryover is plausible.
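A minimal sketch of guard-period filtering, assuming fixed-length blocks that start at the experiment start time. The helper name `in_guard_period`, the 60-minute block, the 10-minute guard, and the event times are all illustrative assumptions rather than anything prescribed above.

```python
from datetime import datetime, timedelta

def in_guard_period(event_time, experiment_start, block_minutes, guard_minutes):
    """True if the event falls within the first guard_minutes of its block."""
    elapsed_min = (event_time - experiment_start).total_seconds() / 60.0
    if elapsed_min < 0:
        raise ValueError("event precedes the experiment start")
    return (elapsed_min % block_minutes) < guard_minutes

# Hypothetical trip request times, in minutes after an 08:00 experiment start.
start = datetime(2024, 1, 1, 8, 0)
offsets = [0, 5, 10, 59, 60, 69, 70, 125]
events = [start + timedelta(minutes=m) for m in offsets]

# Keep only events outside the first 10 minutes of each 60-minute block.
kept = [e for e in events
        if not in_guard_period(e, start, block_minutes=60, guard_minutes=10)]
kept_offsets = [int((e - start).total_seconds() // 60) for e in kept]
```

The filter has to use the same window in every block, treated or control, so that dropping observations cannot itself create a difference between arms.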
Estimands must be explicit. Intent-to-treat estimates the effect of assignment:
$ITT = E[Y \mid Z=1] - E[Y \mid Z=0]$
where $Z$ is randomized assignment. Treatment-on-treated or local average treatment effect may be relevant when actual exposure $D$ differs from assignment.
Instrumental variables are useful under noncompliance when random assignment affects treatment received but not outcomes except through treatment. The Wald estimand is:
$LATE = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}$
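This requires relevance (assignment changes exposure), the exclusion restriction (assignment affects outcomes only through exposure), monotonicity (no defiers), and independence of assignment from potential outcomes, which randomization provides.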
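The Wald ratio can be sketched in a few lines. The function name `wald_late` and the toy data are assumptions for illustration; in a switchback, each element would more plausibly be a block-level aggregate than an individual rider.

```python
from statistics import mean

def wald_late(y, d, z):
    """Return (ITT on outcome, first stage, Wald estimate of LATE)."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    d1 = [di for di, zi in zip(d, z) if zi == 1]
    d0 = [di for di, zi in zip(d, z) if zi == 0]
    itt = mean(y1) - mean(y0)            # effect of assignment on the outcome
    first_stage = mean(d1) - mean(d0)    # effect of assignment on exposure
    if first_stage == 0:
        raise ZeroDivisionError("assignment did not change exposure")
    return itt, first_stage, itt / first_stage

# Toy data: one assigned unit was never actually exposed (one-sided noncompliance).
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 0, 0, 0, 0]
y = [0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25]

itt, first_stage, late = wald_late(y, d, z)
```

A small first stage inflates the ratio and its variance, so reporting the ITT alongside the LATE keeps the policy-relevant number visible.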
Cluster-robust inference is usually required because observations within a city-time block are correlated. Treating every trip as independent will understate standard errors. Analyze at the randomized unit level when possible, or use cluster-robust standard errors, block bootstrap, or randomization inference.
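As one illustration of randomization inference, the sketch below computes an exact two-sided p-value for a difference in block-level means by enumerating every possible assignment. It assumes complete randomization with a fixed number of treated blocks; a stratified design would permute only within strata. The function name and the block outcomes are hypothetical.

```python
from itertools import combinations
from statistics import mean

def randomization_p_value(block_outcomes, treated_idx):
    """Exact two-sided p-value for the difference in block means under
    complete randomization with a fixed number of treated blocks."""
    n = len(block_outcomes)
    k = len(treated_idx)

    def diff(treated):
        treated = set(treated)
        t = [block_outcomes[i] for i in range(n) if i in treated]
        c = [block_outcomes[i] for i in range(n) if i not in treated]
        return mean(t) - mean(c)

    observed = diff(treated_idx)
    # Re-compute the statistic under every assignment that could have occurred.
    draws = [diff(combo) for combo in combinations(range(n), k)]
    extreme = sum(1 for d in draws if abs(d) >= abs(observed) - 1e-12)
    return observed, extreme / len(draws)

# Hypothetical completed trips per block; indices in `treated` were assigned treatment.
blocks = [120, 131, 118, 140, 125, 137, 129, 122, 135, 127]
treated = [1, 3, 5, 8, 9]
observed, p_value = randomization_p_value(blocks, treated)
```

Full enumeration is only feasible for a small number of blocks; with hundreds of blocks, a random sample of assignments from the same mechanism is the usual substitute.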
Marketplace metrics should include both local and global effects. For Uber-like systems, consider completed_trips, conversion_rate, request_to_pickup_time, driver_utilization, surge_multiplier, cancellation_rate, gross_bookings, earnings_per_online_hour, and rider/driver guardrails. Avoid optimizing one side while silently harming the other.
Variance reduction matters because switchbacks often have fewer independent units than trip-level data suggests. Use pre-period covariates, fixed effects for market and time, difference-in-differences style adjustment, or CUPED-like residualization:
$Y_i^{adj} = Y_i - \theta(X_i - \bar X)$
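where $X_i$ is a predictive pre-treatment metric and $\theta = \mathrm{Cov}(Y, X) / \mathrm{Var}(X)$ is the coefficient that minimizes the variance of $Y_i^{adj}$.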
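A minimal sketch of this adjustment, with $\theta$ estimated as the sample covariance of $X$ and $Y$ divided by the sample variance of $X$, pooled across both arms. The function name `cuped_adjust` and the block-level numbers are illustrative assumptions.

```python
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and theta = Cov(x, y) / Var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta

# Hypothetical block-level data: x is the same metric from a pre-period, y the outcome.
x = [100, 110, 95, 130, 120, 105]
y = [104, 113, 99, 131, 125, 108]

adjusted, theta = cuped_adjust(y, x)
variance_ratio = variance(adjusted) / variance(y)  # below 1 when x predicts y
```

Because $X$ is measured before assignment, the adjustment leaves the expected treatment effect unchanged and only removes variance that the covariate explains.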
Power calculations should be based on independent randomized blocks, not raw events. A city with millions of trips may still have only a few hundred usable city-hour treatment assignments. Autocorrelation, day-of-week effects, and cluster imbalance can dominate nominal sample size.
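A rough block-level power calculation might look like the sketch below. It assumes independent blocks, equal allocation, and a normal approximation for a two-sided difference in means; the standard deviation, minimum detectable effect, and experiment length are made-up inputs, and autocorrelation between blocks would change the answer.

```python
from math import ceil
from statistics import NormalDist

def blocks_per_arm(sd_block, mde, alpha=0.05, power=0.8):
    """Blocks needed per arm to detect a difference of `mde` in block means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd_block / mde) ** 2)

# Hypothetical inputs: block-level standard deviation of 8, target effect of 2.
needed = blocks_per_arm(sd_block=8.0, mde=2.0)

# Four weeks of 2-hour blocks in one city: 4 * 7 * 12 = 336 blocks, half per arm.
available_per_arm = (4 * 7 * 12) // 2
underpowered = needed > available_per_arm
```

Under these assumed inputs the design falls short of the required number of blocks even though the city may log millions of trips over the same four weeks.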
Partial interference assumes spillovers occur within clusters but not across clusters. This can be plausible for city-level marketplaces, but weaker near geographic borders or airports. If clusters are too small, spillovers cross boundaries; if too large, the experiment may lack power.

Worked example

For “Design a switchback and choose block length,” a strong candidate would first clarify the treatment, the affected marketplace mechanism, and the expected lag: “Is this changing dispatch, pricing, driver incentives, or rider UI, and how quickly could drivers reposition or riders change request behavior?” They would state that user-level randomization is likely invalid if treated and untreated users share the same driver pool, so the natural unit is a market-time block such as city-hour or geo-hour. The answer would then be organized around four pillars: define the estimand and primary metrics, choose the randomization structure, handle carryover and seasonality, and specify the analysis plan.

A good design might randomize treatment by city and 2-hour blocks, stratified by hour-of-week so Friday evening treatment blocks are balanced against Friday evening control blocks. The candidate should explicitly mention guard periods, such as dropping the first 10–15 minutes after each switch if dispatch or driver location effects persist. For analysis, they might aggregate outcomes to block level and estimate a regression like $Y_{ct} = \alpha_c + \gamma_{hourweek} + \tau Z_{ct} + \epsilon_{ct}$ with cluster-robust or randomization-based inference. One key tradeoff is that shorter blocks increase the number of assignments and statistical power, while longer blocks reduce bias from carryover and allow the marketplace to equilibrate. A strong close would be: “If I had more time, I would run a pre-experiment using historical data to estimate autocorrelation and simulate power under different block lengths and guard-period choices.”
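The stratification by hour-of-week described above can be sketched as a schedule generator. This is one possible scheme under stated assumptions: a single city, an even number of weeks, and exact balance within each block-of-week slot. The function name and the seed are hypothetical.

```python
import random

def stratified_switchback_schedule(n_weeks, block_hours=2, seed=0):
    """Map (week, slot) -> 0/1 so each block-of-week slot is treated
    in exactly half of the weeks."""
    if n_weeks % 2 != 0:
        raise ValueError("use an even number of weeks for exact balance")
    if 168 % block_hours != 0:
        raise ValueError("block_hours must divide the 168 hours in a week")
    rng = random.Random(seed)
    slots_per_week = 168 // block_hours
    schedule = {}
    for slot in range(slots_per_week):
        # Within one slot (e.g. Friday 18:00-20:00), shuffle a balanced set of arms.
        arms = [1] * (n_weeks // 2) + [0] * (n_weeks // 2)
        rng.shuffle(arms)
        for week, arm in enumerate(arms):
            schedule[(week, slot)] = arm
    return schedule

schedule = stratified_switchback_schedule(n_weeks=4)
```

With this structure, every Friday-evening treatment block has a Friday-evening control block in another week, so the hour-of-week fixed effect in the regression is estimated from balanced cells.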

A second angle

For “Design an ETA experiment under interference,” the same causal issue appears, but the treatment is more likely rider-facing and may involve partial compliance or perception effects. If only some riders see a modified ETA, user-level randomization can still contaminate outcomes because treated riders may request trips that take drivers who would otherwise have served untreated riders. The candidate should consider market-time randomization or at least geo-cluster randomization and define whether the estimand is the effect of assigned ETA display or the effect of actually changing rider beliefs and behavior. Metrics would include request_conversion, rider_cancellation_rate, actual_wait_time, pickup_distance, and driver-side guardrails such as utilization. If assignment changes displayed ETA but not every rider notices or acts on it, instrumental variables can be introduced to estimate the effect among compliers, while still reporting the policy-relevant ITT.

Common pitfalls

Pitfall: Treating trips or users as independent when the randomized unit is market-time.

A tempting but wrong analysis is to run a trip-level t-test over millions of observations and claim extremely tight confidence intervals. That ignores within-block correlation and marketplace spillovers. A better answer aggregates to randomized units or uses cluster-robust/randomization-based inference aligned with the assignment mechanism.
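A small simulation makes the gap concrete. The sketch below assumes 20 blocks of 200 trips each, a shared block-level shock, no true treatment effect, and illustrative noise scales; none of these numbers come from real data.

```python
import random
from math import sqrt
from statistics import mean, variance

def se_diff(a, b):
    """Standard error of a difference in means, treating elements as independent."""
    return sqrt(variance(a) / len(a) + variance(b) / len(b))

rng = random.Random(7)
arms = [0, 1] * 10          # 10 control blocks and 10 treatment blocks
rng.shuffle(arms)           # random order of assignments across the 20 blocks

trip_outcomes = {0: [], 1: []}
block_means = {0: [], 1: []}
for z in arms:
    block_shock = rng.gauss(0, 5)   # shared by every trip in the block
    ys = [10 + block_shock + rng.gauss(0, 1) for _ in range(200)]
    trip_outcomes[z].extend(ys)
    block_means[z].append(mean(ys))

naive_se = se_diff(trip_outcomes[0], trip_outcomes[1])   # 2,000 "units" per arm
block_se = se_diff(block_means[0], block_means[1])       # 10 units per arm
```

The trip-level standard error comes out far smaller than the block-level one because it counts 2,000 trips per arm as independent when only 10 blocks per arm were actually randomized.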

Pitfall: Saying “randomize by city” without discussing power or heterogeneity.

City-level randomization reduces interference but may leave only a small number of independent units, and cities differ substantially in demand patterns, regulation, airport share, and driver behavior. A stronger candidate explains the tradeoff between contamination and power, then proposes stratification, switchbacks, fixed effects, or matched markets as appropriate.

Pitfall: Jumping to instrumental variables without defending assumptions.

For IV, it is not enough to say assignment is random. You must argue that assignment strongly changes treatment received, affects outcomes only through that treatment, and does not create direct behavioral or marketplace effects outside the exposure channel. In Uber settings, the exclusion restriction is often the hardest assumption to defend because assignment can alter supply-demand balance for everyone in the cluster.

Connections

Interviewers may pivot from here to difference-in-differences, cluster randomized trials, power analysis, CUPED variance reduction, or network experiments. They may also ask how the same reasoning applies to driver incentives, surge pricing, dispatch algorithms, push notifications, ranking, or recommender evaluation in a marketplace.
