This question tests the ability to design a switchback (time-based A/B) experiment for a two-sided marketplace with spillovers and autocorrelated demand. It evaluates expertise in causal inference, experiment design under market-level treatment constraints, and variance reduction techniques such as CUPED — core competencies for data scientist roles in experimentation.
Design a switchback experiment for airport pickup pricing in a marketplace with spillovers. Choose the block length and rotation schedule using empirical autocorrelation and carryover: median trip duration = 22 minutes; demand ACF falls below 0.1 at 75 minutes; strong peaks every ~4 hours. Specify: (a) block length selection to minimize carryover yet retain power, (b) diagnostics to detect residual carryover (pre/post contrasts, lag terms), and (c) variance and sample-size computation under block randomization (include formulas and required inputs). Describe how you will randomize across time-of-day and days-of-week and how you will incorporate covariate adjustment (e.g., CUPED) to reduce variance.
Switchback Experiment Design: Airport Pickup Pricing with Spillovers
You are a data scientist designing a switchback (time-based A/B) experiment to evaluate a new airport pickup pricing algorithm in a two-sided ride-hailing marketplace. The airport has strong within-day seasonality and significant spillovers: drivers and the dispatch queue carry state across time, so a price change in one window affects supply and rider behavior in adjacent windows. Because of these spillovers, treatment must be assigned at the airport level, and only one arm (treatment or control) can be live at a time — you alternate the whole market between arms over time rather than splitting riders into two groups.
You are given the following empirical facts measured at this airport:
Median trip duration ≈22 minutes.
Demand autocorrelation function (ACF) falls below 0.1 at ≈75 minutes (i.e., demand is meaningfully self-correlated for a little over an hour).
Strong periodic demand peaks every ≈4 hours (driven by flight banks / arrival waves).
Design the experiment end to end. The question has three core parts plus two cross-cutting requirements: how you randomize across time-of-day and day-of-week, and how you use covariate adjustment (e.g., CUPED) to reduce variance.
Constraints & Assumptions
Single market, mutually exclusive arms. One airport queue; at any instant the market is either fully on treatment or fully on control. This rules out a standard user-split A/B test.
Spillover horizon. Carryover from one block can contaminate the next; the relevant timescales are trip duration (≈22 min, how long the system takes to "flush" in-progress state) and the demand ACF horizon (≈75 min).
No global control group. Effects are estimated from time contrasts between treatment and control blocks, so you must account for time-of-day and day-of-week confounding by design.
Primary metric is a continuous business outcome (e.g., revenue per request, completion rate, pickups per minute, or driver idle time); analyze one pre-registered primary metric at a time.
Assume you have historical data at this airport (per-minute / per-request demand, pricing, completions, weather, flight schedules) to estimate variance components and fit covariate models before launch.
Clarifying Questions to Ask
What is the primary success metric and its decision direction? (Revenue per request vs. completion rate vs. driver idle time can imply different washout and power needs.)
What minimum detectable effect (MDE) matters for a launch decision, in absolute or relative terms?
How long can the experiment run (days/weeks), and is continuous operation acceptable, or are there blackout periods?
Are multiple terminals served by one shared driver pool? (If so they must be treated as a single cluster, not independent markets.)
Are any other pricing or dispatch experiments running concurrently at this airport that would interfere with the same queue?
What is the operational minimum block length the pricing/dispatch system can switch on without instability?
Part 1 — Block length and rotation schedule
Choose a block length and rotation schedule that minimizes carryover yet retains statistical power. Justify your block length quantitatively from the three empirical facts (trip duration, ACF horizon, 4-hour periodicity), and explain the tension between long blocks (less carryover, cleaner measurement) and short blocks (more blocks, more power). Specify any post-switch washout/burn-in you would discard and why, and describe how the rotation sequence is generated and constrained.
What This Part Should Cover
A concrete block length decomposed into washout + measurement, each justified by a specific empirical fact.
Explicit reasoning about the 240-minute periodicity (why the block length should not divide it).
The power vs. carryover trade-off stated quantitatively (more blocks vs. cleaner blocks).
A rotation rule (balance target, run-length cap, how the sequence drifts across hours over days).
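As a rough illustration of the points above, the sketch below screens a few candidate block lengths against the three empirical facts. The 30-minute washout, the candidate list, and the helper name screen_block_length are assumptions made for this example, not values given in the question; the screening rules are one plausible reading of the rubric rather than a prescribed method.

```python
TRIP_MIN = 22            # median trip duration (from the prompt)
ACF_MIN = 75             # lag at which the demand ACF drops below 0.1 (from the prompt)
PERIOD_MIN = 240         # demand peak period, about 4 hours (from the prompt)
WEEK_MIN = 7 * 24 * 60   # minutes in one week


def screen_block_length(block_min, washout_min):
    """Summarize one candidate block length (illustrative rules only)."""
    return {
        "block_min": block_min,
        # Minutes per block that remain after discarding the washout.
        "measure_min": block_min - washout_min,
        # More blocks per week means more randomization units, hence more power.
        "blocks_per_week": WEEK_MIN // block_min,
        # True when block boundaries realign with every 240-minute peak cycle.
        "divides_period": PERIOD_MIN % block_min == 0,
        # Washout should at least outlast a typical in-progress trip.
        "washout_covers_trip": washout_min >= TRIP_MIN,
        # Block should span the horizon over which demand is self-correlated.
        "block_covers_acf": block_min >= ACF_MIN,
    }


# Hypothetical candidates and an assumed 30-minute washout.
candidates = [60, 80, 90, 100, 120]
report = [screen_block_length(b, washout_min=30) for b in candidates]

for row in report:
    print(row)
```

Under these assumptions, 90 and 100 minutes are the candidates that do not divide 240, so their boundaries fall at varying phases of the 4-hour peak cycle, while 60, 80, and 120 realign with every cycle. The blocks_per_week column makes the power side of the trade-off visible alongside the carryover checks.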
Part 2 — Diagnostics for residual carryover
Define the diagnostics you will run to detect residual carryover — i.e., contamination that survives your washout. Describe at least two complementary methods, the model/estimator behind each, and the specific signal that would tell you the washout is too short (and what you would do about it).
What This Part Should Cover
An event-time / pre-post profile around switch boundaries with the expected clean-vs-contaminated signatures.
A lagged-treatment regression spanning at least the dependence horizon, with an explicit null hypothesis on the lag coefficients.
A balance / randomization check (covariate standardized differences between T and C blocks) and a residual-autocorrelation check on block means.
A stated remediation when carryover is detected (increase washout/block length; switch to robust SEs).
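The lagged-treatment regression can be sketched at the block level as follows. The simulated data, the true effect sizes (2.0 direct, 0.8 carryover at lag 1), and the function names are hypothetical. The standard errors here are classical OLS ones for brevity; with autocorrelated block residuals a HAC/robust estimator would be the more defensible choice.

```python
import numpy as np


def lagged_treatment_design(assign, n_lags):
    """Columns: intercept, T_t, T_{t-1}, ..., T_{t-n_lags}; first n_lags blocks dropped."""
    assign = np.asarray(assign, dtype=float)
    n = len(assign)
    cols = [np.ones(n - n_lags)]
    for k in range(n_lags + 1):
        # Slice so that row i holds the assignment of block (n_lags + i - k).
        cols.append(assign[n_lags - k : n - k])
    return np.column_stack(cols)


def fit_lagged_ols(y, assign, n_lags):
    """OLS of block outcomes on current and lagged assignment; returns (beta, se)."""
    X = lagged_treatment_design(assign, n_lags)
    y = np.asarray(y, dtype=float)[n_lags:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


# Hypothetical simulation: 400 blocks, direct effect 2.0, carryover 0.8 at lag 1.
rng = np.random.default_rng(0)
assign = rng.integers(0, 2, size=400)
prev = np.concatenate([[0], assign[:-1]])
y = 10.0 + 2.0 * assign + 0.8 * prev + rng.normal(0.0, 1.0, size=400)

beta, se = fit_lagged_ols(y, assign, n_lags=2)
# beta[1] is the direct effect; beta[2] and beta[3] are carryover at lags 1 and 2.
# Null hypothesis for a sufficient washout: every lag coefficient equals zero.
z_lags = beta[2:] / se[2:]
```

A lag coefficient that is clearly nonzero is the signal that contamination survives the washout; in this simulated example that is built in at lag 1 by construction.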
Part 3 — Variance and sample-size computation
Provide the variance estimator and sample-size formula under block randomization. Give the explicit formulas, list the required inputs (and where each comes from), and show how within-block dependence and CUPED enter the calculation. A short numeric illustration of the method is welcome.
What This Part Should Cover
The two-sample sample-size formula and the matching variance-of-the-effect estimator at the block level.
A derivation of between-block variance $\sigma_b^2$ from unit-level variance, observations-per-block $m$, and within-block correlation $\bar{\rho}$ (a design-effect argument).
The explicit list of required inputs and their sources (historical baseline variance, $m$, ICC/ACF, MDE $\delta$, CUPED $R^2$, $\alpha$, power).
Treatment of residual block-level dependence (HAC/Newey–West) and how CUPED reduces the required $n$.
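One common way to assemble these pieces is the textbook normal-approximation calculation sketched below: a design-effect form for the block-level variance, $\sigma_b^2 = \sigma^2 [1 + (m - 1)\bar{\rho}] / m$, fed into the two-sample formula $n = 2 (z_{1-\alpha/2} + z_{\text{power}})^2 \sigma_b^2 (1 - R^2) / \delta^2$ blocks per arm. All numeric inputs are illustrative placeholders rather than values from the question, and the sketch assumes equal allocation and independent blocks after washout.

```python
from math import ceil
from statistics import NormalDist


def block_variance(sigma2_unit, m, rho_bar):
    """Variance of a block mean of m equicorrelated units (design-effect form)."""
    return sigma2_unit * (1 + (m - 1) * rho_bar) / m


def blocks_per_arm(sigma2_block, delta, alpha=0.05, power=0.80, cuped_r2=0.0):
    """Two-sided, equal-allocation block count per arm (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # CUPED shrinks the outcome variance by the factor (1 - R^2).
    n = 2 * (z_alpha + z_power) ** 2 * sigma2_block * (1 - cuped_r2) / delta ** 2
    return ceil(n)


# Illustrative inputs only (assumed, not taken from the question):
sigma2_unit = 4.0   # unit-level variance from historical data
m = 100             # observations per measurement window
rho_bar = 0.05      # average within-block correlation
delta = 0.2         # minimum detectable effect, absolute

sigma2_b = block_variance(sigma2_unit, m, rho_bar)
n_plain = blocks_per_arm(sigma2_b, delta)
n_cuped = blocks_per_arm(sigma2_b, delta, cuped_r2=0.3)
print(sigma2_b, n_plain, n_cuped)
```

With these placeholder inputs the block variance is about 0.238, giving 94 blocks per arm without adjustment and 66 with a CUPED $R^2$ of 0.3. Residual dependence between block means would inflate these counts, which is where a HAC-based variance estimate enters.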
What a Strong Answer Covers
Across all parts, a strong candidate ties every design choice back to the three empirical facts and the spillover structure, and keeps the estimand and unit of randomization (the block) consistent from design through analysis. Beyond the per-part rubrics, look for:
Time confounding handled by design, not luck: an explicit randomization scheme across time-of-day and day-of-week (e.g., stratify by hour-of-week, $24 \times 7 = 168$ strata; rerandomize to balance T/C within strata) so neither arm is over-exposed to peaks or weekends.
Internal consistency: the washout chosen in Part 1, the lag horizon tested in Part 2, and the ICC/ACF used in Part 3 all reference the same timescales coherently.
CUPED integrated correctly: pre-period covariates at the same hour-of-week, $\theta$ estimated out-of-sample / via cross-fitting, variance-reduction $R^2$ flowed back into the power calculation.
Pragmatism and guardrails: one arm live at a time, no overlapping experiments on the same queue, terminals sharing drivers treated as one cluster, and a pre-registered analysis plan (metric, SEs, seed).
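A minimal sketch of the cross-fitted CUPED step, at the block level, might look like the following. It assumes x is the pre-period value of the metric at the same hour-of-week as each block; the simulated data, the two-fold split, and the function name cuped_crossfit are hypothetical choices for illustration.

```python
import numpy as np


def cuped_crossfit(y, x, n_folds=2, seed=0):
    """CUPED adjustment where each block's theta comes from the other folds."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Random, near-equal fold labels in {0, ..., n_folds - 1}.
    folds = rng.permutation(len(y)) % n_folds
    x_bar = x.mean()
    y_adj = np.empty_like(y)
    for f in range(n_folds):
        held_out = folds == f
        train = ~held_out
        # theta = Cov(y, x) / Var(x), estimated without the held-out blocks.
        theta = np.cov(y[train], x[train])[0, 1] / np.var(x[train], ddof=1)
        y_adj[held_out] = y[held_out] - theta * (x[held_out] - x_bar)
    return y_adj


# Hypothetical block-level data: outcome strongly related to the pre-period covariate.
rng = np.random.default_rng(1)
x = rng.normal(50.0, 5.0, size=500)
y = 20.0 + 0.9 * x + rng.normal(0.0, 2.0, size=500)

y_adj = cuped_crossfit(y, x)
# Approximates 1 - R^2, the factor that scales the required number of blocks.
variance_ratio = y_adj.var(ddof=1) / y.var(ddof=1)
```

The resulting variance_ratio is the quantity that would flow back into the power calculation; the treatment-vs-control contrast is then computed on y_adj rather than y.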
Follow-up Questions
A standard rider-split A/B test is operationally simpler than a switchback. Given the spillovers described here, explain precisely why it is biased and when (if ever) you would still prefer it. (References the spillover premise across all parts.)
Suppose your Part 2 diagnostics show statistically significant carryover at lag 1 but the effect is small. Would you stop, lengthen the washout mid-run, or model it explicitly, and how does each choice affect the Part 3 power calculation?
How would you extend this single-airport design to simultaneously test across many airports, and what new sources of interference or heterogeneity would you have to control for?
If the treatment effect is suspected to be heterogeneous by time-of-day (e.g., larger during peaks), how would you change the analysis to estimate that, and what does it imply for the rollout decision?