Design a switchback and choose block length

Q: Design a switchback and choose block length

This question tests the ability to design a switchback (time-based A/B) experiment for a two-sided marketplace with spillovers and autocorrelated demand. It evaluates expertise in causal inference, experiment design under market-level treatment constraints, and variance reduction techniques such as CUPED — core competencies for data scientist roles in experimentation.

Q: What difficulty level is this interview question?

This is a hard difficulty Analytics & Experimentation question, commonly asked during Technical Screen rounds at Uber.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Uber during technical interviews.

Question

Switchback Experiment Design: Airport Pickup Pricing with Spillovers

You are a data scientist designing a switchback (time-based A/B) experiment to evaluate a new airport pickup pricing algorithm in a two-sided ride-hailing marketplace. The airport has strong within-day seasonality and significant spillovers: drivers and the dispatch queue carry state across time, so a price change in one window affects supply and rider behavior in adjacent windows. Because of these spillovers, treatment must be assigned at the airport level, and only one arm (treatment or control) can be live at a time — you alternate the whole market between arms over time rather than splitting riders into two groups.

You are given the following empirical facts measured at this airport:

Median trip duration $\approx 22$ minutes.
Demand autocorrelation function (ACF) falls below $0.1$ at $\approx 75$ minutes (i.e., demand is meaningfully self-correlated for a little over an hour).
Strong periodic demand peaks every $\approx 4$ hours (driven by flight banks / arrival waves).

Design the experiment end to end. The question has three core parts plus two cross-cutting requirements: how you randomize across time-of-day and day-of-week, and how you use covariate adjustment (e.g., CUPED) to reduce variance.

Constraints & Assumptions

Single market, mutually exclusive arms. One airport queue; at any instant the market is either fully on treatment or fully on control. This rules out a standard user-split A/B test.
Spillover horizon. Carryover from one block can contaminate the next; the relevant timescales are trip duration ($\approx 22$ min, how long the system takes to "flush" in-progress state) and the demand ACF horizon ($\approx 75$ min).
No global control group. Effects are estimated from time contrasts between treatment and control blocks, so you must account for time-of-day and day-of-week confounding by design.
Primary metric is a continuous business outcome (e.g., revenue per request, completion rate, pickups per minute, or driver idle time) — analyze one pre-registered primary metric at a time.
Assume you have historical data at this airport (per-minute / per-request demand, pricing, completions, weather, flight schedules) to estimate variance components and fit covariate models before launch.

Clarifying Questions to Ask

What is the primary success metric and its decision direction? (Revenue per request vs. completion rate vs. driver idle time can imply different washout and power needs.)
What minimum detectable effect (MDE) matters for a launch decision, in absolute or relative terms?
How long can the experiment run (days/weeks), and is continuous operation acceptable, or are there blackout periods?
Are multiple terminals served by one shared driver pool? (If so they must be treated as a single cluster, not independent markets.)
Are any other pricing or dispatch experiments running concurrently at this airport that would interfere with the same queue?
What is the operational minimum block length the pricing/dispatch system can switch on without instability?

Part 1 — Block length and rotation schedule

Choose a block length and rotation schedule that minimizes carryover yet retains statistical power. Justify your block length quantitatively from the three empirical facts (trip duration, ACF horizon, 4-hour periodicity), and explain the tension between long blocks (less carryover, cleaner measurement) and short blocks (more blocks, more power). Specify any post-switch washout/burn-in you would discard and why, and describe how the rotation sequence is generated and constrained.

What This Part Should Cover

A concrete block length decomposed into washout + measurement, each justified by a specific empirical fact.
Explicit reasoning about the 240-minute periodicity: why the block length should not evenly divide it, so that block boundaries do not land on the same phase of every peak cycle.
The power vs. carryover trade-off stated quantitatively (more blocks vs. cleaner blocks).
A rotation rule (balance target, run-length cap, how the sequence drifts across hours over days).
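
The checklist above can be sketched in Python as follows. This is a minimal illustration, not a prescribed design: the 30-minute washout, the candidate lengths, the run-length cap of 3, and the helper names `candidate_blocks` and `daily_rotation` are assumptions introduced here. Only the 22-, 75-, and 240-minute figures come from the question.

```python
import random

TRIP_MIN = 22          # median trip duration (given in the question)
ACF_HORIZON_MIN = 75   # demand ACF falls below 0.1 (given in the question)
PEAK_PERIOD_MIN = 240  # flight-bank periodicity (given in the question)


def candidate_blocks(washout_min, lengths):
    """Keep block lengths that span the ACF horizon and do not divide the peak period."""
    if washout_min < TRIP_MIN:
        raise ValueError("washout should cover at least one median trip")
    kept = []
    for b in lengths:
        if b < ACF_HORIZON_MIN or PEAK_PERIOD_MIN % b == 0:
            continue  # too short, or locked to the same phase of every peak cycle
        kept.append({
            "block_min": b,
            "measure_min": b - washout_min,  # minutes kept after discarding washout
            "blocks_per_day": 1440 / b,
        })
    return kept


def daily_rotation(n_blocks, max_run, seed):
    """Balanced random T/C sequence with no more than max_run equal arms in a row.

    Uses rejection sampling, which is adequate for one day of blocks.
    """
    rng = random.Random(seed)  # fixed seed so the schedule can be pre-registered
    arms = ["T"] * (n_blocks // 2) + ["C"] * (n_blocks - n_blocks // 2)
    while True:
        rng.shuffle(arms)
        run = longest = 1
        for prev, cur in zip(arms, arms[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= max_run:
            return list(arms)


print(candidate_blocks(30, [60, 80, 90, 120]))
print("".join(daily_rotation(16, max_run=3, seed=7)))
```

With these assumed inputs, only the 90-minute block survives (30 minutes of washout plus 60 minutes of measurement, 16 blocks per day): 60 is shorter than the ACF horizon, and 80 and 120 both divide 240. Drawing a fresh sequence per day with a different seed lets each arm drift across hours over the course of the experiment.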

Part 2 — Diagnostics for residual carryover

Define the diagnostics you will run to detect residual carryover — i.e., contamination that survives your washout. Describe at least two complementary methods, the model/estimator behind each, and the specific signal that would tell you the washout is too short (and what you would do about it).

What This Part Should Cover

An event-time / pre-post profile around switch boundaries with the expected clean-vs-contaminated signatures.
A lagged-treatment regression spanning at least the dependence horizon, with an explicit null hypothesis on the lag coefficients.
A balance / randomization check (covariate standardized differences between T and C blocks) and a residual-autocorrelation check on block means.
A stated remediation when carryover is detected (increase washout/block length; switch to robust SEs).
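
As a simplified stand-in for the lagged-treatment regression, the sketch below computes a lag-1 contrast: within each current arm, it compares block means that follow a treatment block with those that follow a control block. The function name `lag1_carryover_contrast` and the use of post-washout block means as inputs are assumptions made for illustration; a full analysis would include more lags and robust standard errors.

```python
from math import sqrt
from statistics import mean, variance


def lag1_carryover_contrast(arms, block_means):
    """Estimate the effect of the PREVIOUS block's arm on the current block mean.

    arms: "T"/"C" per block, in time order.
    block_means: post-washout metric mean per block, same order.
    Returns (estimate, z). Under the null of no carryover the estimate is near 0.
    """
    cells = {(cur, prev): [] for cur in "TC" for prev in "TC"}
    for i in range(1, len(arms)):
        cells[(arms[i], arms[i - 1])].append(block_means[i])

    est, var = 0.0, 0.0
    for cur in "TC":
        after_t, after_c = cells[(cur, "T")], cells[(cur, "C")]
        if len(after_t) < 2 or len(after_c) < 2:
            raise ValueError("need at least 2 blocks in every (current, previous) cell")
        # Average the previous-arm contrast over the two current arms.
        est += 0.5 * (mean(after_t) - mean(after_c))
        var += 0.25 * (variance(after_t) / len(after_t)
                       + variance(after_c) / len(after_c))
    return est, est / sqrt(var)
```

A large |z| here suggests that contamination survives the washout, which points toward lengthening the washout or the block. The z statistic treats block means as independent, so it is only a screening tool if block-level autocorrelation remains.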

Part 3 — Variance and sample-size computation

Provide the variance estimator and sample-size formula under block randomization. Give the explicit formulas, list the required inputs (and where each comes from), and show how within-block dependence and CUPED enter the calculation. A short numeric illustration of the method is welcome.

What This Part Should Cover

The two-sample sample-size formula and the matching variance-of-the-effect estimator at the block level.
A derivation of between-block variance $\sigma_b^2$ from unit-level variance, observations-per-block $m$, and within-block correlation $\bar\rho$ (a design-effect argument).
The explicit list of required inputs and their sources (historical baseline variance, $m$, ICC/ACF, MDE $\delta$, CUPED $R^2$, $\alpha$, power).
Treatment of residual block-level dependence (HAC/Newey–West) and how CUPED reduces the required $n$.
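
One way to wire these pieces together is sketched below, under the usual simplifying assumptions: equicorrelated observations within a block, so that $\sigma_b^2 = \frac{\sigma^2}{m}\left(1 + (m-1)\bar\rho\right)$; independent blocks; a normal approximation; and a block-level CUPED $R^2$ that scales the variance by $(1 - R^2)$. The function names and the numeric inputs are illustrative assumptions, not values from the question.

```python
from math import ceil
from statistics import NormalDist


def between_block_variance(sigma2_unit, m, rho_bar):
    """Variance of a block mean of m equicorrelated observations (design-effect form)."""
    return sigma2_unit / m * (1 + (m - 1) * rho_bar)


def blocks_per_arm(sigma2_block, delta, alpha=0.05, power=0.80, cuped_r2=0.0):
    """Two-sample normal-approximation sample size, in blocks per arm.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma_b^2 * (1 - R^2) / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    adjusted = sigma2_block * (1 - cuped_r2)  # CUPED shrinks block-level variance
    return ceil(2 * (z_alpha + z_beta) ** 2 * adjusted / delta ** 2)


# Illustrative inputs only: unit variance 4.0, 60 observations per block,
# average within-block correlation 0.05, MDE of 0.2 in metric units.
sigma2_b = between_block_variance(sigma2_unit=4.0, m=60, rho_bar=0.05)
print(blocks_per_arm(sigma2_b, delta=0.2))
print(blocks_per_arm(sigma2_b, delta=0.2, cuped_r2=0.3))
```

With these made-up inputs the calculation gives 104 blocks per arm without CUPED and 73 with a block-level $R^2$ of 0.3. If block means remain autocorrelated, the independence assumption is optimistic, and the required count should be inflated or the variance estimated with a HAC estimator.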

What a Strong Answer Covers

Across all parts, a strong candidate ties every design choice back to the three empirical facts and the spillover structure, and keeps the estimand and unit of randomization (the block) consistent from design through analysis. Beyond the per-part rubrics, look for:

Time confounding handled by design, not luck: an explicit randomization scheme across time-of-day and day-of-week (e.g., stratify by hour-of-week, $24\times7=168$ strata; rerandomize to balance T/C within strata) so neither arm is over-exposed to peaks or weekends.
Internal consistency: the washout chosen in Part 1, the lag horizon tested in Part 2, and the ICC/ACF used in Part 3 all reference the same timescales coherently.
CUPED integrated correctly: pre-period covariates at the same hour-of-week, $\theta$ estimated out-of-sample / via cross-fitting, variance-reduction $R^2$ flowed back into the power calculation.
Pragmatism and guardrails: one arm live at a time, no overlapping experiments on the same queue, terminals sharing drivers treated as one cluster, and a pre-registered analysis plan (metric, SEs, seed).
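
The CUPED point can be illustrated with a small block-level sketch. It assumes one outcome `y` and one pre-period covariate `x` per block (for example, the historical mean of the metric at the same hour-of-week), and it estimates $\theta$ for each block from the other fold only. The helper names and the index-modulo fold assignment are simplifications introduced here.

```python
from statistics import mean


def theta_hat(y, x):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def cuped_cross_fit(y, x, n_folds=2):
    """Return y - theta * (x - mean(x)), with theta for each block fit on the other folds."""
    x_bar = mean(x)
    adjusted = [0.0] * len(y)
    for k in range(n_folds):
        train = [i for i in range(len(y)) if i % n_folds != k]
        theta = theta_hat([y[i] for i in train], [x[i] for i in train])
        for i in range(k, len(y), n_folds):  # blocks held out of this fit
            adjusted[i] = y[i] - theta * (x[i] - x_bar)
    return adjusted


def diff_in_means(values, arms):
    """Treatment-minus-control difference in block means."""
    t = [v for v, a in zip(values, arms) if a == "T"]
    c = [v for v, a in zip(values, arms) if a == "C"]
    return mean(t) - mean(c)
```

The effect estimate is then `diff_in_means(cuped_cross_fit(y, x), arms)`. The reduction in the variance of the adjusted block outcomes relative to the raw ones is the $R^2$ that feeds the power calculation. In practice the folds would be drawn at random or by calendar week, so that fold membership is not aligned with the T/C rotation.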

Follow-up Questions

A standard rider-split A/B test is operationally simpler than a switchback. Given the spillovers described here, explain precisely why it is biased and when (if ever) you would still prefer it. (References the spillover premise across all parts.)
Suppose your Part 2 diagnostics show statistically significant carryover at lag 1 but the effect is small. Would you stop, lengthen the washout mid-run, or model it explicitly — and how does each choice affect the Part 3 power calculation?
How would you extend this single-airport design to simultaneously test across many airports, and what new sources of interference or heterogeneity would you have to control for?
If the treatment effect is suspected to be heterogeneous by time-of-day (e.g., larger during peaks), how would you change the analysis to estimate that, and what does it imply for the rollout decision?
