This question evaluates proficiency in experimental design, causal inference, randomization and contamination control, regression modeling with fixed effects and clustered standard errors, power calculation, and handling of compliance and robustness issues.
You are optimizing a delivery marketplace feature suspected to reduce cold-food incidents for bike couriers in dense zones. Design a 2-week switchback experiment at the city level that toggles the feature ON/OFF by equal-length time slots within each city. Be precise and address:
(A) Randomization: Choose a slot length L given an average order lifecycle of 45 minutes and a driver relocation/carryover horizon of 30 minutes. Justify L to minimize contamination and describe a block-randomization scheme that balances day-of-week and peak hours while preventing predictability.
(B) Assignment vs exposure: Define the difference between slot-level assignment (Intention-to-Treat) and realized exposure when some units operate in OFF slots but pick up spillover demand from neighboring ON slots. Specify what goes in the numerator/denominator for the primary metric (cold-food rate among biker deliveries), and show two denominator variants: include all deliveries (condition_label=0 and 1) vs include only deliveries with condition_label=1.
(C) Analysis model: Write the exact regression you would run (formula notation is fine) with city fixed effects and slot-of-week fixed effects, and cluster-robust SEs at the city×slot level. Explain how you would incorporate pre-period baselines or covariates (e.g., weather, surge, courier mix) for precision.
(D) Power: With baseline cold-food rate = 6%, target relative reduction = 10% (MDE = 0.6pp), average 120 eligible orders per slot, intracluster correlation (ICC) at the slot level = 0.02, and 14 days, estimate the number of switchbacks (ON↔OFF transitions per city) needed for 80% power at α=0.05. State assumptions and show the core calculation or code you would use.
(E) Diagnostics: List concrete randomization checks and balance tests you will run, and how you would test for carryover (e.g., leading indicators, excluding boundary intervals).
(F) Robustness: How would you handle partial compliance, missing telemetry, or shocks (major events) mid-test? Describe a principled decision rule to stop, extend, or rerun the test.
Design and Analyze a Switchback Experiment: Reducing Cold-Food Incidents for Bike Couriers
You are a data scientist on a delivery-marketplace team. A new feature is suspected to reduce cold-food incidents for bike couriers operating in dense urban zones. Because the feature acts on the marketplace itself (dispatch, batching, courier routing), a naive order-level A/B test would suffer interference: treating one order changes which courier is free for a neighboring control order.
Your task is to design a 2-week switchback experiment at the city level that toggles the feature ON/OFF in equal-length time slots within each city, then specify exactly how you would analyze it and decide whether to ship. Work through randomization, the assignment-vs-exposure distinction, the analysis model, a power calculation, diagnostics, and robustness.
Constraints & Assumptions
Experiment window: 14 days (2 weeks) per city.
Unit of toggling: the entire city for the duration of a time slot.
Order dynamics: average order lifecycle (accept → deliver) ≈ 45 minutes; courier relocation / demand carryover horizon ≈ 30 minutes.
Outcome: a binary cold_food_incident flag, restricted to mode = bike deliveries.
Exposure flag: each delivery carries a condition_label ∈ {0, 1} indicating whether it was actually served under the feature.
Power inputs (Part D): baseline cold-food rate p0 = 6%; target relative reduction 10% (MDE = 0.6 pp); average 120 eligible bike orders per slot; slot-level intracluster correlation ρ = 0.02; two-sided α = 0.05; power = 0.80.
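The order-dynamics constraints above already pin down the basic trade-off behind the slot length. The sketch below tabulates it for a few candidate values of L. It assumes (these are illustrative choices, not part of the question) that the guard band at the start of each slot equals the full 45 + 30 = 75 minute horizon and that the candidate lengths are 90, 120, 180 and 240 minutes; `slot_tradeoff` is a hypothetical helper name.

```python
# Illustrative sketch: longer slots leave a larger clean "core" but yield
# fewer slots (and therefore fewer switchbacks) per city in 14 days.
ORDER_LIFECYCLE_MIN = 45   # from the constraints
CARRYOVER_MIN = 30         # from the constraints
HORIZON_MIN = ORDER_LIFECYCLE_MIN + CARRYOVER_MIN  # 75-minute contamination horizon
EXPERIMENT_MIN = 14 * 24 * 60                      # 2-week window in minutes


def slot_tradeoff(slot_len_min, guard_min=HORIZON_MIN):
    """Return (slots per city in 14 days, fraction of each slot outside the guard band).

    Assumption: a guard band of `guard_min` minutes is trimmed from the
    start of every slot; no trimming at the end.
    """
    if slot_len_min <= guard_min:
        raise ValueError("slot must be longer than the guard band")
    slots_per_city = EXPERIMENT_MIN // slot_len_min
    core_fraction = (slot_len_min - guard_min) / slot_len_min
    return slots_per_city, core_fraction


for candidate in (90, 120, 180, 240):  # hypothetical candidate slot lengths
    n_slots, core = slot_tradeoff(candidate)
    print(f"L={candidate:>3} min  slots/city={n_slots:>3}  core fraction={core:.2f}")
```

Any defended choice of L sits somewhere on this curve; the table only makes the cost of each option explicit.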
Clarifying Questions to Ask
Before designing, a candidate should scope the problem by asking:
What is the decision the experiment informs: a global ship/no-ship, or a per-city rollout? This sets the estimand.
Can we run multiple cities in parallel, or is this a single-city test? (Critical for power and for identifying time fixed effects.)
What historical (pre-period) data is available per city and per time-of-week, and how clean is it?
How is condition_label set operationally: is non-compliance (ON slots where the feature silently fails, or OFF slots that pick up spillover) common?
Are there known major events (storms, sports, holidays) in the 2-week window that we should anticipate?
Is there a secondary/guardrail set of metrics (delivery time, courier earnings, cancellations) that must not regress?
Part A — Randomization
Choose a slot length L given the 45-minute order lifecycle and 30-minute relocation/carryover horizon, and justify it so as to minimize contamination across slot boundaries. Then describe a block-randomization scheme that balances day-of-week and peak hours while preventing the schedule from being predictable to operators.
What This Part Should Cover
A defended numeric choice of L tied to the 75-min contamination horizon (45-minute order lifecycle + 30-minute carryover), with the trade-off (more switchbacks vs. cleaner cores) made explicit.
A concrete treatment of guard bands / wash-in / wash-out and which timestamp attributes a delivery to a slot.
A stratified block scheme that balances DOW × time-of-day and a separate mechanism (run-length caps, de-synchronization, blinding) that defeats predictability.
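One way the stratified block scheme with a run-length cap could look is sketched below. Everything specific here is an assumption for illustration: 12 two-hour slots per day, peak slots at indices 5, 6, 9 and 10, a cap of 5 consecutive identical slots, and the helper names `longest_run` and `draw_schedule`. It balances ON/OFF exactly within each day and within each day's peak and off-peak strata; individual slot-of-week cells are balanced only in expectation. Using a different seed per city is one simple way to de-synchronize schedules.

```python
import random


def longest_run(schedule):
    """Length of the longest run of identical consecutive assignments."""
    best = run = 1
    for prev, cur in zip(schedule, schedule[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def draw_schedule(seed, n_days=14, slots_per_day=12,
                  peak_slots=(5, 6, 9, 10), max_run=5, max_tries=5000):
    """Draw one city's ON(1)/OFF(0) schedule by permuted blocks.

    Within every day, half of the peak slots and half of the off-peak
    slots are assigned ON. Whole schedules are redrawn (rejection
    sampling) until no run of identical slots exceeds `max_run`.
    """
    rng = random.Random(seed)
    peak = sorted(peak_slots)
    off_peak = [s for s in range(slots_per_day) if s not in peak]
    for _ in range(max_tries):
        schedule = []
        for _day in range(n_days):
            day = [0] * slots_per_day
            for stratum in (peak, off_peak):
                for s in rng.sample(stratum, len(stratum) // 2):
                    day[s] = 1
            schedule.extend(day)
        if longest_run(schedule) <= max_run:
            return schedule
    raise RuntimeError("no schedule satisfied the run-length cap")


schedule = draw_schedule(seed=7)  # the seed would be kept secret from operators
print(len(schedule), sum(schedule), longest_run(schedule))
```

Permuted blocks make the last slot of a block deducible from the earlier ones, so this kind of scheme still relies on blinding operators to the schedule rather than on randomness alone.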
Part B — Assignment vs. Exposure
Define the difference between slot-level assignment (the Intention-to-Treat lever) and realized exposure when some deliveries in OFF slots pick up spillover demand from neighboring ON slots (or ON-slot deliveries silently fail to trigger the feature). Then specify the numerator and denominator for the primary metric (cold-food rate among biker deliveries), and give two denominator variants: (1) include all deliveries (condition_label = 0 and 1); (2) include only deliveries with condition_label = 1.
What This Part Should Cover
A clear, level-correct contrast: assignment is slot-level and pre-determined; exposure is delivery-level and realized.
Why ITT (variant 1) is the unbiased, shippable policy effect and why variant 2 reintroduces selection bias by conditioning on a post-treatment variable.
Precise numerator/denominator definitions, including the bike-only and slot-core restrictions.
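The two denominator variants can be made concrete with a small helper. This is a minimal sketch: `mode`, `cold_food_incident` and `condition_label` come from the problem statement, while the record layout and the fields `slot_assigned_on` and `in_slot_core` are hypothetical names for the slot-level assignment and the guard-band filter.

```python
def cold_food_rate(deliveries, assigned_on, exposed_only=False):
    """Cold-food rate among bike deliveries in slot cores for one assignment arm.

    Numerator:   deliveries with cold_food_incident == 1.
    Denominator: variant 1 (exposed_only=False) keeps condition_label 0 and 1,
                 which is the ITT definition;
                 variant 2 (exposed_only=True) keeps condition_label == 1 only,
                 i.e. it conditions on a post-assignment variable.
    Returns None if the denominator is empty.
    """
    eligible = [
        d for d in deliveries
        if d["mode"] == "bike"
        and d["in_slot_core"]
        and d["slot_assigned_on"] == assigned_on
        and (not exposed_only or d["condition_label"] == 1)
    ]
    if not eligible:
        return None
    return sum(d["cold_food_incident"] for d in eligible) / len(eligible)


def itt_difference(deliveries):
    """ITT contrast: rate in ON-assigned slots minus rate in OFF-assigned slots."""
    return (cold_food_rate(deliveries, assigned_on=True)
            - cold_food_rate(deliveries, assigned_on=False))
```

Both variants share the same numerator rule; they differ only in which deliveries are allowed into the denominator, which is exactly where the selection problem in variant 2 enters.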
Part C — Analysis Model
Write the exact regression you would run (formula notation is fine) with city fixed effects and slot-of-week fixed effects, using cluster-robust standard errors at the city × slot level. Then explain how you would incorporate pre-period baselines or covariates (weather, surge, courier mix) to improve precision.
What This Part Should Cover
A correctly specified FE model (city FE, slot-of-week FE) with the treatment coefficient interpreted as the ITT effect in pp.
Clustering at the city × slot level justified by the assignment level / ICC.
A principled covariate / CUPED baseline strategy that reduces variance without biasing the effect, plus a note on pre-registration.
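One specification consistent with the points above is a delivery-level linear probability model. The notation is an assumption introduced here: i indexes bike deliveries, c cities, s slots, w(s) maps a slot to its slot-of-week, T is the slot-level assignment (not condition_label), the pre-period term is the city's historical cold-food rate for that slot-of-week, and X collects pre-registered covariates such as weather, surge and courier mix.

```latex
Y_{i,c,s} \;=\; \beta\, T_{c,s} \;+\; \alpha_c \;+\; \gamma_{w(s)}
          \;+\; \theta\, \bar{Y}^{\mathrm{pre}}_{c,\,w(s)}
          \;+\; X_{c,s}^{\top}\delta \;+\; \varepsilon_{i,c,s},
\qquad
\text{standard errors clustered on } (c, s).
```

In formula notation this would read roughly `cold_food_incident ~ assigned_on + C(city) + C(slot_of_week) + pre_rate + weather + surge + courier_mix`, with hypothetical column names. Here β is the ITT effect on the probability scale (multiply by 100 for percentage points), and the baseline and covariates should be measured before or independently of assignment so that they reduce variance without shifting β.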
Part D — Power
Using the inputs above (baseline 6%, MDE 0.6 pp, 120 orders/slot, ICC = 0.02, 14 days, α = 0.05, power 0.80), estimate the number of switchbacks (ON↔OFF transitions per city) needed for 80% power. State your assumptions and show the core calculation or code.
What This Part Should Cover
Correct use of the design effect to deflate the per-slot sample to effective observations.
A two-proportion power calculation yielding slots per arm and a total slot budget, with stated assumptions (and any deliberate conservatism flagged).
The key realization that one city in 2 weeks is underpowered, leading to a parallel-cities (or extended-duration) design, and a correctly scoped per-city transition count (bounded by that city's own slot count, not the experiment-wide total).
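The chain from design effect to slots per arm to cities can be sketched with the standard library only. The assumptions are loud ones: a normal-approximation two-proportion formula with unpooled variances, equal cluster sizes of 120, no orders lost to guard bands, no precision gain from fixed effects or covariates, and a slot length of 120 minutes chosen purely for illustration. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def slots_per_arm(p0=0.06, rel_reduction=0.10, orders_per_slot=120,
                  icc=0.02, alpha=0.05, power=0.80):
    """Slots needed per arm for a two-sided two-proportion test with clustering."""
    p1 = p0 * (1 - rel_reduction)          # treated rate under the target effect
    delta = p0 - p1                        # absolute MDE (0.6 pp with the defaults)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Orders per arm if deliveries were independent.
    n_indep = (z_alpha + z_beta) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / delta ** 2
    # Design effect deflates each slot's 120 orders to fewer effective observations.
    deff = 1 + (orders_per_slot - 1) * icc
    effective_per_slot = orders_per_slot / deff
    return ceil(n_indep / effective_per_slot)


def cities_needed(slot_len_min=120, days=14, **power_inputs):
    """Return (slots per arm, slots available per city, cities to run in parallel)."""
    per_arm = slots_per_arm(**power_inputs)
    slots_per_city = days * 24 * 60 // slot_len_min
    return per_arm, slots_per_city, ceil(2 * per_arm / slots_per_city)


per_arm, per_city, cities = cities_needed()
print(per_arm, per_city, cities)
```

Under these assumptions the requirement is several hundred slots per arm, well beyond the 168 two-hour slots a single city supplies in 14 days, so the budget has to be met with parallel cities or a longer run. The number of ON↔OFF transitions in any one city is then capped at that city's slot count minus one, whatever the experiment-wide total is.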
Part E — Diagnostics
List the concrete randomization / balance checks you will run, and describe how you would test for carryover (e.g., leading/lagging indicators, excluding boundary intervals).
What This Part Should Cover
A lead/lag (event-study) test for carryover, plus drop-the-boundary sensitivity and spillover mapping near city borders.
An explicit threshold or expectation for what "passes" (e.g., lag coefficient ≈ 0, ≥ ~90% of mass in slot cores).
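The regressors for a lead/lag test can be built directly from the realized schedule. The sketch below is illustrative: the helper names, the dictionary keys and the 75-minute default guard band are assumptions. The lag term carries the carryover test; the lead term acts as a placebo, because a randomized future assignment should not predict current outcomes.

```python
def lead_lag_features(schedule):
    """For each slot, return current, previous-slot and next-slot assignment.

    `schedule` is one city's list of 0/1 assignments in time order.
    Edge slots get None for the missing neighbour and would be dropped
    from the lead/lag regression.
    """
    rows = []
    for s, current in enumerate(schedule):
        rows.append({
            "slot": s,
            "assigned_on": current,
            "lag1_on": schedule[s - 1] if s > 0 else None,
            "lead1_on": schedule[s + 1] if s + 1 < len(schedule) else None,
        })
    return rows


def in_slot_core(minutes_since_slot_start, guard_min=75):
    """True if a delivery falls outside the guard band at the start of its slot."""
    return minutes_since_slot_start >= guard_min
```

Adding `lag1_on` and `lead1_on` to the primary regression gives the event-study coefficients, and re-estimating the treatment effect with and without the `in_slot_core` filter gives the drop-the-boundary sensitivity check.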
Part F — Robustness
Describe how you would handle partial compliance, missing telemetry, and shocks (major events) mid-test, and give a principled decision rule to stop, extend, or rerun the experiment.
What This Part Should Cover
ITT-as-primary with IV/2SLS (not a naive conditioned ratio) for the effect on the treated.
A pre-registered missing-telemetry policy (MCAR vs. correlated; IPW / imputation; a hard drop threshold) and shock handling via an event calendar + covariates / exclusion.
A concrete stop / extend / rerun rule tied to compliance gap, carryover test results, realized power, and valid-inference constraints on peeking.
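With a single binary instrument (slot assignment) and no covariates, the 2SLS estimate reduces to the Wald ratio of the ITT effect on the outcome to the ITT effect on exposure. The sketch below shows only that point estimate; the function name, the argument names and the 0.1 first-stage floor are illustrative assumptions, and a real analysis would add cluster-robust inference.

```python
def wald_late(y_on, y_off, d_on, d_off, min_first_stage=0.1):
    """Wald / IV estimate of the effect of exposure on the cold-food rate.

    y_on, y_off: mean cold_food_incident in ON- and OFF-assigned slots.
    d_on, d_off: share of deliveries with condition_label == 1 in each arm.
    min_first_stage: hypothetical floor below which the ratio is refused,
                     since a near-zero denominator makes the estimate unstable.
    """
    itt_outcome = y_on - y_off
    first_stage = d_on - d_off          # compliance gap between arms
    if abs(first_stage) < min_first_stage:
        raise ValueError("first stage too weak for a stable IV estimate")
    return itt_outcome / first_stage


# Hypothetical inputs for illustration only (not results from any experiment).
print(wald_late(y_on=0.055, y_off=0.060, d_on=0.90, d_off=0.10))
```

The same compliance gap that forms the denominator here is a natural input to a stop / extend / rerun rule: when it collapses, the ITT is diluted toward zero and the IV estimate loses precision.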
What a Strong Answer Covers
Across all parts, the interviewer is looking for an experimentalist who treats the switchback as an interference-control device and reasons end-to-end:
Estimand discipline: ITT as the primary, shippable effect throughout; exposure-conditioned quantities recovered only via IV, never as naive ratios.
Internal consistency: the slot length, guard bands, clustering level, FE structure, and power calculation all reference the same unit of assignment (city × slot) and the same contamination horizon.
Quantitative correctness: the design effect, two-proportion math, and the slots → cities → per-city-transitions chain are arithmetically sound and the conservatisms are named.
Pre-registration mindset: covariates, missingness thresholds, decision rules, and any interim looks are committed in advance to protect α.
Follow-up Questions
If the realized first stage is weak (ON and OFF slots show nearly identical exposure), what does that imply for both the ITT and the IV estimates, and how would you respond?
Suppose the lead/lag test shows a significant lag coefficient. Walk through exactly what you change in the next iteration of the design.
You observe that cold-food rate has strong time-of-day heterogeneity (much worse at dinner peak). How would you estimate and report heterogeneous treatment effects without inflating false positives?
Marketing wants to ship after one week because the point estimate "looks good." Explain precisely why that is unsafe under your design and what inference procedure would make an early look legitimate.
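For the time-of-day heterogeneity follow-up, one common way to keep false positives in check across a pre-registered set of subgroups is a Holm step-down correction of the subgroup p-values. The sketch below is one option among several and is not prescribed by the question; the function name and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure controlling the family-wise error rate.

    Returns a list of booleans, aligned with `p_values`, marking which
    hypotheses are rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for four time-of-day segments.
print(holm_reject([0.001, 0.04, 0.03, 0.2]))
```

The segments and the correction method would be fixed before unblinding, in line with the pre-registration mindset described earlier.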