Design and analyze a switchback experiment

Q: Design and analyze a switchback experiment

This question evaluates proficiency in experimental design, causal inference, randomization and contamination control, regression modeling with fixed effects and clustered standard errors, power calculation, and handling of compliance and robustness issues.

Q: What difficulty level is this interview question?

This is a hard difficulty Analytics & Experimentation question, commonly asked during Technical Screen rounds at DoorDash.

Q: What role is this question designed for?

This question is commonly asked for Data Scientist candidates at DoorDash during technical interviews.

Question

Design and Analyze a Switchback Experiment: Reducing Cold-Food Incidents for Bike Couriers

You are a data scientist on a delivery-marketplace team. A new feature is suspected to reduce cold-food incidents for bike couriers operating in dense urban zones. Because the feature acts on the marketplace itself (dispatch, batching, courier routing), a naive order-level A/B test would suffer interference: treating one order changes which courier is free for a neighboring control order.

Your task is to design a 2-week switchback experiment at the city level that toggles the feature ON/OFF in equal-length time slots within each city, then specify exactly how you would analyze it and decide whether to ship. Work through randomization, the assignment-vs-exposure distinction, the analysis model, a power calculation, diagnostics, and robustness.

Constraints & Assumptions

Experiment window: 14 days (2 weeks) per city.
Unit of toggling: the entire city for the duration of a time slot.
Order dynamics: average order lifecycle (accept → deliver) ≈ 45 minutes; courier relocation / demand carryover horizon ≈ 30 minutes.
Outcome: a binary cold_food_incident flag, restricted to mode = bike deliveries.
Exposure flag: each delivery carries a condition_label ∈ {0, 1} indicating whether it was actually served under the feature.
Power inputs (Part D): baseline cold-food rate $p_0 = 6\%$; target relative reduction $10\%$ (MDE $= 0.6$ pp); average 120 eligible bike orders per slot; slot-level intracluster correlation $\rho = 0.02$; two-sided $\alpha = 0.05$; power $= 0.80$.

Clarifying Questions to Ask

Before designing, a candidate should scope the problem by asking:

What is the decision the experiment informs — a global ship/no-ship, or a per-city rollout? This sets the estimand.
Can we run multiple cities in parallel, or is this a single-city test? (Critical for power and for identifying time fixed effects.)
What historical (pre-period) data is available per city and per time-of-week, and how clean is it?
How is condition_label set operationally — is non-compliance (ON slots where the feature silently fails, or OFF slots that pick up spillover) common?
Are there known major events (storms, sports, holidays) in the 2-week window that we should anticipate?
Is there a secondary/guardrail set of metrics (delivery time, courier earnings, cancellations) that must not regress?

Part A — Randomization

Choose a slot length $L$ given the 45-minute order lifecycle and 30-minute relocation/carryover horizon, and justify it so as to minimize contamination across slot boundaries. Then describe a block-randomization scheme that balances day-of-week and peak hours while preventing the schedule from being predictable to operators.

What This Part Should Cover

A defended numeric choice of $L$ tied to the 75-min contamination horizon (45-min order lifecycle + 30-min carryover), with the trade-off (more switchbacks vs. cleaner cores) made explicit.
A concrete treatment of guard bands / wash-in / wash-out and which timestamp attributes a delivery to a slot.
A stratified block scheme that balances DOW × time-of-day and a separate mechanism (run-length caps, de-synchronization, blinding) that defeats predictability.
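The block scheme above can be sketched in a few lines. This is a minimal illustration, not a prescribed answer: the 2-hour slot length, the 4-slot day-part blocks, the run-length cap of 3, and the function names are all assumptions made for the example. Each day is split into day-part blocks, each block gets an equal number of ON and OFF slots in random order, and schedules with overly long runs are rejected and redrawn.

```python
import random
from itertools import groupby


def longest_run(schedule):
    """Length of the longest stretch of identical consecutive assignments."""
    return max(len(list(group)) for _, group in groupby(schedule))


def permuted_block_schedule(n_days=14, slots_per_day=12, block_size=4,
                            max_run=3, seed=None, max_tries=10_000):
    """Return a 0/1 assignment per slot, in time order, for one city.

    Assumptions (hypothetical, for illustration): 2-hour slots (12 per day)
    and day-part blocks of 4 slots, each holding exactly 2 ON and 2 OFF.
    """
    if slots_per_day % block_size or block_size % 2:
        raise ValueError("block_size must be even and divide slots_per_day")
    rng = random.Random(seed)  # seed=None draws from OS entropy
    blocks_per_day = slots_per_day // block_size
    for _ in range(max_tries):
        schedule = []
        for _day in range(n_days):
            for _part in range(blocks_per_day):
                block = [1] * (block_size // 2) + [0] * (block_size // 2)
                rng.shuffle(block)  # random order inside the stratum
                schedule.extend(block)
        # Rejection step: redraw if any arm persists longer than max_run slots.
        if longest_run(schedule) <= max_run:
            return schedule
    raise RuntimeError("no schedule satisfied the run-length cap")


schedule = permuted_block_schedule(seed=2024)
print(len(schedule), sum(schedule), longest_run(schedule))
```

Every day and every day-part is balanced by construction. The last slot of a block is determined by the earlier ones, so small blocks trade some unpredictability for balance; keeping the seed private and drawing a separate schedule per city are the usual mitigations.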

Part B — Assignment vs. Exposure

Define the difference between slot-level assignment (the Intention-to-Treat lever) and realized exposure when some deliveries in OFF slots pick up spillover demand from neighboring ON slots (or ON-slot deliveries silently fail to trigger the feature). Then specify the numerator and denominator for the primary metric (cold-food rate among bike deliveries), and give two denominator variants: (1) include all deliveries (condition_label = 0 and 1); (2) include only deliveries with condition_label = 1.

What This Part Should Cover

A clear, level-correct contrast: assignment is slot-level and pre-determined; exposure is delivery-level and realized.
Why ITT (variant 1) is the unbiased, shippable policy effect and why variant 2 reintroduces selection bias by conditioning on a post-treatment variable.
Precise numerator/denominator definitions, including the bike-only and slot-core restrictions.
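To make the two denominator variants concrete, here is a small sketch on made-up toy rows. The fields `mode`, `cold_food_incident`, and `condition_label` come from the prompt; `assigned` (the slot's arm) and `in_core` (delivery falls outside any guard band) are hypothetical names introduced for the example.

```python
def cold_food_rate(deliveries, assigned_arm, exposed_only=False):
    """Cold-food incidents divided by eligible deliveries for one assigned arm.

    Eligible = bike deliveries in the slot core. With exposed_only=True the
    denominator is further restricted to condition_label == 1 (variant 2).
    """
    eligible = [
        d for d in deliveries
        if d["mode"] == "bike" and d["in_core"] and d["assigned"] == assigned_arm
        and (not exposed_only or d["condition_label"] == 1)
    ]
    if not eligible:
        return float("nan")
    return sum(d["cold_food_incident"] for d in eligible) / len(eligible)


def row(assigned, label, cold, mode="bike", in_core=True):
    """Build one toy delivery record (illustrative data only)."""
    return {"assigned": assigned, "condition_label": label,
            "cold_food_incident": cold, "mode": mode, "in_core": in_core}


toy = [
    row(1, 1, 0), row(1, 1, 0), row(1, 1, 0), row(1, 0, 1),  # ON slots
    row(0, 0, 1), row(0, 0, 0), row(0, 0, 1), row(0, 1, 0),  # OFF slots
    row(1, 1, 1, mode="car"),     # excluded: not a bike delivery
    row(0, 0, 1, in_core=False),  # excluded: falls in a guard band
]

itt_on = cold_food_rate(toy, 1)                        # variant 1, ON arm
itt_off = cold_food_rate(toy, 0)                       # variant 1, OFF arm
exposed_on = cold_food_rate(toy, 1, exposed_only=True)  # variant 2, ON arm
print(itt_on, itt_off, exposed_on)
```

In the toy rows, the one ON-slot delivery where the feature failed to trigger is also the one that went cold, so variant 2 drops it and makes the ON arm look better than the policy actually performed. That is the selection problem in miniature.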

Part C — Analysis Model

Write the exact regression you would run (formula notation is fine) with city fixed effects and slot-of-week fixed effects, using cluster-robust standard errors at the city × slot level. Then explain how you would incorporate pre-period baselines or covariates (weather, surge, courier mix) to improve precision.

What This Part Should Cover

A correctly specified FE model (city FE, slot-of-week FE) with the treatment coefficient interpreted as the ITT effect in percentage points (pp).
Clustering at the city × slot level justified by the assignment level / ICC.
A principled covariate / CUPED baseline strategy that reduces variance without biasing the effect, plus a note on pre-registration.
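One way to write the specification described above, offered as a sketch rather than the required form. The notation is assumed for illustration: $i$ indexes bike deliveries, $c$ cities, $s$ slots, $w(s)$ the slot-of-week of slot $s$, $T_{cs}$ the slot-level assignment, and $X_{ics}$ pre-registered covariates such as weather, surge, and courier mix.

$$
y_{ics} = \alpha_c + \gamma_{w(s)} + \beta \, T_{cs} + X_{ics}^{\top} \delta + \varepsilon_{ics}
$$

Here $y_{ics}$ is the cold_food_incident flag, $\alpha_c$ and $\gamma_{w(s)}$ are the city and slot-of-week fixed effects, and $\beta$ is the ITT effect on the probability scale (multiply by 100 for pp). Standard errors are clustered on the pair $(c, s)$, which is the level at which $T_{cs}$ varies.

A CUPED-style baseline adjustment at the slot level, assuming a pre-period cold-food rate $\bar{x}_{cs}$ for the same city and slot-of-week is available, would take the form

$$
\tilde{y}_{cs} = \bar{y}_{cs} - \theta \left( \bar{x}_{cs} - \bar{x} \right), \qquad \theta = \frac{\operatorname{Cov}(\bar{y}_{cs}, \bar{x}_{cs})}{\operatorname{Var}(\bar{x}_{cs})}.
$$

Because $\bar{x}_{cs}$ is measured before assignment, the adjustment changes the variance of the estimate but not its expectation.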

Part D — Power

Using the inputs above (baseline $6\%$, MDE $0.6$ pp, 120 orders/slot, ICC $= 0.02$, 14 days, $\alpha = 0.05$, power $0.80$), estimate the number of switchbacks (ON↔OFF transitions per city) needed for 80% power. State your assumptions and show the core calculation or code.

What This Part Should Cover

Correct use of the design effect to deflate the per-slot sample to effective observations.
A two-proportion power calculation yielding slots per arm and a total slot budget, with stated assumptions (and any deliberate conservatism flagged).
The key realization that one city in 2 weeks is underpowered, leading to a parallel-cities (or extended-duration) design, and a correctly scoped per-city transition count (bounded by that city's own slot count, not the experiment-wide total).
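The chain from design effect to slots to cities can be sketched as follows, using the Part D inputs. Several choices here are assumptions rather than part of the prompt: the unpooled two-proportion normal approximation, a hypothetical 2-hour slot length (the prompt leaves $L$ open), no variance reduction from fixed effects or covariates, and slots treated as independent clusters.

```python
from math import ceil
from statistics import NormalDist


def slots_per_arm(p0, rel_reduction, orders_per_slot, icc,
                  alpha=0.05, power=0.80):
    """Slots needed per arm for a two-sided two-proportion test.

    Returns (slots, design_effect, effective_orders_per_arm).
    """
    p1 = p0 * (1 - rel_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Unpooled-variance sample size in *independent* orders per arm.
    n_eff = ((z_alpha + z_beta) ** 2
             * (p0 * (1 - p0) + p1 * (1 - p1)) / (p0 - p1) ** 2)
    # Design effect deflates each slot's orders to effective observations.
    deff = 1 + (orders_per_slot - 1) * icc
    effective_per_slot = orders_per_slot / deff
    return ceil(n_eff / effective_per_slot), deff, n_eff


slots, deff, n_eff = slots_per_arm(0.06, 0.10, 120, 0.02)
total_slots = 2 * slots

SLOT_HOURS = 2                                   # assumption, not from the prompt
slots_per_city = 14 * 24 // SLOT_HOURS           # slots available in one city
cities_needed = ceil(total_slots / slots_per_city)
max_transitions_per_city = slots_per_city - 1    # upper bound per city

print(f"design effect={deff:.2f}, effective orders/arm={n_eff:.0f}")
print(f"slots/arm={slots}, total slots={total_slots}")
print(f"cities needed={cities_needed}, "
      f"max transitions per city={max_transitions_per_city}")
```

Under these assumptions the design effect is 3.38 and the requirement comes out at roughly 660 slots per arm, far more than the 168 slots a single city offers in 14 days at 2-hour slots, which is why the calculation points to about 8 parallel cities. The number of ON↔OFF transitions in any one city can never exceed its own slot count minus one.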

Part E — Diagnostics

List the concrete randomization / balance checks you will run, and describe how you would test for carryover (e.g., leading/lagging indicators, excluding boundary intervals).

What This Part Should Cover

Pre-registered balance tests: pre-period outcome balance, in-test covariate SMDs, stratum-balance verification.
A lead/lag (event-study) test for carryover, plus drop-the-boundary sensitivity and spillover mapping near city borders.
An explicit threshold or expectation for what "passes" (e.g., lag coefficient ≈ 0, ≥ ~90% of mass in slot cores).
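A simplified version of the lag check can be sketched as a contrast among OFF slots: compare slots that follow an ON slot with slots that follow an OFF slot. This is a stand-in for the lag coefficient in a full event-study regression, not a replacement for it; the function name and the toy numbers are made up for illustration.

```python
from statistics import fmean


def carryover_contrast(assignments, rates):
    """Mean OFF-slot rate after an ON slot minus mean OFF-slot rate after OFF.

    `assignments` and `rates` are per-slot lists for one city, in time order.
    A value near zero is what the absence of carryover would look like.
    """
    after_on, after_off = [], []
    for prev, cur, rate in zip(assignments, assignments[1:], rates[1:]):
        if cur == 0:  # only look at slots that are currently OFF
            (after_on if prev == 1 else after_off).append(rate)
    if not after_on or not after_off:
        raise ValueError("need OFF slots following both ON and OFF slots")
    return fmean(after_on) - fmean(after_off)


# Toy slot-level cold-food rates (illustrative only): OFF slots right after
# an ON slot look slightly better, which would suggest carryover.
assignments = [1, 0, 0, 1, 0, 0, 1, 0]
rates = [0.050, 0.055, 0.060, 0.050, 0.055, 0.060, 0.050, 0.055]

contrast = carryover_contrast(assignments, rates)
print(f"carryover contrast: {contrast:+.4f}")
```

In practice the contrast would be estimated with the same fixed effects and clustering as the primary model, and the pass threshold would be fixed before the data are seen.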

Part F — Robustness

Describe how you would handle partial compliance, missing telemetry, and shocks (major events) mid-test, and give a principled decision rule to stop, extend, or rerun the experiment.

What This Part Should Cover

ITT-as-primary with IV/2SLS (not a naive conditioned ratio) for the effect on the treated.
A pre-registered missing-telemetry policy (MCAR vs. correlated; IPW / imputation; a hard drop threshold) and shock handling via an event calendar + covariates / exclusion.
A concrete stop / extend / rerun rule tied to compliance gap, carryover test results, realized power, and valid-inference constraints on peeking.
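The IV point estimate with a single binary instrument reduces to the Wald ratio: the ITT effect on the outcome divided by the ITT effect on exposure. The sketch below uses illustrative numbers, and the `min_first_stage` cutoff is a hypothetical guard, not an established threshold. The ratio is only interpretable if assignment affects outcomes solely through exposure, which spillover between slots can violate.

```python
def wald_late(itt_outcome, exposure_on, exposure_off, min_first_stage=0.1):
    """Effect on the treated via the Wald ratio (2SLS with one binary instrument).

    itt_outcome  : difference in cold-food rate, ON-assigned minus OFF-assigned
    exposure_on  : share of deliveries with condition_label == 1 in ON slots
    exposure_off : share of deliveries with condition_label == 1 in OFF slots
    """
    first_stage = exposure_on - exposure_off  # compliance gap
    if abs(first_stage) < min_first_stage:
        # A near-zero denominator makes the ratio explode; report ITT only.
        raise ValueError("first stage too weak for a stable IV estimate")
    return itt_outcome / first_stage


# Illustrative inputs: a -0.6 pp ITT with 90% exposure in ON slots
# and 10% spillover exposure in OFF slots.
late = wald_late(-0.006, 0.90, 0.10)
print(f"first stage = 0.80, LATE = {late:.4f}")
```

The same compliance gap feeds the decision rule: if it is small, the ITT is diluted toward zero and the IV estimate is unstable, so the sensible response is to fix the exposure pipeline and rerun rather than to interpret either number.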

What a Strong Answer Covers

Across all parts, the interviewer is looking for an experimentalist who treats the switchback as an interference-control device and reasons end-to-end:

Estimand discipline: ITT as the primary, shippable effect throughout; exposure-conditioned quantities recovered only via IV, never as naive ratios.
Internal consistency: the slot length, guard bands, clustering level, FE structure, and power calculation all reference the same unit of assignment (city × slot) and the same contamination horizon.
Quantitative correctness: the design effect, two-proportion math, and the slots → cities → per-city-transitions chain are arithmetically sound and the conservatisms are named.
Pre-registration mindset: covariates, missingness thresholds, decision rules, and any interim looks are committed in advance to protect $\alpha$.

Follow-up Questions

If the realized first stage is weak (ON and OFF slots show nearly identical exposure), what does that imply for both the ITT and the IV estimates, and how would you respond?
Suppose the lead/lag test shows a significant lag coefficient. Walk through exactly what you change in the next iteration of the design.
You observe that cold-food rate has strong time-of-day heterogeneity (much worse at dinner peak). How would you estimate and report heterogeneous treatment effects without inflating false positives?
Marketing wants to ship after one week because the point estimate "looks good." Explain precisely why that is unsafe under your design and what inference procedure would make an early look legitimate.
