A/B Test Paradox Across Two Cities
You ran an A/B test in two geographies, City X and City Y. Within each city, variant A outperforms variant B. However, when the data are pooled across cities, the combined result shows variant B performing better than A.
Interpret the conflicting results, decide what you would do for rollout, and describe additional analysis needed before rollout.
Constraints & Assumptions
- Treat city as a possible confounder or stratification variable.
- Check whether randomization was balanced within city.
- Define the rollout estimand before choosing aggregate or city-level results.
- Consider heterogeneous treatment effects and data-quality issues.
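One way to check whether allocation was balanced within a city is a chi-square goodness-of-fit test on the arm counts. The sketch below assumes a two-arm test with an intended 50/50 split; the function name `srm_p_value` and the counts are hypothetical, not taken from the experiment.

```python
import math


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm allocation check.

    n_a, n_b: observed users in variants A and B within one city.
    expected_share_a: intended share of traffic for A (assumed 0.5 here).
    """
    n = n_a + n_b
    exp_a = n * expected_share_a
    exp_b = n - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical per-city arm counts: (users in A, users in B).
counts = {"X": (200, 800), "Y": (800, 200)}
for city, (n_a, n_b) in counts.items():
    print(city, srm_p_value(n_a, n_b))
```

A very small p-value in a city suggests the observed split does not match the intended one, which is the condition under which pooling across cities can mislead.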
Clarifying Questions to Ask
- Was randomization stratified by city or global across both cities?
- Are baseline outcomes very different between the two cities?
- Are traffic weights expected to match future rollout traffic?
- Are sample sizes large enough within each city?
Part 1 - Interpretation
How would you interpret the conflicting results?
What This Part Should Cover
- Recognize Simpson's paradox or aggregation bias.
- Explain how a different city traffic mix in each variant, combined with different city baselines, can reverse the pooled result.
- Check sample ratio mismatch (SRM), assignment balance, logging, and confidence intervals.
- Distinguish a genuine statistical contradiction from results that answer different estimands.
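The reversal can be reproduced with a small numeric sketch. All counts below are hypothetical and chosen only to illustrate the mechanism: City X has a high baseline and most of its traffic in B, while City Y has a low baseline and most of its traffic in A.

```python
# Hypothetical (users, conversions) per city and variant.
data = {
    "X": {"A": (200, 40), "B": (800, 144)},  # high-baseline city, mostly B
    "Y": {"A": (800, 40), "B": (200, 8)},    # low-baseline city, mostly A
}


def city_rates(data):
    """Conversion rate per variant within each city."""
    return {
        city: {variant: conv / users for variant, (users, conv) in arms.items()}
        for city, arms in data.items()
    }


def pooled_rates(data):
    """Conversion rate per variant after summing counts across cities."""
    totals = {}
    for arms in data.values():
        for variant, (users, conv) in arms.items():
            total_users, total_conv = totals.get(variant, (0, 0))
            totals[variant] = (total_users + users, total_conv + conv)
    return {variant: conv / users for variant, (users, conv) in totals.items()}


by_city = city_rates(data)
pooled = pooled_rates(data)
print(by_city)
print(pooled)
```

In this sketch A has the higher rate in both cities, yet B has the higher pooled rate, because B's pooled sample is dominated by the high-baseline city. The pooled comparison is therefore measuring city mix as much as treatment.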
Part 2 - Rollout Decision
What decision would you make for rollout?
What This Part Should Cover
- Avoid launching solely based on the pooled result when city is a confounder.
- Prefer stratified or pre-specified weighted estimates aligned with rollout traffic.
- Consider city-specific rollout if effects or risks differ.
- Require guardrail checks and sufficient power.
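A weighted estimate aligned with rollout traffic can be sketched as a direct standardization: take the A minus B difference within each city, then average those differences using the expected post-launch traffic shares. The rates, the weights, and the function name `standardized_lift` below are all hypothetical.

```python
def standardized_lift(rates_by_city, weights):
    """Weighted average of within-city (A - B) differences.

    rates_by_city: {city: {"A": rate, "B": rate}}
    weights: {city: expected share of rollout traffic}, must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * (rates_by_city[city]["A"] - rates_by_city[city]["B"])
        for city, w in weights.items()
    )


# Hypothetical within-city conversion rates.
rates = {
    "X": {"A": 0.20, "B": 0.18},
    "Y": {"A": 0.05, "B": 0.04},
}

# Assumed rollout traffic mix; it should be pre-specified, not tuned after the fact.
rollout_weights = {"X": 0.5, "Y": 0.5}
print(standardized_lift(rates, rollout_weights))
```

Because A leads in both cities, any non-negative weights give a positive standardized lift here. The weights change the size of the estimate, not its sign, which is why this estimate is a safer basis for a rollout decision than the raw pooled comparison.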
Part 3 - Additional Analysis
What additional analysis is needed before rollout?
What This Part Should Cover
- Estimate treatment effects by city with confidence intervals.
- Use stratified analysis, fixed effects, meta-analysis, or regression with city controls.
- Check sample ratio mismatch, covariate balance, experiment exposure, and metric definitions.
- Examine whether future traffic mix will resemble the test.
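Per-city effects with confidence intervals can be sketched with a normal-approximation (Wald) interval for a difference in proportions. This is one simple choice among several, it assumes a binary conversion metric, and the counts and the function name `diff_ci` are hypothetical.

```python
import math
from statistics import NormalDist


def diff_ci(n_a, x_a, n_b, x_b, level=0.95):
    """Wald interval for the difference in conversion rates (A - B).

    n_a, n_b: users per variant; x_a, x_b: conversions per variant.
    Returns (difference, lower bound, upper bound).
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical per-city counts: (users_A, conv_A, users_B, conv_B).
cities = {
    "X": (200, 40, 800, 144),
    "Y": (800, 40, 200, 8),
}
for city, counts in cities.items():
    print(city, diff_ci(*counts))
```

With these hypothetical counts, A leads in both cities but both intervals include zero, which is the "positive but underpowered" situation. A stratified or regression-based analysis that combines the cities would typically give a tighter interval than either city alone.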
Follow-up Questions
- How would you explain this to a product manager?
- What if city-level results are positive but underpowered?
- How would the decision change if traffic mix will shift after launch?