Experiment Aggregation Bias and Heterogeneity: Simpson's Paradox, Robust Estimation, and Decisioning
Context
You ran a randomized experiment measuring conversion rate uplift. The pooled (aggregate) analysis shows a +1.2 percentage point (pp) uplift. When the results are stratified by user segment, the treatment effects are −0.5pp for New users and +2.0pp for Returning users.
Your goal:
- Explain how differences in segment mix across arms can produce aggregation bias (often referred to as Simpson’s paradox in this context).
- Pre-register an analysis plan that protects against this bias.
- Specify how to test for treatment heterogeneity and how that translates to a rollout decision.
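The mechanism behind the first goal can be illustrated with a small arithmetic sketch. All counts below are hypothetical, chosen only so that the segment effects come out at −0.5pp and +2.0pp while the pooled difference is +1.2pp; they are not taken from the experiment. The treatment arm is assumed to contain a larger share of Returning users (53.6% vs. 50.0%), who convert at a higher baseline rate, so part of the pooled difference reflects mix rather than treatment.

```python
# Hypothetical counts per arm and segment: (users, conversions).
# These numbers are illustrative assumptions, not data from the experiment.
control = {"new": (50_000, 5_000), "returning": (50_000, 10_000)}
treatment = {"new": (46_400, 4_408), "returning": (53_600, 11_792)}


def rate(users, conversions):
    """Conversion rate for one cell."""
    return conversions / users


def pooled_rate(arm):
    """Conversion rate of an arm, ignoring segments."""
    users = sum(n for n, _ in arm.values())
    conversions = sum(x for _, x in arm.values())
    return conversions / users


def returning_share(arm):
    """Fraction of the arm's users who are Returning."""
    return arm["returning"][0] / sum(n for n, _ in arm.values())


# Within-segment effects (treatment minus control).
segment_effects = {s: rate(*treatment[s]) - rate(*control[s]) for s in control}

# Pooled effect mixes the treatment effect with the difference in segment mix.
pooled_effect = pooled_rate(treatment) - pooled_rate(control)

# Re-weighting both arms to a common mix (combined segment shares)
# removes the part of the pooled difference that comes from mix alone.
total_users = sum(n for arm in (control, treatment) for n, _ in arm.values())
common_weights = {
    s: (control[s][0] + treatment[s][0]) / total_users for s in control
}
mix_adjusted_effect = sum(common_weights[s] * segment_effects[s] for s in control)

for s, effect in segment_effects.items():
    print(f"{s:>9}: {100 * effect:+.2f}pp")
print(f"returning share: control {returning_share(control):.1%}, "
      f"treatment {returning_share(treatment):.1%}")
print(f"pooled effect:       {100 * pooled_effect:+.2f}pp")
print(f"mix-adjusted effect: {100 * mix_adjusted_effect:+.2f}pp")
```

Under these assumed counts the pooled rates are 15.0% (control) and 16.2% (treatment), while the effect re-weighted to a common segment mix is about +0.8pp, so roughly 0.4pp of the pooled +1.2pp comes from the mix shift. Note that +1.2pp lies between the two segment effects, so the same pooled number could also arise with identical mixes if Returning users made up 68% of both arms; the segment shares per arm are what distinguish the two situations.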
Tasks
(a) Construct a concrete numeric example where the overall uplift is +1.2pp even though the segment-level effects are −0.5pp (New) and +2.0pp (Returning), due to a shift in segment mix.
(b) Pre-register a stratified estimator (or a mixed model for repeated measures, MMRM, as an alternative) to estimate a population-average effect that is robust to aggregation bias.
(c) Specify an interaction test for heterogeneity and a decision rule (segment-specific rollout vs. global ship) when heterogeneity is material.
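As a starting point for tasks (b) and (c), the following is a minimal sketch of one possible approach, not a prescribed solution. It assumes a post-stratified difference in proportions with weights equal to each segment's share of all users (treated as fixed), a two-sided z-test on the difference between the two segment effects as the interaction test, and a hypothetical decision rule with an assumed significance level `ALPHA = 0.05`. The counts are illustrative, and names such as `segment_effect` and `decide` are invented for this example.

```python
import math

# Hypothetical counts per segment (illustrative, not experiment data):
# (n_control, conversions_control, n_treatment, conversions_treatment)
data = {
    "new": (50_000, 5_000, 46_400, 4_408),
    "returning": (50_000, 10_000, 53_600, 11_792),
}
ALPHA = 0.05  # assumed pre-registered significance level


def segment_effect(n_c, x_c, n_t, x_t):
    """Difference in conversion rates and its unpooled variance."""
    p_c, p_t = x_c / n_c, x_t / n_t
    variance = p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t
    return p_t - p_c, variance


effects = {s: segment_effect(*row) for s, row in data.items()}

# Stratified estimator: weight each segment effect by its population share.
total = sum(n_c + n_t for n_c, _, n_t, _ in data.values())
weights = {s: (row[0] + row[2]) / total for s, row in data.items()}
strat_effect = sum(weights[s] * effects[s][0] for s in data)
strat_se = math.sqrt(sum(weights[s] ** 2 * effects[s][1] for s in data))
ci_low = strat_effect - 1.96 * strat_se
ci_high = strat_effect + 1.96 * strat_se

# Interaction test: do the two segment effects differ?
diff = effects["returning"][0] - effects["new"][0]
z = diff / math.sqrt(effects["returning"][1] + effects["new"][1])
p_interaction = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value


def decide(effects, p_interaction, ci_low, alpha=ALPHA):
    """Hypothetical rule: treat heterogeneity as material when the
    interaction is significant and the segment effects differ in sign."""
    point_estimates = [effect for effect, _ in effects.values()]
    signs_differ = min(point_estimates) < 0 < max(point_estimates)
    if p_interaction < alpha and signs_differ:
        return "segment-specific"
    return "global ship" if ci_low > 0 else "no ship"


decision = decide(effects, p_interaction, ci_low)
print(f"stratified effect: {100 * strat_effect:+.2f}pp "
      f"(95% CI {100 * ci_low:+.2f}pp to {100 * ci_high:+.2f}pp)")
print(f"interaction z = {z:.2f}, p = {p_interaction:.2g}")
print(f"decision: {decision}")
```

With these assumed counts the stratified estimate is about +0.8pp and the interaction is strongly significant with opposite-signed segment effects, so the rule returns `segment-specific`. A pre-registered plan would also need to fix the weights, the significance level, and the definition of "material" before the data are unblinded; the choices here are placeholders.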