Email campaign experiment with Simpson’s paradox
A marketing team tests a new email variant B vs control A.
The experiment ran for two weeks in two cities (e.g., SF and NY). Within each city-week segment, variant B appears to outperform A. But when the data from all segments are pooled, variant A appears to outperform B.
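To make the setup concrete, here is a minimal sketch of how such a reversal can arise. All counts are invented for illustration, and "conversions" is a hypothetical stand-in for whatever outcome the team measures. The only assumption doing the work is that variant A was sent mostly in the high-converting city and variant B mostly in the low-converting one.

```python
# Hypothetical (sends, conversions) per variant in each city-week segment.
# These numbers are made up to illustrate the reversal; they are not real data.
segments = {
    ("SF", 1): {"A": (800, 160), "B": (200, 44)},
    ("SF", 2): {"A": (800, 168), "B": (200, 46)},
    ("NY", 1): {"A": (200, 10), "B": (800, 48)},
    ("NY", 2): {"A": (200, 12), "B": (800, 56)},
}


def rate(sends, conversions):
    """Conversion rate for one arm of one segment."""
    return conversions / sends


# Within every segment, B has the higher conversion rate.
b_wins_each_segment = all(
    rate(*arms["B"]) > rate(*arms["A"]) for arms in segments.values()
)

# Pool sends and conversions across segments for each variant.
totals = {}
for arm in ("A", "B"):
    sends = sum(arms[arm][0] for arms in segments.values())
    conversions = sum(arms[arm][1] for arms in segments.values())
    totals[arm] = (sends, conversions)

pooled = {arm: rate(*totals[arm]) for arm in totals}

for key, arms in segments.items():
    print(key, {arm: round(rate(*counts), 3) for arm, counts in arms.items()})
print("pooled", {arm: round(r, 3) for arm, r in pooled.items()})
```

In this made-up data B leads in all four segments, yet A leads after pooling, because the pooled rate for each variant is dominated by the city where most of its emails were sent.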
Questions
- Explain how this situation can occur (Simpson’s paradox) in the context of this experiment.
- How would you determine which email is actually “better” for a product decision?
- What is your primary metric?
- What diagnostic/guardrail metrics would you check?
- What checks would you run to detect issues like imbalance, confounding, or time effects?
- Can you compute a confidence interval (or significance test) for the effect in a way that is robust to the paradox? If yes, how?
- If you suspect the paradox is caused by experimental flaws (e.g., allocation imbalance), what would you recommend doing next?
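One possible way to approach the imbalance check and the confidence-interval question is sketched below. It is not a definitive method. It assumes a 50/50 allocation was intended in every segment, uses invented counts with a hypothetical "conversions" outcome, and swaps in two standard techniques: a chi-square goodness-of-fit test for sample ratio mismatch, and a stratified difference in rates with Mantel-Haenszel-style weights and a simple Wald (normal-approximation) interval.

```python
import math

# Hypothetical (sends, conversions) per variant in each city-week segment.
segments = {
    ("SF", 1): {"A": (800, 160), "B": (200, 44)},
    ("SF", 2): {"A": (800, 168), "B": (200, 46)},
    ("NY", 1): {"A": (200, 10), "B": (800, 48)},
    ("NY", 2): {"A": (200, 12), "B": (800, 56)},
}


def srm_p_value(n_a, n_b):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) against an
    assumed 50/50 intended split between the two variants."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def stratified_difference(data, z=1.96):
    """Weighted average of per-segment (B minus A) rate differences.

    Weights are n_A * n_B / (n_A + n_B) per segment. The variance treats the
    weights as fixed and uses the per-segment Wald variance of a difference
    in proportions, so the interval is only a large-sample approximation.
    """
    weight_sum = 0.0
    weighted_diff = 0.0
    weighted_var = 0.0
    for arms in data.values():
        n_a, x_a = arms["A"]
        n_b, x_b = arms["B"]
        p_a, p_b = x_a / n_a, x_b / n_b
        w = n_a * n_b / (n_a + n_b)
        var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
        weight_sum += w
        weighted_diff += w * (p_b - p_a)
        weighted_var += w ** 2 * var
    estimate = weighted_diff / weight_sum
    std_err = math.sqrt(weighted_var) / weight_sum
    return estimate, estimate - z * std_err, estimate + z * std_err


srm = {key: srm_p_value(arms["A"][0], arms["B"][0]) for key, arms in segments.items()}
estimate, ci_low, ci_high = stratified_difference(segments)

for key, p in srm.items():
    print(key, "SRM p-value:", p)
print("stratified B - A:", estimate, "95% CI:", (ci_low, ci_high))
```

Because the comparison is made within each segment before averaging, the estimate cannot flip direction purely from uneven allocation across segments. A very small SRM p-value, as in this invented data, suggests the allocation itself deserves investigation before trusting any estimate; if assignment was not randomized within segments, a stratified interval does not repair that.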