A marketing team wants to evaluate a new email campaign. Two email versions, A and B, were tested over two weeks in two cities: San Francisco and New York. The pooled data suggest that version A has a higher conversion rate than version B, but when the results are broken down by city (and possibly by week), version B appears better in every subgroup.
How would you determine whether the new email is actually better? In your answer, discuss:
- what metric(s) you would define as primary and which guardrail metrics you would monitor,
- how Simpson's paradox can arise in this setting,
- what confounders or sources of imbalance you would check,
- how you would re-analyze the existing data,
- whether and how you can compute confidence intervals, and
- how you would redesign the experiment if the original allocation across cities or time was imbalanced.
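
To make the scenario concrete, here is a minimal Python sketch of how the reversal can appear and how a stratified re-analysis might look. All counts are hypothetical and chosen only so that version A is sent mostly in the higher-converting city; the names `data`, `pooled_rate`, and `stratified_diff` are illustrative and not part of the question. The interval is a simple normal-approximation (Wald) interval for a city-weighted difference in conversion rates, which is one reasonable choice among several.

```python
from math import sqrt

# Hypothetical counts (sends, conversions) per city and version.
# Allocation is deliberately imbalanced: A is sent mostly in San Francisco,
# B mostly in New York, and San Francisco converts better overall.
data = {
    "San Francisco": {"A": (1000, 200), "B": (200, 50)},
    "New York": {"A": (200, 10), "B": (1000, 80)},
}


def pooled_rate(version):
    """Conversion rate for one version, ignoring city."""
    sends = sum(city[version][0] for city in data.values())
    conversions = sum(city[version][1] for city in data.values())
    return conversions / sends


def stratified_diff(z=1.96):
    """City-weighted difference (B minus A) with a Wald confidence interval.

    Each city is weighted by its share of all sends, so the estimate
    compares A and B within cities instead of across them.
    """
    total = sum(n for city in data.values() for n, _ in city.values())
    diff, var = 0.0, 0.0
    for city in data.values():
        (n_a, x_a), (n_b, x_b) = city["A"], city["B"]
        p_a, p_b = x_a / n_a, x_b / n_b
        w = (n_a + n_b) / total
        diff += w * (p_b - p_a)
        var += w ** 2 * (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half_width = z * sqrt(var)
    return diff, diff - half_width, diff + half_width


if __name__ == "__main__":
    print("pooled A:", pooled_rate("A"), "pooled B:", pooled_rate("B"))
    for name, city in data.items():
        rates = {v: x / n for v, (n, x) in city.items()}
        print(name, rates)
    print("stratified B - A (estimate, lower, upper):", stratified_diff())
```

With these made-up numbers, A looks better when pooled while B is better in each city, because the pooled comparison mixes the version effect with the city mix. The stratified estimate removes that imbalance, but it does not address other confounders such as week or audience differences, which is why a re-randomized design with balanced allocation within each city and week is still worth discussing.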