Diagnose Discrepancy in A/B Test Conversion Rate Results
An e-commerce company plans to send personalized marketing emails to increase purchase conversions. An initial experiment showed a large lift, but after broader rollout a later test found a much smaller lift.
Constraints & Assumptions
- Focus on rigorous experiment design and post-hoc diagnosis, not on building the personalization model itself.
- Assume user-level email eligibility, send, open, click, purchase, unsubscribe, and revenue data are available.
- Conversion lift should be measured causally against a control group.
- Consider both statistical explanations and real product or implementation explanations.
Clarifying Questions to Ask
- What was the original population, and how did it differ from the later rollout population?
- Were send time, subject line, cadence, offer, and creative held constant?
- Was assignment persistent at the user level?
- Was the originally reported 20% lift relative or absolute, and over what conversion window was it measured?
Part 1 - Design the A/B Test
Design an A/B test to measure whether personalized emails increase conversion rate.
What This Part Should Cover
- Unit of randomization, exposure rules, control and treatment definitions, and eligibility criteria.
- Primary metric such as purchase conversion, plus secondary and guardrail metrics like revenue, unsubscribes, spam complaints, margin, and long-term retention.
- Statistical test, analysis window, variance reduction, minimum detectable effect, sample size, and expected duration.
- Instrumentation checks, sample-ratio mismatch checks, and pre-registration of success criteria.
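The link between minimum detectable effect and sample size can be made concrete with a rough calculation. The following is a minimal sketch, assuming a two-sided two-proportion z-test with equal arm sizes; the function name, the 5% baseline conversion rate, and the alpha and power defaults are illustrative assumptions, not values from the scenario.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline_rate is the control conversion rate; relative_lift is the
    minimum detectable effect expressed relative to that baseline.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    # Standard deviation terms under the null (pooled) and the alternative.
    sd_null = math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    sd_alt = math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    n = ((z_alpha * sd_null + z_power * sd_alt) / (p2 - p1)) ** 2
    return math.ceil(n)


# Hypothetical 5% baseline: compare a 20% and a 2% relative lift.
n_for_20_percent = sample_size_per_arm(0.05, 0.20)
n_for_2_percent = sample_size_per_arm(0.05, 0.02)
print(n_for_20_percent, n_for_2_percent)
```

Under these assumptions, detecting a 2% relative lift needs roughly two orders of magnitude more users per arm than detecting a 20% lift, which is why the minimum detectable effect should be fixed before the test starts.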
Part 2 - Diagnose the Lift Discrepancy
After full rollout, a new director reruns the test and observes only a 2% lift instead of the original 20%. List plausible causes and the analyses you would run.
What This Part Should Cover
- Differences in population, seasonality, campaign creative, offer, email deliverability, model version, product changes, and competitor or market conditions.
- Statistical issues such as underpowered tests, novelty effects, peeking, multiple testing, regression to the mean, or sample-ratio mismatch.
- Implementation problems such as treatment contamination, incorrect logging, duplicate sends, personalization not actually applied, or inconsistent attribution windows.
- Segment analysis and reanalysis using the original experiment definition where possible.
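One way to test the "underpowered original test" explanation is to put a confidence interval around each relative lift and see whether the two estimates are actually incompatible. The sketch below assumes independent binomial arms and uses a log-ratio (delta method) interval; the function name and all conversion counts are invented for illustration and are not data from the scenario.

```python
import math
from statistics import NormalDist


def relative_lift_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Relative lift (p_t / p_c - 1) with a log-ratio confidence interval."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    log_ratio = math.log(p_t / p_c)
    # Standard error of log(p_t / p_c) for two independent binomial arms.
    se = math.sqrt((1.0 - p_t) / conv_t + (1.0 - p_c) / conv_c)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = math.exp(log_ratio - z * se) - 1.0
    high = math.exp(log_ratio + z * se) - 1.0
    return p_t / p_c - 1.0, low, high


# Hypothetical small pilot: 2,000 users per arm, 100 vs 120 purchases.
pilot = relative_lift_ci(conv_c=100, n_c=2_000, conv_t=120, n_t=2_000)
# Hypothetical large rerun: 500,000 users per arm, 25,000 vs 25,500 purchases.
rerun = relative_lift_ci(conv_c=25_000, n_c=500_000, conv_t=25_500, n_t=500_000)
print("pilot lift and interval:", pilot)
print("rerun lift and interval:", rerun)
```

With these made-up counts the pilot's interval is wide enough to contain a 2% lift, so the two results would not contradict each other; with real data the same comparison indicates whether statistical noise alone is a sufficient explanation or whether population and implementation differences need to be investigated.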
What a Strong Answer Covers
A strong answer designs a clean user-level experiment, quantifies power and launch criteria, and diagnoses the later discrepancy with specific checks rather than vague speculation.
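A sample-ratio mismatch test is one example of such a specific check. The sketch below is a chi-square goodness-of-fit test with one degree of freedom, written with the standard library only; the function name, the 50/50 intended split, and the user counts are assumptions for illustration.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """P-value for a sample-ratio mismatch between two arms.

    Compares observed arm sizes with the intended split using a
    chi-square goodness-of-fit test with one degree of freedom.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_treatment - expected_treatment) ** 2 / expected_treatment
        + (n_control - expected_control) ** 2 / expected_control
    )
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical arm sizes under an intended 50/50 split.
print(srm_p_value(50_000, 50_100))  # small imbalance
print(srm_p_value(50_000, 51_500))  # larger imbalance
```

A very small p-value suggests that assignment, eligibility filtering, or logging differs between arms, in which case the lift estimate should not be trusted until the cause is found.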
Follow-up Questions
- How would you design a long-term holdout after rollout?
- What if open rate increases but purchase conversion does not?
- How would you explain relative versus absolute lift to a non-technical stakeholder?
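For the relative versus absolute lift question, a small worked example often communicates better than definitions. The sketch below uses hypothetical conversion rates (5% in control, 6% in treatment) and a hypothetical helper name; none of these numbers come from the scenario.

```python
def describe_lift(control_rate, treatment_rate, emails_sent=100_000):
    """Return absolute lift, relative lift, and extra purchases per send volume."""
    absolute = treatment_rate - control_rate   # difference in rates
    relative = absolute / control_rate         # change relative to control
    extra_purchases = absolute * emails_sent   # what the business actually gains
    return absolute, relative, extra_purchases


absolute, relative, extra = describe_lift(0.05, 0.06)
print(f"absolute lift: {absolute * 100:.1f} percentage points")
print(f"relative lift: {relative * 100:.0f}%")
print(f"extra purchases per 100,000 emails: {extra:.0f}")
```

The same change reads as 1 percentage point in absolute terms and 20% in relative terms, so stating which one is meant, together with the baseline rate, avoids most stakeholder confusion.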