Diagnose uplift drop in email A/B tests
Company: Coinbase
Role: Data Scientist
Category: Analytics & Experimentation
Difficulty: hard
Interview Round: Onsite
An e-commerce company is testing personalized product emails to improve 7-day purchase conversion. Design the experiment and then debug conflicting rerun results.
Part A — Design and sizing
1) Define a precise primary metric and 2–3 guardrails. Assume an intent-to-treat analysis with user-level randomization. State exposure/eligibility rules and how to handle multiple emails per user.
2) Sample size: Baseline 7-day purchase conversion is 3.5%. Detect a 10% relative lift (two-sided α=0.05, power=0.80), 1:1 allocation. With 500,000 eligible users/day and 85% deliverability, how many calendar days must the test run (include a full 7-day attribution window)? Show the formula and numeric result.
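The sizing in (2) can be sketched with the standard two-proportion formula. This assumes the normal approximation with unpooled variances and treats delivered users (85% of eligible) as the effective enrollment rate, an assumption a candidate should state explicitly since strict ITT would count all randomized users:

```python
import math
from scipy.stats import norm

def n_per_arm_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation, unpooled variances)."""
    z_a = norm.ppf(1 - alpha / 2)          # ~1.96
    z_b = norm.ppf(power)                  # ~0.84
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var_sum / (p2 - p1) ** 2

p1, p2 = 0.035, 0.035 * 1.10               # 10% relative lift -> 3.85%
n = math.ceil(n_per_arm_two_proportions(p1, p2))
print(n)                                   # ~45,359 per arm (~90,718 total)

# 500k eligible/day x 85% deliverability ~ 425k reached users/day,
# so enrollment finishes within one day; then wait the full 7-day
# attribution window before reading the metric.
enroll_days = math.ceil(2 * n / (500_000 * 0.85))
print(enroll_days + 7)                     # 8 calendar days
```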
3) Now suppose the business wants to power for 7-day revenue per randomized user (mean $0.90, SD $12.00). Detect a +$0.10 absolute lift with the same α and power. What per-arm sample size and run length does this imply?
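For the revenue metric in (3), the same structure applies with the difference-in-means formula; a minimal sketch under the same α, power, and enrollment-rate assumptions as above:

```python
import math
from scipy.stats import norm

def n_per_arm_means(sd, delta, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided difference-in-means test, equal variances."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 * sd ** 2 / delta ** 2

# Mean $0.90, SD $12.00, detect +$0.10 absolute
n = math.ceil(n_per_arm_means(sd=12.00, delta=0.10))
print(n)                  # ~226,048 per arm (~452,096 total)

enroll_days = math.ceil(2 * n / (500_000 * 0.85))
print(enroll_days + 7)    # 2 enrollment days + 7-day attribution = 9 calendar days
```

Note the ~5x larger sample than the conversion sizing: the high revenue SD relative to the target lift (coefficient of variation over 13) dominates.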
Part B — Conflicting results and diagnostics
The initial RCT ran 2025-06-01 to 2025-06-14 with per-arm n=1,200,000. Control conversion=3.50%, Treatment=4.20% (+20.0% relative, +0.70 pp). A rerun on 2025-08-15 to 2025-08-28 with per-arm n=900,000 observed Control=3.50%, Treatment=3.57% (+2.0% relative, +0.07 pp).
4) For each test, compute the two-proportion z-test p-value and 95% CI for the absolute lift; then compute a fixed-effects meta-analytic pooled lift across the two tests. Should you launch? Why?
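The computation in (4) can be sketched as follows (unpooled SE for the CI, inverse-variance weights for the fixed-effects pooling). Worth flagging in an answer: the two estimates are far apart relative to their SEs, so the fixed-effects pooled lift should be read with heterogeneity in mind:

```python
import math
from scipy.stats import norm

def two_prop_summary(pc, pt, n):
    """Two-proportion z-test with equal per-arm n; returns lift, SE, p, 95% CI."""
    diff = pt - pc
    se = math.sqrt(pc * (1 - pc) / n + pt * (1 - pt) / n)  # unpooled SE
    z = diff / se
    p = 2 * norm.sf(abs(z))
    return diff, se, p, (diff - 1.96 * se, diff + 1.96 * se)

d1, se1, p1, ci1 = two_prop_summary(0.0350, 0.0420, 1_200_000)  # June run
d2, se2, p2, ci2 = two_prop_summary(0.0350, 0.0357,   900_000)  # August rerun

# Fixed-effects (inverse-variance) pooled absolute lift
w1, w2 = 1 / se1 ** 2, 1 / se2 ** 2
pooled = (w1 * d1 + w2 * d2) / (w1 + w2)

print(p1, ci1)    # p ~ 0 (z ~ 28), CI ~ [+0.65 pp, +0.75 pp]
print(p2, ci2)    # p ~ 0.011,      CI ~ [+0.02 pp, +0.12 pp]
print(pooled)     # ~ +0.42 pp absolute
```

Both runs are individually significant at 5%, but the effect sizes conflict by an order of magnitude, which is exactly what questions (5) and (6) ask you to diagnose before trusting the pooled number.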
5) List at least 6 plausible causes for the discrepancy (e.g., seasonality, targeting drift, novelty/creative fatigue, regression to the mean/winner’s curse, instrumentation/attribution changes, concurrency with promos, contamination, different triggered eligibility, population mix-shift). For each, specify 1–2 concrete checks (SQL or plots) you would run and the exact data you’d need.
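One of the checks listed in (5), population mix-shift, can be sketched as a composition comparison between the two run periods. The schema (a `run` label and a `segment` column such as region or device) and the toy data are hypothetical:

```python
# Hypothetical schema: one row per randomized user, with the run it
# belonged to and a segment label. Column names are assumptions.
import pandas as pd
from scipy.stats import chi2_contingency

def mix_shift_check(df):
    """Chi-square test of segment composition between the two runs."""
    counts = pd.crosstab(df["run"], df["segment"])
    chi2, p, dof, _ = chi2_contingency(counts)
    shares = counts.div(counts.sum(axis=1), axis=0)  # per-run segment shares
    return shares, p

# Toy data illustrating a shift toward mobile in the rerun
df = pd.DataFrame({
    "run":     ["june"] * 1000 + ["august"] * 1000,
    "segment": ["desktop"] * 600 + ["mobile"] * 400
             + ["desktop"] * 450 + ["mobile"] * 550,
})
shares, p = mix_shift_check(df)
print(shares)
print(p)   # small p => populations differ; re-weight or segment the analysis
```

The same crosstab-and-compare pattern covers several of the other checks (eligibility-rule drift, delivery rates by cohort, attribution-window coverage).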
6) Propose a re-analysis plan: pre-registration, CUPED or pre-period covariate adjustment, heterogeneity-of-treatment-effects by region/device/recency, sequential monitoring corrections, and a holdout strategy for ramp. Describe decisions you would make if the pooled lift is between +0% and +5%.
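The CUPED adjustment named in (6) is compact enough to sketch. This is a minimal illustration on simulated data, assuming pre-experiment revenue as the covariate; the coefficients in the simulation are arbitrary:

```python
import numpy as np

def cuped(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = cov(y, x) / var(x),
    where x is a pre-period covariate (e.g. pre-experiment revenue)."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Simulated check: adjusted-metric variance drops when y and x correlate
rng = np.random.default_rng(0)
x = rng.normal(0.90, 12.0, 100_000)            # pre-period revenue
y = 0.5 * x + rng.normal(0.45, 6.0, 100_000)   # in-experiment revenue
y_adj = cuped(y, x)
print(y.var() / y_adj.var())   # variance reduction factor, ~2x here
```

Since the adjustment is mean-preserving within each arm, the treatment-effect estimate is unchanged in expectation while its confidence interval narrows, which is what makes CUPED useful for re-analyzing a borderline pooled lift.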
Quick Answer: This question evaluates a data scientist's competence in experimental design, metric and guardrail definition, power and sample-size calculation, statistical inference (two-proportion testing and fixed-effects meta-analysis), and debugging inconsistent A/B test reruns through instrumentation, population-shift, and heterogeneity checks. Interviewers ask it to assess whether a candidate can operationalize randomized email experiments, set run lengths and attribution windows, and diagnose conflicting results; it sits in the Analytics & Experimentation domain and tests practical application grounded in conceptual statistical understanding.