PracHub

Diagnose uplift drop in email A/B tests

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competence in experimental design, metric definition and guardrail selection, power and sample-size calculations, statistical inference (including two-proportion testing and fixed-effects meta-analysis), and debugging inconsistent A/B test reruns through instrumentation, population-shift, and heterogeneity checks. It is commonly asked because interviewers must assess the ability to operationalize randomized email experiments, set run lengths and attribution windows, and diagnose conflicting results using applied analytics; the problem sits in the Analytics & Experimentation domain and tests practical application grounded in conceptual statistical understanding.


Company: Coinbase

Role: Data Scientist

Category: Analytics & Experimentation

Difficulty: hard

Interview Round: Onsite



Oct 13, 2025, 9:49 PM

Personalized Product Emails Experiment — Design, Sizing, and Debugging Conflicting Reruns

Context

An e-commerce company plans to A/B test personalized product emails to improve 7-day purchase conversion. Users will be randomized at the user level (intent-to-treat). Some users may receive multiple emails during the test window.
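
One way to keep randomization at the user level, so that every email sent to the same user falls in the same arm, is to derive the arm from a hash of the user ID. The sketch below is illustrative only: the function name `assign_arm`, the experiment key `personalized_email_v1`, and the choice of SHA-256 are assumptions, not part of the question.

```python
import hashlib

ARMS = ("control", "treatment")  # 1:1 allocation


def assign_arm(user_id: str, experiment: str = "personalized_email_v1") -> str:
    """Deterministically map a user to an arm (hypothetical helper).

    Hashing the experiment key together with the user ID means the same
    user always lands in the same arm for this experiment, regardless of
    how many emails they are eligible for during the test window.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return ARMS[int(digest, 16) % 2]


# Repeated calls for the same user return the same arm.
print(assign_arm("user_123"), assign_arm("user_123"))
```

Under intent-to-treat, outcomes would then be analyzed by the assigned arm for every randomized user, whether or not an email was delivered or opened.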

Part A — Design and Sizing

  1. Define:
    • A precise primary metric.
    • 2–3 guardrail metrics.
    • Exposure/eligibility rules.
    • How to handle multiple emails per user.
  2. Sample size for conversion:
    • Baseline 7-day purchase conversion = 3.5%.
    • Detect a 10% relative lift (two-sided α = 0.05, power = 0.80), 1:1 allocation.
    • There are 500,000 eligible users/day and 85% deliverability.
    • How many calendar days must the test run? Include a full 7-day attribution window. Show the formula and the numeric result.
  3. Sample size for revenue:
    • Power for 7-day revenue per randomized user (mean $0.90, SD $12.00).
    • Detect a +$0.10 absolute lift with the same α and power.
    • What per-arm sample size and run length does this imply?
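
The sizing in items 2 and 3 can be sketched with the standard normal-approximation formulas: for proportions, n per arm = (z_{α/2} + z_β)² · [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)², and for means, n per arm = 2 · (z_{α/2} + z_β)² · σ² / δ². The code below is a minimal sketch under those formulas. The function names are hypothetical, and how the 85% deliverability enters is an assumption: it is shown both ways (all 500,000 eligible users counted under intent-to-treat, or only the 425,000 deliverable users counted per day).

```python
import math
from statistics import NormalDist


def z_sum_sq(alpha: float = 0.05, power: float = 0.80) -> float:
    """(z_{alpha/2} + z_beta)^2 for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


def n_per_arm_proportions(p_control, relative_lift, alpha=0.05, power=0.80):
    """Per-arm n to detect a relative lift in a conversion rate (unpooled variance)."""
    p_treat = p_control * (1 + relative_lift)
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil(z_sum_sq(alpha, power) * variance_sum / (p_treat - p_control) ** 2)


def n_per_arm_means(sd, abs_lift, alpha=0.05, power=0.80):
    """Per-arm n to detect an absolute lift in a mean, equal SD in both arms."""
    return math.ceil(2 * z_sum_sq(alpha, power) * sd ** 2 / abs_lift ** 2)


def calendar_days(n_per_arm, users_per_day, attribution_days=7):
    """Enrollment days (two arms) plus the full attribution window."""
    enroll_days = math.ceil(2 * n_per_arm / users_per_day)
    return enroll_days + attribution_days


ELIGIBLE_PER_DAY = 500_000
DELIVERABLE_PER_DAY = ELIGIBLE_PER_DAY * 0.85  # assumption: only delivered users count

n_conv = n_per_arm_proportions(p_control=0.035, relative_lift=0.10)
n_rev = n_per_arm_means(sd=12.00, abs_lift=0.10)

for label, n in (("conversion", n_conv), ("revenue", n_rev)):
    print(label, "per-arm n:", n,
          "| days (all eligible):", calendar_days(n, ELIGIBLE_PER_DAY),
          "| days (deliverable only):", calendar_days(n, DELIVERABLE_PER_DAY))
```

Because daily traffic is large relative to the required sample, the 7-day attribution window dominates the run length in both cases. If the 10% lift is assumed to apply only to delivered emails, the intent-to-treat effect is diluted and the conversion sample size should be recomputed with a smaller `relative_lift`.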

Part B — Conflicting Results and Diagnostics

The initial RCT ran 2025-06-01 to 2025-06-14 with per-arm n = 1,200,000. Control conversion = 3.50%, Treatment = 4.20% (+20.0% relative, +0.70 pp). A rerun on 2025-08-15 to 2025-08-28 with per-arm n = 900,000 observed Control = 3.50%, Treatment = 3.57% (+2.0% relative, +0.07 pp).

  1. For each test, compute the two-proportion z-test p-value and 95% CI for the absolute lift. Then compute a fixed-effects meta-analytic pooled absolute lift across the two tests. Should you launch? Why?
  2. List at least 6 plausible causes for the discrepancy (e.g., seasonality, targeting drift, novelty/creative fatigue, regression to the mean/winner’s curse, instrumentation/attribution changes, concurrency with promos, contamination, different triggered eligibility, population mix-shift). For each, specify 1–2 concrete checks (SQL or plots) you would run and the exact data you’d need.
  3. Propose a re-analysis plan: pre-registration, CUPED or pre-period covariate adjustment, heterogeneity-of-treatment-effects by region/device/recency, sequential monitoring corrections, and a holdout strategy for ramp. Describe decisions you would make if the pooled lift is between +0% and +5%.
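
The calculations requested in item 1 of Part B can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes a pooled standard error for the z-test, an unpooled (Wald) standard error for the confidence interval, and inverse-variance weights for the fixed-effects estimate. The function names are hypothetical, and Cochran's Q is included as an extra heterogeneity check that the question does not explicitly ask for.

```python
import math


def two_prop_test(n_c, p_c, n_t, p_t, z_crit=1.959964):
    """Two-proportion z-test and 95% Wald CI for the absolute lift."""
    diff = p_t - p_c
    p_pool = (n_c * p_c + n_t * p_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return {"diff": diff, "z": z, "p_value": p_value, "se": se,
            "ci": (diff - z_crit * se, diff + z_crit * se)}


def fixed_effects(results, z_crit=1.959964):
    """Inverse-variance pooled absolute lift, its 95% CI, and Cochran's Q."""
    weights = [1 / r["se"] ** 2 for r in results]
    pooled = sum(w * r["diff"] for w, r in zip(weights, results)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    q = sum(w * (r["diff"] - pooled) ** 2 for w, r in zip(weights, results))
    return {"pooled": pooled, "se": se, "q": q,
            "ci": (pooled - z_crit * se, pooled + z_crit * se)}


# Inputs taken from the problem statement (per-arm n and conversion rates).
june = two_prop_test(n_c=1_200_000, p_c=0.0350, n_t=1_200_000, p_t=0.0420)
august = two_prop_test(n_c=900_000, p_c=0.0350, n_t=900_000, p_t=0.0357)
meta = fixed_effects([june, august])

for name, r in (("June", june), ("August", august)):
    print(f"{name}: diff={r['diff']:.4%} z={r['z']:.2f} "
          f"p={r['p_value']:.3g} CI=({r['ci'][0]:.4%}, {r['ci'][1]:.4%})")
print(f"Pooled: {meta['pooled']:.4%} "
      f"CI=({meta['ci'][0]:.4%}, {meta['ci'][1]:.4%}) Q={meta['q']:.1f}")
```

A fixed-effects model assumes both tests estimate one common effect. A Q value far above its degrees of freedom (here 1) would indicate that this assumption is doubtful, in which case the pooled lift should be read cautiously and the diagnostics in item 2 take priority over a launch decision.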
