This question evaluates a candidate's ability to design and analyze randomized experiments. It covers statistical power and sample-size calculation, cluster-robust variance adjustments, covariate adjustment (CUPED), hypothesis and guardrail specification, ramping and monitoring strategies, and causal-inference methods such as geo-level difference-in-differences, and sits in the Analytics & Experimentation domain for data scientist roles. It is commonly asked to assess practical application of experimental statistics and operational decision-making under real-world constraints: balancing statistical rigor and multiple-testing control against issues like repeat users, seasonality, and delayed attribution. Answering well requires both conceptual understanding of causal inference and hands-on analytical skill.

You are designing a 14-day, 50/50 user-level randomized A/B test for a marketplace's search-ranking change that favors nearby merchants. DAU ≈ 200,000. Baseline conversion p0 = 0.12. Expected relative lift on conversion = +5% (so p1 = 0.126). Guardrails: the cancellation rate must not increase by more than 0.2 percentage points (pp), and average delivery time must not worsen by more than 3 minutes.
Answer the following:
(a) Compute the minimum per-arm sample size for conversion using a two-sided z-test with α = 0.05 and power = 0.80. Show the formula and the numeric steps assuming independent Bernoulli trials. Then discuss how user clustering and repeat sessions/orders inflate variance and how to correct for this (e.g., a variance inflation factor from an empirical design effect, or cluster-robust SEs). (Worked sketch after the list.)
(b) Specify primary, secondary, and guardrail metrics; pre-register the hypotheses; and define the decision rule combining effect size and statistical significance (include the MDE and the non-inferiority thresholds for the guardrails). (Non-inferiority sketch after the list.)
(c) Describe a CUPED or pre-period covariate adjustment using user-level 28-day pretest conversion propensity; give the adjusted estimator and explain how you would validate the variance reduction (A/A tests, placebo checks). (Estimator sketch after the list.)
(d) Outline a ramp plan (1% → 10% → 50% → 100%), monitoring for novelty and learning effects, weekday/seasonality controls, and how to handle attribution and tracking delays (events arriving up to 48h late). (Delay-handling sketch after the list.)
(e) If the test is geo-split (city-level) instead of user-split, propose a difference-in-differences setup with city and calendar fixed effects; list the identifying assumptions and how you'd test for pre-trend balance. (Regression sketch after the list.)
(f) Explain how you will monitor and correct for peeking and multiple metrics (alpha-spending such as O'Brien–Fleming boundaries; FDR control for many guardrails) and define a rollback plan. (Monitoring sketch after the list.)
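Hedged sketches for each part follow; any file names, column names, thresholds, or data values not stated in the question are illustrative assumptions, not part of the problem.

Sketch for (a): a minimal Python calculation of the per-arm sample size under independent Bernoulli trials, followed by a design-effect correction. The inputs come from the question; the design effect of 1.6 is an assumed placeholder you would estimate from pre-period data.

```python
# Per-arm sample size for a two-proportion z-test (two-sided alpha = 0.05,
# power = 0.80), then an illustrative design-effect (clustering) correction.
from scipy.stats import norm

alpha, power = 0.05, 0.80
p0 = 0.12
p1 = p0 * 1.05              # +5% relative lift -> 0.126
delta = p1 - p0             # absolute lift = 0.006 (0.6 pp)

z_a = norm.ppf(1 - alpha / 2)   # 1.9600
z_b = norm.ppf(power)           # 0.8416

p_bar = (p0 + p1) / 2
# Pooled variance under H0, unpooled under H1:
n = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
     + z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2 / delta ** 2
print(f"n per arm (i.i.d. Bernoulli): {n:,.0f}")      # ~47,000

# Repeat sessions/orders correlate observations within a user, inflating
# variance by roughly deff = 1 + (m_bar - 1) * ICC, where m_bar is the mean
# number of units per user and ICC the intra-user correlation. Multiply n by
# deff, or equivalently analyze with user-level cluster-robust SEs.
deff = 1.6                                            # assumed placeholder
print(f"n per arm with design effect:  {n * deff:,.0f}")
```

With DAU ≈ 200,000 over a 14-day window, roughly 47,000 users per arm (or about 75,000 after the illustrative design effect) should be attainable even at a partial ramp.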
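Sketch for (b): one way to operationalize a guardrail's non-inferiority threshold is a one-sided test against the pre-registered margin. The cancellation counts below are made-up placeholders.

```python
# One-sided non-inferiority check for the cancellation-rate guardrail.
# H0: diff >= margin (guardrail violated); H1: diff < margin (non-inferior).
from scipy.stats import norm

margin = 0.002                       # pre-registered +0.2 pp margin
x_t, n_t = 4_210, 120_000            # treatment cancellations / users (placeholder)
x_c, n_c = 4_080, 120_000            # control cancellations / users (placeholder)

p_t, p_c = x_t / n_t, x_c / n_c
diff = p_t - p_c
se = (p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c) ** 0.5

z = (diff - margin) / se
p_value = norm.cdf(z)                # small p -> conclude non-inferior
print(f"diff = {diff * 100:.2f} pp, z = {z:.2f}, one-sided p = {p_value:.4f}")
```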
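Sketch for (c): the CUPED estimator replaces the outcome Y with Y − θ(X − X̄), where X is the 28-day pre-period covariate and θ = cov(X, Y)/var(X); the mean is unchanged while the variance drops by roughly corr(X, Y)². The data below are synthetic stand-ins.

```python
# CUPED adjustment: y_cuped = y - theta * (x - mean(x)),
# with theta = cov(x, y) / var(x).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.12, n).astype(float)          # 28-day pre-period covariate
y = rng.binomial(1, 0.05 + 0.40 * x).astype(float)  # in-test outcome, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())                # same mean, lower variance

print(f"theta = {theta:.3f}")
print(f"variance reduction: {1 - y_cuped.var() / y.var():.1%}")
# Validation: apply the same adjustment to A/A data and placebo pre-period
# outcomes; the adjusted treatment-control gap should stay centered at zero
# while its confidence interval tightens.
```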
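Sketch for (d): the simplest guard against 48h attribution lag is to score only exposures whose attribution window has fully closed, so late-arriving conversions cannot bias recent cohorts toward zero. The file and column names (exposures.csv, exposure_ts, arm, converted) and the readout timestamp are assumptions.

```python
# Only analyze exposures at least 48h old, so the late-event window has
# closed for every unit included in the readout.
import pandas as pd

df = pd.read_csv("exposures.csv", parse_dates=["exposure_ts"])  # assumed schema
analysis_ts = pd.Timestamp("2024-06-15")                        # placeholder readout time

mature = df[df["exposure_ts"] <= analysis_ts - pd.Timedelta(hours=48)]
print(mature.groupby("arm")["converted"].mean())  # conversion by arm, mature cohorts only
```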
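Sketch for (e): a standard geo DiD regresses the city-day outcome on a treated×post indicator plus city and calendar-date fixed effects, clustering standard errors by city. The data file and column names (geo_daily.csv, city, date, treated_city, post, conv_rate) are assumptions.

```python
# Geo-level difference-in-differences with two-way fixed effects.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("geo_daily.csv")          # placeholder: one row per city-day
df["did"] = df["treated_city"] * df["post"]  # assumes 0/1 indicator columns

# City FE absorb level differences between cities; date FE absorb shared
# calendar shocks (weekday, seasonality). The DiD effect is the `did` coefficient.
model = smf.ols("conv_rate ~ did + C(city) + C(date)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["city"]}
)
print(model.params["did"], model.bse["did"])

# Pre-trend check: interact treated_city with each pre-period date
# (an event-study regression) and test that those leads are jointly ~0.
```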
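Sketch for (f): a Lan–DeMets O'Brien–Fleming-type spending function for interim looks, plus Benjamini–Hochberg FDR control across many guardrail p-values. The look fractions and p-values are placeholders.

```python
# O'Brien-Fleming-style alpha spending (Lan-DeMets approximation) plus
# Benjamini-Hochberg FDR across guardrail metrics.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

alpha = 0.05
info_frac = np.array([0.25, 0.50, 0.75, 1.00])   # information at each look
# Spending function: alpha(t) = 2 * (1 - Phi(z_{1-alpha/2} / sqrt(t))).
# Spends almost nothing early, nearly all alpha at the final look.
spent = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(info_frac)))
print("cumulative alpha spent per look:", np.round(spent, 4))

guardrail_p = [0.012, 0.340, 0.049, 0.003, 0.720]  # placeholder p-values
reject, p_adj, _, _ = multipletests(guardrail_p, alpha=0.05, method="fdr_bh")
print("BH-rejected guardrails:", reject, np.round(p_adj, 3))
```

The spending function controls the overall type I error under repeated looks, so an early stop (for harm or for overwhelming benefit) can be tied to crossing the corresponding boundary, with the rollback plan triggered automatically on a guardrail breach.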