Evaluate a New Home-Page Ranking Algorithm: 3-Stage Plan
Context
You are introducing a new ranking algorithm for the home page. You must validate it safely and rigorously using a staged approach:
1. Offline counterfactual replay using inverse propensity scoring (IPS) and doubly robust (DR) estimation.
2. Small-scale online interleaving (team-draft).
3. Full A/B experiment.
Be concrete about experiment unit, bucketing, sample size, guardrails, sequential testing, novelty/carryover/seasonality mitigation, ramp policy, proxy metrics and covariate adjustment, heterogeneous treatment effects (HTE) with multiple-testing control, and governance against p-hacking/Simpson’s paradox.
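Stage 1 (offline replay) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the record fields (`context`, `action`, `logging_prob`, `reward`) and function names are assumptions, and a real ranking system would replay logged slates rather than single actions.

```python
# Sketch of offline counterfactual replay with IPS and a doubly
# robust (DR) estimator. Field and function names are illustrative.

def ips_estimate(logs, new_policy_prob):
    """Inverse propensity scoring: reweight logged rewards by the ratio
    of new-policy to logging-policy action probabilities."""
    total = 0.0
    for rec in logs:
        w = new_policy_prob(rec["context"], rec["action"]) / rec["logging_prob"]
        total += w * rec["reward"]
    return total / len(logs)

def dr_estimate(logs, new_policy_prob, reward_model, actions):
    """Doubly robust: a direct-method term (reward model averaged over
    the new policy's action distribution) plus an IPS correction on the
    model's residual at the logged action."""
    total = 0.0
    for rec in logs:
        direct = sum(new_policy_prob(rec["context"], a) * reward_model(rec["context"], a)
                     for a in actions)
        w = new_policy_prob(rec["context"], rec["action"]) / rec["logging_prob"]
        total += direct + w * (rec["reward"] - reward_model(rec["context"], rec["action"]))
    return total / len(logs)
```

DR is preferable when logging propensities are noisy: if either the propensities or the reward model is correct, the estimate remains unbiased.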
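Stage 2 (team-draft interleaving) can be sketched as below: the two rankers alternate picks into one merged list, and clicks are credited to the team that contributed each result. Function names are illustrative, and the sketch omits production concerns such as deduplication policy and position-bias diagnostics.

```python
import random

def _next_unseen(ranking, seen):
    # First item in this ranker's list not yet placed in the merged list.
    for item in ranking:
        if item not in seen:
            return item
    return None

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team-draft merge: the team with fewer picks so far goes next
    (coin flip on ties); each team contributes its highest-ranked
    unseen item. Returns the merged list and per-position team labels."""
    merged, teams, seen = [], [], set()
    na = nb = 0
    while True:
        a_next = _next_unseen(ranking_a, seen)
        b_next = _next_unseen(ranking_b, seen)
        if a_next is None and b_next is None:
            break
        a_goes = b_next is None or (
            a_next is not None and (na < nb or (na == nb and rng.random() < 0.5)))
        if a_goes:
            merged.append(a_next); teams.append("A"); seen.add(a_next); na += 1
        else:
            merged.append(b_next); teams.append("B"); seen.add(b_next); nb += 1
    return merged, teams

def credit_clicks(teams, clicked_positions):
    """Attribute each clicked position to the team that placed it."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins
```

Per-query win counts from `credit_clicks` feed a paired test (e.g., a sign test over queries) on whether the new ranker wins more interleaved comparisons.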
Tasks
- Define the exposure unit (impression-level vs. session-level) and bucketing to avoid contamination across sessions/devices.
- Primary metric: 30-day funded-account conversion per 1,000 impressions. Baseline = 1.20%, target relative uplift = +5% (i.e., 1.26%), power = 0.8, alpha = 0.05. Compute the per-arm sample size assuming independent impressions, then discuss how repeated exposures per user inflate the required sample and why cluster-robust variance estimation is needed.
- List guardrails (p95 latency, app crash rate, CS tickets, decline rate) and how you will set sequential boundaries (e.g., alpha spending or SPRT) to allow early stopping without inflating the Type I error rate.
- Explain how to mitigate novelty effects, carryover, and seasonality; specify the ramp policy and a duration long enough to capture 30-day outcomes, while using proxy metrics for early reads with CUPED or other covariate adjustment.
- Describe heterogeneous treatment effect analysis (new vs. existing users, credit tiers) and how you will control false discoveries with the Benjamini–Hochberg (BH) or Holm procedures.
- Provide a plan to detect p-hacking and Simpson's paradox, and define ship criteria for cases where the primary metric and guardrails disagree.
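For the exposure-unit and bucketing task, deterministic hash-based assignment keeps a user (and, given identity resolution, all of their sessions and devices) in one arm across the whole experiment. A minimal sketch, assuming a stable `user_id` and an experiment-specific salt (both names are illustrative):

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str,
               treatment_share: float = 0.5) -> str:
    """Deterministically map a user to an arm. Salting with an
    experiment-specific string keeps assignments stable within an
    experiment but independent across experiments."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # approx. uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"
```

Bucketing on user rather than impression is what makes the cluster-robust variance discussion in the sample-size task necessary: impressions within a user are correlated.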
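The per-arm sample size in the primary-metric task follows from the standard two-proportion power formula (the "per 1,000 impressions" scaling does not change the required n in raw impressions). A sketch using only the standard library, assuming a two-sided test:

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_n_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Per-arm n for a two-sided two-proportion z-test,
    assuming independent observations."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

p1 = 0.012        # baseline conversion
p2 = p1 * 1.05    # +5% relative uplift -> 0.0126
n = two_proportion_n_per_arm(p1, p2)  # roughly 530k impressions per arm
```

With repeated exposures, this independent-impression n is then inflated by a design effect of roughly 1 + (m̄ − 1)ρ, where m̄ is the average impressions per user and ρ the intra-user correlation, consistent with the cluster-robust variance discussion the task asks for.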
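For the sequential-testing task, one concrete option for a continuously monitored Bernoulli stream (e.g., a guardrail rate) is Wald's SPRT. A sketch of the classic boundaries; a production setup might instead use a group-sequential alpha-spending design, which is not shown here:

```python
from math import log

def sprt_decision(successes, trials, p0, p1, alpha=0.05, beta=0.2):
    """Wald SPRT for H0: p = p0 vs H1: p = p1 (p1 > p0).
    Compare the log-likelihood ratio to fixed thresholds;
    returns 'accept_h1', 'accept_h0', or 'continue'."""
    upper = log((1 - beta) / alpha)   # crossing -> evidence for H1
    lower = log(beta / (1 - alpha))   # crossing -> evidence for H0
    llr = (successes * log(p1 / p0)
           + (trials - successes) * log((1 - p1) / (1 - p0)))
    if llr >= upper:
        return "accept_h1"
    if llr <= lower:
        return "accept_h0"
    return "continue"
```

Because the thresholds are fixed in advance, peeking at every new observation does not inflate the Type I error, which is exactly the property the early-stopping requirement asks for.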
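CUPED, mentioned in the early-reads task, reduces variance by regressing out a pre-experiment covariate. A minimal sketch of the textbook estimator theta = cov(Y, X) / var(X):

```python
from statistics import mean

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y_i - theta * (x_i - mean(x)),
    where x is a pre-experiment covariate (e.g., the same metric in the
    weeks before the experiment). The mean of y is preserved; its
    variance shrinks by the squared correlation between y and x."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    var = sum((xi - mx) ** 2 for xi in x) / len(x)
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]
```

Since the covariate is measured before assignment, the adjustment cannot be affected by treatment, so the treatment-effect estimate stays unbiased while confidence intervals tighten.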
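For the HTE task, the Benjamini–Hochberg procedure over segment-level p-values fits in a few lines. A sketch that marks which hypotheses are discoveries at FDR level q:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean list marking BH discoveries at FDR level q:
    sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject everything at or below p_(k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    threshold = 0.0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            threshold = p_values[i]
    return [p <= threshold for p in p_values]
```

BH controls the false discovery rate, which is usually the right target for exploratory segment cuts; Holm controls the stricter family-wise error rate and suits a small, pre-registered set of segments.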
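A simple automated screen for the Simpson's-paradox part of the governance task compares the pooled lift with segment-level lifts and flags sign disagreements. A sketch, assuming per-segment (conversions, exposures) counts for each arm; the data layout and function name are illustrative:

```python
def simpson_check(segments):
    """segments: dict name -> {'t': (conv, n), 'c': (conv, n)}.
    Returns the pooled treatment-minus-control lift and the segments
    whose lift has the opposite sign -- a Simpson's-paradox warning
    that calls for a segment-weighted (stratified) estimate."""
    tc = tn = cc = cn = 0
    for s in segments.values():
        tc += s["t"][0]; tn += s["t"][1]
        cc += s["c"][0]; cn += s["c"][1]
    pooled = tc / tn - cc / cn
    flags = []
    for name, s in segments.items():
        seg = s["t"][0] / s["t"][1] - s["c"][0] / s["c"][1]
        if pooled != 0 and seg * pooled < 0:
            flags.append(name)
    return pooled, flags
```

A flag here usually means the arms have different segment mixes, so the ship decision should rest on the stratified estimate rather than the pooled one.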