Scenario
You work on a search product and have built a new search ranking/retrieval algorithm (Variant B). The current algorithm is Variant A. You need to design an online experiment to decide whether to launch B.
Task
Design an A/B test plan that covers:
- **Goal & hypotheses**
  - What is the primary product goal (e.g., improved relevance, engagement, or long-term retention)?
  - State clear hypotheses (e.g., "B improves relevance without harming latency").
- **Experiment design**
  - Choose the **experimental unit** (user, device, session, query) and justify it.
  - Randomization approach (simple vs. stratified), and key stratification variables (e.g., locale, platform, query category).
  - Handling **interference/contamination** (e.g., cross-device users, cached results, shared accounts).
  - Duration and ramp plan (e.g., 1% → 10% → 50%), plus stopping rules.
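If the experimental unit is the user, assignment is commonly done by hashing a stable user ID with an experiment-specific salt, which keeps a user's variant consistent across sessions and devices and makes ramping straightforward. A minimal sketch (function and parameter names are illustrative, not from the source):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically bucket a user into control (A) or treatment (B).

    Hashing (experiment, user_id) keeps assignment stable across sessions
    and devices tied to the same account, and salting by experiment name
    keeps bucketing independent across concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # uniform in [0, 1)
    return "B" if bucket < treatment_pct else "A"

# Ramping treatment_pct up (0.01 -> 0.10 -> 0.50) keeps earlier treatment
# users in treatment, because a given user's hash value never changes.
```

One design consequence worth noting in the plan: because the hash is deterministic, a user exposed to B at 1% stays in B at 10% and 50%, which avoids re-randomizing users mid-experiment during the ramp.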
- **Metrics**
  - Propose a **primary metric** (one) and justify it.
  - Propose **diagnostic metrics** to understand *why* results change.
  - Propose **guardrail metrics** to prevent regressions.
  - Consider tradeoffs such as:
    - Short-term engagement vs. long-term user value
    - Relevance improvements vs. latency / cost
    - Click metrics vs. **good clicks** (dwell time, reformulation)
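One way to operationalize the "good clicks" idea is a session-level rate that counts only clicks with sufficient dwell time and no immediate query reformulation. A sketch with hypothetical data classes; the 30-second dwell threshold is an assumed heuristic, not a standard, and should be tuned to the product:

```python
from dataclasses import dataclass, field

@dataclass
class Click:
    dwell_seconds: float
    followed_by_reformulation: bool

@dataclass
class Session:
    clicks: list = field(default_factory=list)

def good_click_rate(sessions: list) -> float:
    """Share of sessions with at least one 'good' click: dwell >= 30s
    and not immediately followed by a query reformulation.
    """
    def is_good(c: Click) -> bool:
        return c.dwell_seconds >= 30 and not c.followed_by_reformulation

    good = sum(1 for s in sessions if any(is_good(c) for c in s.clicks))
    return good / len(sessions) if sessions else 0.0
```

A metric like this trades raw click volume for a signal closer to satisfied intent, which is exactly the click-metrics-vs.-good-clicks tension above.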
- **Power / sample size**
  - What inputs do you need to compute sample size (baseline rate, variance, MDE, alpha, power)?
  - How would you handle multiple comparisons if testing many metrics or segments?
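Given those inputs, a two-sided sample-size calculation for a binary metric (e.g., session success rate) needs only the standard library; the Bonferroni adjustment shown is one simple, conservative answer to the multiple-comparisons question. A sketch under those assumptions:

```python
from statistics import NormalDist

def sample_size_per_arm(p_baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80,
                        n_comparisons: int = 1) -> int:
    """Users per arm to detect an absolute lift of `mde_abs` on a
    binary metric with a two-sided test.

    Bonferroni correction: divide alpha across the number of
    primary comparisons (metrics or segments) being tested.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / n_comparisons) / 2)
    z_beta = z.inv_cdf(power)
    p_treat = p_baseline + mde_abs
    # Sum of Bernoulli variances under baseline and treated rates
    var = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    n = var * (z_alpha + z_beta) ** 2 / mde_abs ** 2
    return int(n) + 1
```

Plugging in a 30% baseline and a +1pp absolute MDE gives a per-arm requirement in the low tens of thousands of users; halving the MDE roughly quadruples the requirement, which is why the MDE choice dominates the duration plan.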
- **Analysis plan**
  - How will you compute treatment effects (difference in means/proportions; user-level aggregation)?
  - How will you check for **sample ratio mismatch (SRM)** and data quality issues?
  - What key segments would you examine (new vs. returning, head vs. tail queries), and how do you avoid p-hacking?
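An SRM check is typically a chi-square goodness-of-fit test on the observed assignment counts against the configured split. A minimal sketch using only the standard library (the p < 0.001 alert threshold is a common convention, not a rule):

```python
from math import erfc, sqrt

def srm_check(n_control: int, n_treatment: int,
              expected_treatment_ratio: float = 0.5) -> float:
    """Chi-square test (1 degree of freedom) for sample ratio mismatch.

    Returns the p-value of the observed split against the configured
    split. A very small p-value (e.g., p < 0.001) suggests an
    assignment or logging bug; the experiment's results should not
    be trusted until the mismatch is explained.
    """
    total = n_control + n_treatment
    exp_t = total * expected_treatment_ratio
    exp_c = total - exp_t
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    # Survival function of the chi-square distribution with 1 df
    return erfc(sqrt(chi2 / 2))
```

For example, 50,000 control vs. 50,500 treatment users under a 50/50 split yields a p-value well above 0.001 (no alarm), while 50,000 vs. 52,000 triggers the alert and should block any readout of treatment effects.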
- **Risks & pitfalls**
  - How do you address novelty effects, learning-to-rank feedback loops, or delayed outcomes?
  - What would make you decide *not* to trust the experiment result?
Output
Provide a structured experiment proposal (bulleted plan) including the final metric set and launch decision criteria.