A/B Test a New Feed-Ranking Algorithm
A social-media company wants to evaluate a new feed-ranking algorithm intended to increase daily active minutes per user.
Constraints & Assumptions
- Use a randomized A/B test when user-level randomization is available.
- If rollout is geography-based, discuss causal inference alternatives.
- Daily active minutes are likely skewed and may require variance reduction or robust checks.
- Cover hypothesis, metrics, sample size, experiment health, and interpretation of time-series behavior.
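One way to illustrate the robust check mentioned above for a skewed metric is to compare a bootstrap confidence interval on the raw means with one on winsorized values. This is a minimal sketch, not a prescribed method: the helper names (`upper_cap`, `winsorize`, `bootstrap_diff_ci`), the 99th-percentile cap, and the lognormal simulated data are all assumptions made for illustration.

```python
import random
import statistics


def upper_cap(values, quantile=0.99):
    """Return the empirical upper quantile used as the winsorization cap."""
    ordered = sorted(values)
    # Clamp the index so it always points at a valid element.
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]


def winsorize(values, cap):
    """Cap each value at `cap` (upper tail only)."""
    return [min(v, cap) for v in values]


def bootstrap_diff_ci(control, treatment, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Simulated, heavy-tailed "daily active minutes" (hypothetical parameters).
rng = random.Random(42)
control = [rng.lognormvariate(3.0, 1.0) for _ in range(1000)]
treatment = [rng.lognormvariate(3.05, 1.0) for _ in range(1000)]

# Compute the cap on pooled data so both arms are treated identically.
cap = upper_cap(control + treatment, 0.99)
raw_ci = bootstrap_diff_ci(control, treatment)
wins_ci = bootstrap_diff_ci(winsorize(control, cap), winsorize(treatment, cap))

print("raw CI:", raw_ci)
print("winsorized CI:", wins_ci)
```

The cap is computed on the pooled sample rather than per arm, so the transformation cannot itself create a difference between arms. If the raw and winsorized intervals disagree in sign or significance, a handful of extreme users is probably driving the result.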
Clarifying Questions to Ask
- Is the ranking change static during the experiment, or does it learn online?
- Is the primary metric minutes per active user-day, per assigned user, or per session?
- What minimum lift is worth shipping?
- Are there network effects or creator-side spillovers?
What a Strong Answer Covers
- Hypotheses for the A/B test and a primary metric such as daily active minutes per user.
- Guardrails: retention, DAU/MAU, crash rate, latency, negative feedback, content diversity, ads revenue, and integrity reports.
- Sample-size formula for two-sample mean comparison, using alpha, power, standard deviation, and minimal detectable effect.
- Discussion of practical significance, two-sided alpha (commonly 0.05), and power (commonly 80%).
- Experiment health checks: SRM, covariate balance, pre-period parallel trends, logging completeness, data loss, treatment exposure, and overlapping experiments.
- Diagnosis of a mid-test dip: outage, logging issue, traffic mix shift, release, novelty/fatigue, external event, or ramp change.
- Geo-based rollout analysis using difference-in-differences, event study, synthetic control, pre-trend checks, and cluster-robust uncertainty.
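Two of the items above lend themselves to a short worked sketch: the sample-size formula and the SRM (sample ratio mismatch) check. The code below assumes equal-sized arms, a common standard deviation, and the normal approximation, giving n per arm = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / MDE^2. The function names and the example numbers (30-minute standard deviation, 0.5-minute MDE) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, mde, alpha=0.05, power=0.80):
    """Users per arm for a two-sided, two-sample comparison of means.

    Assumes equal arm sizes, equal variance, and a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / mde ** 2)


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for SRM."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical planning inputs: sigma = 30 minutes, MDE = 0.5 minutes.
n_per_arm = sample_size_per_arm(sigma=30.0, mde=0.5)
print("users per arm:", n_per_arm)

# Hypothetical observed assignment counts under an intended 50/50 split.
print("SRM p-value:", srm_p_value(50_000, 51_000))
```

Because the required sample grows with sigma^2 / MDE^2, halving the MDE roughly quadruples the users needed. A very small SRM p-value suggests the assignment or logging pipeline is broken, in which case the metric comparison should not be trusted until the cause is found.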
Follow-up Questions
- How would you choose the MDE?
- What if average minutes rise but negative feedback also rises?
- Why might geography rollout violate parallel trends?
- How would CUPED reduce sample size?
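For the CUPED follow-up, a small simulation can make the mechanism concrete. CUPED subtracts the part of each user's outcome that is predicted by a pre-experiment covariate, which shrinks the metric's variance by a factor of about 1 - rho^2, where rho is the correlation between the covariate and the outcome. Required sample size shrinks by roughly the same factor. The sketch below is illustrative only: the helper names, the gamma-distributed pre-period minutes, and the 0.8 slope are assumptions, not data from a real experiment.

```python
import random
from statistics import fmean, variance


def cuped_theta(pre, post):
    """Slope of post on pre: cov(pre, post) / var(pre), on pooled data."""
    m_pre, m_post = fmean(pre), fmean(post)
    cov = sum((a - m_pre) * (b - m_post) for a, b in zip(pre, post))
    var = sum((a - m_pre) ** 2 for a in pre)
    return cov / var


def cuped_adjust(pre, post, theta, pre_mean):
    """Remove the component of the outcome explained by the pre-period value."""
    return [b - theta * (a - pre_mean) for a, b in zip(pre, post)]


# Simulated users (hypothetical parameters): pre-period minutes predict
# in-experiment minutes, and treatment adds a constant 1-minute lift.
rng = random.Random(7)
n = 5000
pre = [rng.gammavariate(2.0, 15.0) for _ in range(2 * n)]
in_treatment = [i >= n for i in range(2 * n)]
post = [
    0.8 * x + rng.gauss(0, 10) + (1.0 if t else 0.0)
    for x, t in zip(pre, in_treatment)
]

# Estimate theta on pooled data so the adjustment is identical in both arms.
theta = cuped_theta(pre, post)
adjusted = cuped_adjust(pre, post, theta, fmean(pre))

variance_ratio = variance(adjusted) / variance(post)
print("theta:", round(theta, 3))
print("variance ratio (adjusted / raw):", round(variance_ratio, 3))
```

The treatment effect estimate is then the difference in mean adjusted outcomes between arms. The covariate must be measured before assignment; a covariate that the treatment itself can influence would bias the estimate.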