Ranking A/B Tests And Online Evaluation

What's being tested

Interviewers are probing whether you can evaluate changes to ranking systems where user experience, business value, and statistical validity collide. For Meta, ranking changes in Feed, Reels, Search, Groups, Marketplace, or Ads can affect billions of impressions, so a “small” metric movement may be economically and socially meaningful. The skill is not just knowing A/B testing definitions; it is choosing the right randomization unit, metrics, guardrails, analysis window, and launch decision under interference, novelty effects, and heavy-tailed engagement data. A strong answer shows you can connect ranking-model performance to product outcomes without overclaiming causality from noisy online experiments.

Core knowledge

Ranking systems are usually multi-stage: candidate generation, lightweight filtering, heavy scoring, re-ranking, and policy/business-rule layers. A/B tests may target one stage only, so clarify whether the treatment changes recall, score calibration, diversity constraints, or final ordering.
Offline ranking metrics like NDCG, MAP, MRR, AUC, and log loss are useful for model iteration but are not launch metrics. For example, $NDCG@k = \frac{DCG@k}{IDCG@k}, \quad DCG@k=\sum_{i=1}^k \frac{rel_i}{\log_2(i+1)},$ where $rel_i$ is the relevance label of the item at position $i$ and $IDCG@k$ is the $DCG@k$ of the ideal ordering. These metrics depend on the labels and the logging policy, and may not predict long-term user value.
Online metrics should map to the product objective: Feed might use meaningful interactions, sessions, retention, hides/reports, dwell quality; Ads might use revenue, advertiser value, conversions, user ad load tolerance; Reels might use watch time, completion, follows, skips, and return rate. Always separate primary metrics from guardrails.
User-level randomization is usually preferred for ranking tests because impressions within a user are correlated and treatment changes future behavior. Impression-level randomization gives more samples but can contaminate experience, violate SUTVA, and bias long-term metrics like retention or creator ecosystem effects.
Core effect estimate: $\hat{\Delta}=\bar{Y}_T-\bar{Y}_C$ or relative lift $\frac{\bar{Y}_T-\bar{Y}_C}{\bar{Y}_C}.$ For ratio metrics like CTR = clicks/impressions, use the delta method, the bootstrap, or user-level aggregation rather than treating each impression as independent.
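Power depends on variance, detectable effect, and traffic. The approximate sample size per arm for a two-sided test is $n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2}{\delta^2},$ where $\sigma^2$ is the per-unit variance of the metric and $\delta$ is the minimum detectable absolute effect. Ranking metrics are often heavy-tailed; winsorization, log transforms, bootstrap CIs, or robust user-level metrics may be needed.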
CUPED and covariate adjustment can materially reduce variance when pre-period behavior predicts post-period behavior. CUPED uses $Y' = Y - \theta(X-\bar{X}), \quad \theta=\frac{Cov(Y,X)}{Var(X)}$ where $X$ is a pre-treatment metric. It improves sensitivity but does not fix biased randomization.
Always check experiment health before interpreting results: sample ratio mismatch, logging loss, bot/spam shifts, platform imbalance, treatment leakage, metric definition changes, and ramp anomalies. A significant lift is not credible if assignment or instrumentation is broken.
Ranking A/B tests often have interference: one user’s treatment can affect another user’s feed, creator distribution, marketplace liquidity, or auction prices. For strong interference, consider cluster randomization, geo experiments, switchback designs, or ecosystem-level guardrails, accepting lower power.
Short-term engagement can conflict with long-term value. A ranker may increase watch time by promoting clickbait, outrage, or addictive loops while hurting retention, surveys, hides, reports, creator health, or integrity metrics. Meta-style evaluation should include quality and safety guardrails, not just engagement.
Multiple testing is common because rankers affect many slices: country, platform, new users, power users, content type, creator tier. Use pre-registered primary metrics, false discovery rate control, Bonferroni/Holm when appropriate, and treat exploratory segment wins as hypotheses for follow-up tests.
Sequential monitoring needs explicit rules. Peeking every day at $p<0.05$ inflates false positives. Use alpha spending, group sequential designs, always-valid confidence sequences, or commit to fixed analysis windows with ramp stages such as 1%, 5%, 10%, 50%, 100%.
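
The power formula and the CUPED adjustment listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline. The function names, the synthetic pre/post data, and the chosen `sigma` and `delta` are assumptions made for the example; in a real test, $\theta$ would be estimated on pooled treatment and control users.

```python
import random
from statistics import NormalDist, fmean, variance


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users per arm: 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2


def cuped_adjust(y, x):
    """Return Y' = Y - theta * (X - mean(X)), with theta = Cov(Y, X) / Var(X)."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Synthetic data (an assumption for illustration): the pre-period metric
# predicts the post-period metric, which is when CUPED helps.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.8 * x + rng.gauss(0, 2) for x in pre]
adjusted = cuped_adjust(post, pre)

n_per_arm = sample_size_per_arm(sigma=1.0, delta=0.1)
print("users per arm:", round(n_per_arm))
print("variance before CUPED:", round(variance(post), 2))
print("variance after CUPED:", round(variance(adjusted), 2))
```

Because the required sample size scales with $\sigma^2$, any variance reduction from CUPED translates directly into fewer users or a shorter test for the same detectable effect. The adjustment leaves the mean of the metric unchanged, so the effect estimate keeps its interpretation.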

Worked example

Evaluate a News Feed ranking change

In the first 30 seconds, frame the problem by asking what changed: candidate retrieval, model score, re-ranker, or business rule; what the intended product goal is; and whether the change affects all users, a market, or a content type. Then declare that you would run a user-level randomized A/B test because Feed ranking changes alter repeated user experiences and future behavior, making impression-level independence unrealistic. The answer can be organized around four pillars: experiment design, metric framework, statistical analysis, and launch decision.

For design, specify treatment/control allocation, ramp plan, minimum duration covering weekday/weekend cycles, and health checks such as sample ratio mismatch and logging parity. For metrics, choose a primary metric aligned with value, such as meaningful interactions per user or quality-adjusted engagement, plus guardrails like hides, reports, session frequency, retention, survey satisfaction, and integrity violations. For analysis, aggregate at user level, report absolute and relative lift with confidence intervals, use CUPED if pre-period engagement is available, and slice by platform, country, and user tenure only after confirming the overall result. A key tradeoff to flag is that optimizing short-term engagement may harm long-term satisfaction, so you would not launch on watch time alone if negative-feedback or retention guardrails degrade. Close by saying that if you had more time, you would examine heterogeneous effects, creator/content ecosystem impact, and whether offline NDCG improvements actually correlate with online user-value gains.
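
The health check and the user-level lift estimate described above could look roughly like the following sketch. It uses only the Python standard library. The function names and the toy numbers are assumptions for illustration, and the confidence interval is a plain normal approximation on user-level values; heavy-tailed metrics might call for winsorization or a bootstrap instead.

```python
import math
from statistics import NormalDist, fmean, variance


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square test (1 d.o.f.) that the observed split matches the planned split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # With 1 degree of freedom, P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
    return 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))


def lift_with_ci(control, treatment, alpha=0.05):
    """Difference in user-level means with a normal-approximation CI."""
    mean_c, mean_t = fmean(control), fmean(treatment)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = mean_t - mean_c
    return {
        "abs_lift": diff,
        "ci": (diff - z * se, diff + z * se),
        "rel_lift": diff / mean_c,  # point estimate only; its CI needs the delta method
    }


# Toy inputs (hypothetical): one value per user, already aggregated.
control_users = [1.0, 2.0, 3.0, 4.0]
treatment_users = [2.0, 3.0, 4.0, 5.0]
result = lift_with_ci(control_users, treatment_users)
print(result)
print("SRM p-value, balanced split:", srm_p_value(5000, 5000))
print("SRM p-value, skewed split:", srm_p_value(50000, 51000))
```

A very small SRM p-value suggests that assignment or logging is broken, in which case the lift estimate should not be interpreted until the cause is found.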

A second angle

Evaluate a new Ads ranking model

The same evaluation logic applies, but the objective and interference constraints differ because ads ranking sits inside an auction. You would still prefer randomized online testing and user-level aggregation, but metrics now include advertiser value, conversions, revenue, click-through rate, conversion rate, cost per action, ad relevance, and user-experience guardrails such as ad hides or session impact. A major difference is that treatment can affect auction prices and advertiser budgets, so interference across users and advertisers is more serious than in a simple content-ranking test. You may need advertiser-level or geo-level analyses, budget pacing checks, and careful interpretation of revenue lift if the treatment merely shifts spend across campaigns rather than creating incremental value.

Common pitfalls

Analytical mistake: treating impressions as independent. A tempting answer is “we have millions of impressions, so the test will be highly powered.” In ranking systems, impressions from the same user are correlated, and treatment can change how many impressions the user generates; aggregate to the randomization unit or use cluster-robust methods.
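
To make the point concrete, the sketch below compares a naive impression-level standard error for CTR with a delta-method standard error computed at the user level. The data are synthetic and the function names are assumptions for illustration; each simulated user has their own click propensity, which is what creates within-user correlation.

```python
import math
import random
from statistics import fmean, variance


def delta_method_ctr(clicks, impressions):
    """CTR = sum(clicks) / sum(impressions), with a user-level delta-method SE."""
    n = len(clicks)
    mean_m = fmean(impressions)
    ratio = sum(clicks) / sum(impressions)
    # Linearize the ratio: each user contributes (c_i - R * m_i) / mean(m).
    linearized = [(c - ratio * m) / mean_m for c, m in zip(clicks, impressions)]
    return ratio, math.sqrt(variance(linearized) / n)


def naive_impression_se(clicks, impressions):
    """Binomial SE that wrongly treats every impression as independent."""
    total = sum(impressions)
    ratio = sum(clicks) / total
    return math.sqrt(ratio * (1 - ratio) / total)


# Synthetic users (an assumption): heterogeneous click propensity per user.
rng = random.Random(1)
impressions, clicks = [], []
for _ in range(1000):
    m = rng.randint(1, 100)
    p = rng.betavariate(2, 18)  # mean propensity of about 0.1
    impressions.append(m)
    clicks.append(sum(rng.random() < p for _ in range(m)))

ctr, delta_se = delta_method_ctr(clicks, impressions)
naive_se = naive_impression_se(clicks, impressions)
print("CTR:", round(ctr, 4))
print("naive impression-level SE:", round(naive_se, 5))
print("user-level delta-method SE:", round(delta_se, 5))
```

In this simulation the user-level standard error comes out larger than the naive one, which is why impression-level analysis tends to overstate significance.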

Communication mistake: jumping straight to p-values. Saying “launch if the p-value is below 0.05” sounds statistically aware but product-naive. A better answer defines the decision rule around practical significance, guardrails, experiment health, confidence intervals, and whether the observed lift is consistent across critical user segments.
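
One way to make such a decision rule explicit is sketched below. The `MetricResult` type, the thresholds, and the returned messages are all hypothetical; the point is only that health checks and guardrails are evaluated before the primary metric, and that the primary metric is judged against a practical threshold rather than zero.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    ci_low: float   # lower bound of the relative-lift confidence interval
    ci_high: float  # upper bound of the relative-lift confidence interval
    higher_is_better: bool = True


def launch_decision(primary, guardrails, health_ok,
                    min_practical_lift, guardrail_tolerance):
    """Return a launch recommendation from CIs, guardrails, and health checks."""
    if not health_ok:
        return "investigate: fix experiment health before reading metrics"
    for g in guardrails:
        # Worst plausible harm: the CI edge on the bad side for this metric.
        harm = -g.ci_low if g.higher_is_better else g.ci_high
        if harm > guardrail_tolerance:
            return f"hold: cannot rule out a regression on {g.name}"
    if primary.ci_low >= min_practical_lift:
        return "launch"
    if primary.ci_high < min_practical_lift:
        return "do not launch: lift is below the practical threshold"
    return "inconclusive: extend the test or increase power"


# Hypothetical numbers for illustration only.
primary = MetricResult("meaningful_interactions", 0.004, 0.012)
hides = MetricResult("hides", -0.002, 0.003, higher_is_better=False)
decision = launch_decision(primary, [hides], health_ok=True,
                           min_practical_lift=0.002, guardrail_tolerance=0.005)
print(decision)
```

In an interview, stating the thresholds up front matters more than the exact values: it shows the decision was not fitted to the observed result.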

Depth mistake: optimizing only the obvious engagement metric. For Feed or Reels, “increase CTR/watch time” is not sufficient and can even be harmful. Strong candidates discuss long-term retention, satisfaction surveys, negative feedback, integrity, diversity, creator effects, and whether the model is exploiting label bias or clickbait.

Connections

Expect pivots into causal inference, especially interference/SUTVA violations, heterogeneous treatment effects, and long-term treatment effects. Interviewers may also move toward metric design, power analysis, sequential testing, counterfactual evaluation for ranking systems, or marketplace/auction experimentation.