Ranking/Retrieval A/B Testing At Meta

What's being tested

Interviewers are probing whether you can design and interpret experiments for ranking and retrieval systems where the product surface is personalized, multi-objective, and highly sensitive to metric choice. The core skill is not reciting A/B testing definitions; it is knowing how to connect model changes to user experience, business outcomes, and system constraints. At Meta, ranking changes affect Feed, Reels, Stories, Ads, Search, Marketplace, Notifications, and recommendations, so a Data Scientist must reason about engagement, quality, integrity, creator ecosystem effects, and long-term retention. Strong answers show you can separate offline model quality from online causal impact, choose the right unit of randomization, detect biased logging or exposure issues, and make a launch recommendation under uncertainty.

Core knowledge

Ranking systems usually have at least two stages: retrieval (candidate generation) and ranking. Retrieval optimizes coverage and latency, often by embedding users and items with a two-tower model and serving candidates through approximate nearest neighbor search, for example HNSW or IVF-PQ indexes in a library such as FAISS. Ranking then scores a smaller candidate set with heavier models, e.g. gradient-boosted trees, DNNs, or multi-task neural networks.
Offline ranking metrics are useful but insufficient. Common metrics include AUC, log loss, NDCG@$k$, MAP, MRR, recall@$k$, and calibration error. For ranking, with $rel_i$ the relevance label of the item at position $i$ and $IDCG@k$ the DCG of the ideal ordering:
$DCG@k=\sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}, \quad NDCG@k=\frac{DCG@k}{IDCG@k}$
Offline gains often fail online due to feedback loops, position bias, logging policy differences, and distribution shift.
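The NDCG definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production metric: the relevance labels are hypothetical, and the ideal ordering is computed only from the labels in the list itself, which assumes every relevant item was scored.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k for relevance labels given in ranked order (position 1 first)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of this ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels for one ranked list (assumption, not real data).
ranked_relevances = [3, 2, 0, 1]
print(ndcg_at_k(ranked_relevances, k=4))
```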
Retrieval experiments should distinguish “better candidates” from “better ranked outcomes.” Candidate generators are often evaluated by recall@$K$ against a stronger teacher/ranker or historical positives:
$Recall@K=\frac{\#\text{relevant items retrieved in top }K}{\#\text{relevant items}}$
But online impact depends on whether the final ranker can use those candidates and whether latency or diversity worsens.
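As a rough sketch of the recall@$K$ formula, the function below compares a retriever's top-$K$ output against a set of relevant items. The item IDs and the choice of what counts as relevant (teacher-ranker picks or historical positives) are assumptions for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # undefined in principle; returning 0.0 is a convention here
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical IDs: retriever output in ranked order, and historical positives.
retrieved = ["a", "b", "c", "d"]
positives = {"b", "d", "z"}
print(recall_at_k(retrieved, positives, k=2))
print(recall_at_k(retrieved, positives, k=4))
```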
Primary metrics should map to the product goal: session value, meaningful interactions, watch time, click-through rate, conversion rate, retention, or revenue. At Meta, a ranking test often needs guardrails for hide/report rate, negative feedback, integrity prevalence, latency, crash rate, ad load, creator distribution, and user-level long-term outcomes.
Randomization unit matters. User-level randomization is standard for personalized ranking because impressions within a user are correlated. Item-level randomization can contaminate treatment if the same post/reel/ad appears to both treatment and control users. Cluster-level or geo-level designs may be needed when network effects or marketplace liquidity create interference.
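One common way to implement user-level randomization is deterministic hashing, so a user sees the same arm on every request. The sketch below is an assumption about how such bucketing could look, not a description of any particular company's system; `assign_arm` and the salt string are hypothetical names.

```python
import hashlib

def assign_arm(user_id, experiment_salt, treatment_fraction=0.5):
    """Deterministically map a user to an arm using a salted hash."""
    key = f"{experiment_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_fraction else "control"

# The same user always lands in the same arm for a given experiment salt.
print(assign_arm("user_123", "feed_ranker_test"))
```

Using a different salt per experiment keeps assignments in concurrent experiments approximately independent of each other.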
The basic treatment effect estimate is:
$\hat{\Delta}=\bar{Y}_T-\bar{Y}_C$
with standard error
$SE(\hat{\Delta})=\sqrt{\frac{s_T^2}{n_T}+\frac{s_C^2}{n_C}}$
For heavy-tailed metrics like watch time or revenue, aggregate to the user level first and then apply winsorization or the bootstrap before inference. For ratio metrics such as CTR, where the denominator is itself random, use the delta method.
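A minimal sketch of the difference-in-means estimate with its standard error, applied to user-level totals after winsorization. The cap is taken from the pooled sample so both arms share one threshold; that choice, the quantile, and the toy numbers are assumptions for illustration.

```python
import math
import statistics

def pooled_cap(treatment, control, upper_quantile=0.99):
    """Winsorization cap from the pooled sample, shared by both arms."""
    ordered = sorted(treatment + control)
    index = min(len(ordered) - 1, int(upper_quantile * len(ordered)))
    return ordered[index]

def winsorize(values, cap):
    """Replace values above the cap with the cap (one-sided winsorization)."""
    return [min(v, cap) for v in values]

def diff_in_means(treatment, control):
    """Return (delta_hat, standard_error) using unpooled sample variances."""
    delta = statistics.fmean(treatment) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(control) / len(control))
    return delta, se

# Hypothetical user-level watch-time totals; one extreme user in treatment.
treatment = [1.0, 2.0, 3.0, 4.0, 100.0]
control = [1.0, 2.0, 3.0, 4.0, 5.0]
cap = pooled_cap(treatment, control, upper_quantile=0.85)
print(diff_in_means(treatment, control))
print(diff_in_means(winsorize(treatment, cap), winsorize(control, cap)))
```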
Ranking experiments are vulnerable to sample ratio mismatch, logging bugs, and exposure bias. Always check assignment balance, eligibility rates, impression volume, latency, missing events, and whether treatment changes who becomes observable. A treatment that retrieves fewer candidates may reduce measured negative feedback simply because fewer items are shown.
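A sample ratio mismatch check can be sketched as a one-degree-of-freedom chi-square test on the assignment counts. The counts below are hypothetical, and the alert threshold is a team convention (an assumption here), often set much stricter than 0.05 because the check runs on every experiment.

```python
import math

def srm_p_value(n_treatment, n_control, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm assignment split."""
    total = n_treatment + n_control
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical assignment counts for a 50/50 experiment.
print(srm_p_value(50_000, 50_100))  # small imbalance
print(srm_p_value(50_000, 51_000))  # larger imbalance, worth investigating
```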
Multiple metrics create false positives. If testing many slices, surfaces, or objectives, control error via pre-registered primary metrics, Benjamini-Hochberg FDR, Bonferroni/Holm for stricter control, or hierarchical decision rules. Avoid “metric shopping” after seeing the dashboard.
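For reference, the Benjamini-Hochberg step-up procedure mentioned above can be sketched as follows. The p-values are hypothetical, standing in for one test per slice or secondary metric.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value is at or below (rank / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from five slice-level tests.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes a later one.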
Use variance reduction when possible. CUPED adjusts outcomes using pre-experiment covariates:
$Y_i^{adj}=Y_i-\theta(X_i-\bar{X}), \quad \theta=\frac{Cov(Y,X)}{Var(X)}$
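CUPED works best when the pre-experiment covariate, typically the same metric measured before assignment, is strongly correlated with the outcome: the variance shrinks by a factor of roughly $1-\rho^2$, where $\rho$ is their correlation. It is less effective for new users or novel surfaces with sparse histories.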
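The CUPED adjustment above can be sketched directly from the formula. In practice $\theta$ and $\bar{X}$ would be estimated on the pooled treatment and control data so both arms receive the same adjustment; the numbers below are hypothetical.

```python
import statistics

def cuped_adjust(outcomes, covariates):
    """Return (adjusted_outcomes, theta) for Y_adj = Y - theta * (X - mean(X))."""
    x_bar = statistics.fmean(covariates)
    y_bar = statistics.fmean(outcomes)
    n = len(outcomes)
    cov_xy = sum((x - x_bar) * (y - y_bar)
                 for x, y in zip(covariates, outcomes)) / (n - 1)
    theta = cov_xy / statistics.variance(covariates)
    adjusted = [y - theta * (x - x_bar) for x, y in zip(covariates, outcomes)]
    return adjusted, theta

# Hypothetical per-user data: pre-period engagement (X) and in-experiment metric (Y).
pre_period = [1.0, 2.0, 3.0, 4.0, 5.0]
in_experiment = [2.1, 3.9, 6.2, 7.8, 10.0]
adjusted, theta = cuped_adjust(in_experiment, pre_period)
print(statistics.variance(in_experiment), statistics.variance(adjusted))
```

The adjustment leaves the mean unchanged and only removes variance explained by the covariate, so the treatment effect estimate stays unbiased as long as the covariate is measured before assignment.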
Sequential monitoring requires discipline. Peeking daily and launching on the first significant result inflates Type I error. Use fixed-horizon analysis, alpha-spending, group sequential methods, or always-valid confidence sequences. Many Meta-style experiments ramp from 1% to 5% to 50% to 100% while watching pre-defined guardrails.
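To illustrate alpha-spending, the sketch below evaluates the O'Brien-Fleming-type spending function of Lan and DeMets, $\alpha(t)=2-2\Phi(z_{\alpha/2}/\sqrt{t})$, where $t$ is the fraction of planned information (for example, planned sample size) observed so far. It returns only the cumulative alpha available at each look; turning that into per-look critical values requires numerical integration, which is omitted here. The look schedule is a hypothetical example.

```python
from statistics import NormalDist

def obf_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent at a given information fraction in (0, 1]."""
    standard_normal = NormalDist()
    z = standard_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - standard_normal.cdf(z / information_fraction ** 0.5))

# Hypothetical schedule: looks at 25%, 50%, 75%, and 100% of planned sample size.
for t in (0.25, 0.5, 0.75, 1.0):
    print(t, obf_alpha_spent(t))
```

Very little alpha is available at early looks, which is why this family makes early stopping hard and preserves most of the power for the final analysis.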
Ranking changes can produce ecosystem and long-term effects not captured in short tests. A Reels ranking change may improve viewer watch time while harming creator diversity; a Feed change may increase comments but also conflict or reports. Track distributional metrics across user cohorts, creators, content types, countries, and new versus tenured users.
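One possible way to quantify creator concentration is a Herfindahl-Hirschman index over impression shares, computed per arm. This is only one candidate distributional metric, chosen here for illustration; the creator IDs are hypothetical.

```python
from collections import Counter

def impression_hhi(impression_creator_ids):
    """Sum of squared impression shares: 1/n when even across n creators, 1.0 for one creator."""
    counts = Counter(impression_creator_ids)
    total = sum(counts.values())
    return sum((count / total) ** 2 for count in counts.values())

# Hypothetical impression logs, one creator ID per impression.
control_impressions = ["a", "b", "c", "d", "a", "b", "c", "d"]
treatment_impressions = ["a", "a", "a", "a", "a", "b", "c", "d"]
print(impression_hhi(control_impressions))
print(impression_hhi(treatment_impressions))
```

A higher value in treatment than in control would suggest impressions are concentrating on fewer creators, which is worth inspecting even when viewer-side metrics improve.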
Practical launch decisions combine statistics, product judgment, and engineering risk. A tiny statistically significant CTR lift may not justify added ranking latency, infra cost, or integrity risk. Conversely, a neutral short-term engagement result may launch if it improves quality, reduces harm, or supports a strategic objective with strong guardrails.

Worked example

How would you A/B test a new Feed ranking algorithm?

A strong candidate should first clarify the scope: is this a replacement for the final ranker, a new feature in the ranker, or a reweighting of existing objectives? They should ask what the intended goal is (more meaningful social interactions, time spent, retention, reduced negative feedback, or some multi-objective value function) and whether the change affects all Feed inventory or only a subset such as friend posts, Groups, or recommended content. The answer can be organized around five pillars: experiment design, metric framework, instrumentation and data quality, statistical analysis, and launch decision.

For design, propose user-level randomization among eligible Feed users, with a ramp plan starting small to detect severe regressions in latency, crash rate, hide/report rate, or integrity prevalence. For metrics, choose one primary success metric tied to the stated goal, such as meaningful interactions per daily active user or long-term session value, plus guardrails like negative feedback rate, content diversity, notification opens, retention, and server-side latency. For analysis, aggregate to the user level, use pre-period engagement for CUPED if available, and segment by new users, heavy users, country, device class, and content source.

One explicit tradeoff to flag is engagement versus quality: optimizing for clicks or comments can over-rank sensational content, so the launch decision should not rely only on short-term engagement. The candidate should also mention interference: Feed is social, so one user’s treatment may affect what friends see or how much they post, which means the stable unit treatment value assumption (SUTVA) holds only approximately. A good close would be: “If I had more time, I would validate offline replay results against online outcomes, run longer-term holdouts for retention and ecosystem effects, and inspect model behavior on sensitive slices before full launch.”

A second angle

How would you test a new retrieval candidate generator for Reels?

The same experimentation logic applies, but the constraints shift toward candidate coverage, freshness, latency, and downstream ranker compatibility. Here, offline evaluation is more central before launch: compare recall@$K$, diversity, embedding-neighbor quality, and overlap with the existing generator, then run an online A/B test where the new retriever contributes candidates into the same ranking stack. The primary metric might be viewer value or watch time per user, but guardrails should include skip rate, “not interested” rate, integrity violations, creator concentration, cold-start performance, and p95/p99 retrieval latency. A key design decision is whether to fully replace the old retriever or interleave/add the new source; adding it is safer but makes attribution harder because the final ranker may suppress or amplify its candidates.

Common pitfalls

Analytical mistake: treating offline ranking wins as causal product wins. A tempting answer is, “The model has higher NDCG, so we should launch it.” A stronger answer says offline metrics are a filter, not proof; online A/B testing is needed because logging policy, position bias, candidate set changes, and user feedback loops can reverse apparent gains.

Communication mistake: listing every possible metric without a decision framework. Interviewers do not want a dashboard dump like “CTR, DAU, retention, watch time, likes, shares, comments, reports, revenue.” They want one primary metric, a small set of guardrails, and an explanation of how you would decide when metrics conflict.

Depth mistake: ignoring unit of analysis and interference. Saying “randomize impressions” may sound precise, but it is usually wrong for personalized ranking because user experiences are correlated across sessions and impressions. User-level randomization with user-level aggregation is typically safer, while social-network spillovers may require cluster-level randomization, longer holdouts, or explicit caveats.
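To make the unit-of-analysis point concrete, the sketch below computes a ratio metric such as CTR from user-level totals and uses the delta method for its standard error, treating users rather than impressions as the independent units. The per-user counts are hypothetical.

```python
import math
import statistics

def ratio_metric_se(numerators, denominators):
    """Return (ratio, standard_error) for sum(num) / sum(den) via the delta method.

    Each list holds one total per user, e.g. clicks and impressions.
    """
    n = len(numerators)
    mean_y = statistics.fmean(numerators)
    mean_x = statistics.fmean(denominators)
    ratio = mean_y / mean_x
    var_y = statistics.variance(numerators)
    var_x = statistics.variance(denominators)
    cov_xy = sum((y - mean_y) * (x - mean_x)
                 for y, x in zip(numerators, denominators)) / (n - 1)
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
    return ratio, math.sqrt(max(var_ratio, 0.0))  # guard against float round-off

# Hypothetical per-user clicks and impressions for one arm.
clicks = [1, 0, 5, 2]
impressions = [10, 12, 40, 18]
print(ratio_metric_se(clicks, impressions))
```

Treating each impression as an independent Bernoulli trial would ignore within-user correlation and typically understate the standard error.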

Connections

Interviewers may pivot from this topic into metric design, causal inference under interference, recommender-system evaluation, marketplace experimentation, or sequential testing. Be ready to discuss position bias correction, inverse propensity weighting, CUPED, heterogeneous treatment effects, novelty effects, and long-term holdout experiments.