ML Model Evaluation, Metrics, And Experimentation

What's being tested

Interviewers are probing whether you can design evaluation and experimentation loops for ML systems whose success cannot be judged by offline accuracy alone. For recommendation, nearby ranking, notification ranking, and multimodal generation, Meta cares about whether an MLE can connect training objectives, offline metrics, online A/B metrics, guardrails, and monitoring into one coherent launch process. You are expected to know how to choose metrics that match user value, detect offline/online mismatch, reason about bias in logged data, and safely iterate models in production. The strongest answers do not just say “run an A/B test”; they explain what is measured before launch, during ramp, and after deployment, once drift begins.

Core knowledge

Offline evaluation is necessary but insufficient. For ranking systems, common metrics include `AUC`, `log loss`, `NDCG@K`, `MAP@K`, `MRR`, `Recall@K`, and calibration error. `AUC` measures pairwise ordering globally, while `NDCG@K` emphasizes top-ranked items, which is usually closer to feed, notification, and place recommendation behavior.
Ranking metrics should match the product surface. For a notification ranker, `precision@1`, expected click/open probability, hide/report rate, and send suppression quality matter more than long-list `Recall@100`. For nearby places, `NDCG@K` may need distance-aware gains such as relevance discounted by travel distance or freshness.
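A minimal sketch of `NDCG@K` with an optional distance-aware gain, assuming graded relevance labels and the linear-gain form of DCG. The `distance_aware_gain` function, its `scale_km` parameter, and the sample data are hypothetical illustrations, not a production definition.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains, in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def distance_aware_gain(relevance, distance_km, scale_km=2.0):
    """Hypothetical gain: relevance discounted by travel distance."""
    return relevance / (1.0 + distance_km / scale_km)

# Items in the order the ranker returned them: (relevance label, distance in km).
ranked = [(1, 8.0), (3, 0.5), (0, 1.0), (2, 2.0)]
plain = ndcg_at_k([rel for rel, _ in ranked], k=3)
aware = ndcg_at_k([distance_aware_gain(rel, km) for rel, km in ranked], k=3)
```

Comparing `plain` with `aware` shows how much of the measured quality depends on whether far-away places are treated as equally valuable.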
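Calibration matters when scores drive thresholds, auctions, notification sends, or multi-objective ranking. A model with good ranking may still be poorly calibrated. Compare predicted probability with the empirical rate in each bin, and summarize the gap with expected calibration error:
$ECE = \sum_{b=1}^{B} \frac{|S_b|}{n} |\text{acc}(S_b) - \text{conf}(S_b)|$
where $S_b$ is the set of examples in bin $b$, $n$ is the total number of examples, $\text{acc}(S_b)$ is the empirical positive rate in the bin, and $\text{conf}(S_b)$ is the mean predicted probability in the bin. Consider `Platt scaling`, isotonic regression, or temperature scaling when appropriate.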
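The ECE formula above can be sketched in a few lines, assuming binary labels, equal-width probability bins, and the empirical positive rate standing in for per-bin accuracy. The function name and the choice of ten bins are illustrative assumptions.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Equal-width-bin ECE for binary labels (0/1) and predicted probabilities."""
    n = len(labels)
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, label sum, prob sum
    for y, p in zip(labels, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b][0] += 1
        bins[b][1] += y
        bins[b][2] += p
    ece = 0.0
    for count, label_sum, prob_sum in bins:
        if count:
            # Bin weight times |empirical rate - mean predicted probability|.
            ece += (count / n) * abs(label_sum / count - prob_sum / count)
    return ece

# A model that predicts 0.9 for examples that are never positive is badly calibrated.
overconfident = expected_calibration_error([0] * 10, [0.9] * 10)
# A model that predicts 0.25 for a group with a 25% positive rate is well calibrated.
calibrated = expected_calibration_error([1, 0, 0, 0], [0.25] * 4)
```

Equal-width bins can leave most bins empty when scores cluster near zero, as click probabilities often do, so equal-frequency bins may be a better choice in practice.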
Counterfactual bias appears because training labels come from previously ranked or filtered items. If the old system never showed a place, you do not observe whether the user would have clicked it. Common mitigations include randomized exploration buckets, inverse propensity scoring, doubly robust estimators, and careful comparison against the serving policy that generated the logs.
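As a rough illustration of inverse propensity scoring, the sketch below estimates the mean reward of a new policy from logs that record the probability with which the old policy chose each action. The log layout, the `target_prob` callback, and the clipping value are assumptions made for the example.

```python
def ips_estimate(logs, target_prob, clip=10.0):
    """Clipped inverse propensity estimate of the target policy's mean reward.

    logs: iterable of (context, action, reward, logging_propensity).
    target_prob: function (context, action) -> probability that the new
                 policy would take that action (hypothetical interface).
    """
    total, n = 0.0, 0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = min(target_prob(context, action) / propensity, clip)
        total += weight * reward
        n += 1
    return total / n if n else 0.0

# Logging policy: show place "A" or "B" uniformly at random (propensity 0.5).
logs = [(None, "A", 1, 0.5), (None, "B", 0, 0.5),
        (None, "A", 1, 0.5), (None, "B", 0, 0.5)]

def always_a(context, action):
    """Target policy that always shows place "A"."""
    return 1.0 if action == "A" else 0.0

naive_mean = sum(reward for _, _, reward, _ in logs) / len(logs)
ips_mean = ips_estimate(logs, always_a)
```

The naive logged average describes the old policy, while the reweighted estimate describes the new one. Clipping trades some bias for lower variance when propensities are small.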
Online experimentation should define one or two primary metrics, several secondary metrics, and hard guardrails. For recommenders, primary metrics might be meaningful engagement or long-term retention; guardrails often include `p95`/`p99` latency, notification opt-outs, hides, reports, integrity violations, battery/network usage, and diversity or fairness constraints.
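A/B test power depends on variance, baseline rate, minimum detectable effect, and traffic. A simplified intuition for the required sample size per arm is:
$n \propto \frac{\sigma^2 (z_{\alpha/2} + z_\beta)^2}{\Delta^2}$
where $\sigma^2$ is the metric variance, $\Delta$ is the minimum detectable effect, and $z_{\alpha/2}$ and $z_\beta$ are the standard normal critical values for the significance level $\alpha$ and the type II error rate $\beta$ (power $1 - \beta$). Small lifts in rare events like place visits or notification disables may require large samples or proxy metrics, but proxies must be validated.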
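To make the proportionality concrete, here is a back-of-the-envelope calculation, assuming a binary metric, equal-size arms, and a two-sided two-proportion z-test. The function name and default values are illustrative, not a prescribed procedure.

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    delta = p2 - p1                                # minimum detectable effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance critical value
    z_beta = NormalDist().inv_cdf(power)           # power critical value
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # summed Bernoulli variances
    return ceil(variance * (z_alpha + z_beta) ** 2 / delta ** 2)

n_common = samples_per_arm(0.10, 0.05)       # 10% baseline, 5% relative lift
n_half_lift = samples_per_arm(0.10, 0.025)   # same baseline, half the lift
n_rare = samples_per_arm(0.01, 0.05)         # rare event, same relative lift
```

Halving the detectable lift roughly quadruples the required sample, and a rarer baseline event needs far more traffic for the same relative lift.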
Sequential testing and peeking can inflate false positives. In production ramps, metrics are often checked daily, but launch decisions should respect preplanned analysis windows, alpha spending, or sequential methods. A candidate should say “I would monitor guardrails continuously, but judge success on a prespecified window.”
Offline/online parity is an MLE responsibility. Training-time feature definitions, timestamp joins, normalization, embedding versions, missing-value handling, and model preprocessing should match serving. A small mismatch can produce a model that looks strong offline but regresses online, especially with real-time context like location, time of day, or session intent.
Slice-based evaluation catches average-metric failures. Break down quality by geography, language, device class, new versus mature users, cold-start items, low-connectivity regions, high-activity users, and sensitive integrity segments. A recommender can improve global `NDCG@10` while harming new users or causing over-notification for a small cohort.
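A small sketch of slice-level comparison, assuming each evaluation row carries a slice label and a per-request metric value. The row layout, field names, and numbers are hypothetical, and the comparison ignores statistical significance for brevity.

```python
from collections import defaultdict

def metric_by_slice(rows, slice_key, metric_key):
    """Mean of a per-request metric grouped by a slice column."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = sums[row[slice_key]]
        acc[0] += row[metric_key]
        acc[1] += 1
    return {s: total / count for s, (total, count) in sums.items()}

def slice_regressions(control, treatment, slice_key, metric_key, tolerance=0.0):
    """Slices whose mean metric dropped by more than tolerance in treatment."""
    base = metric_by_slice(control, slice_key, metric_key)
    new = metric_by_slice(treatment, slice_key, metric_key)
    return {s: new[s] - base[s] for s in base
            if s in new and new[s] - base[s] < -tolerance}

control = [{"cohort": "mature", "ndcg": 0.60}] * 3 + [{"cohort": "new", "ndcg": 0.50}]
treatment = [{"cohort": "mature", "ndcg": 0.70}] * 3 + [{"cohort": "new", "ndcg": 0.40}]

overall_control = sum(r["ndcg"] for r in control) / len(control)
overall_treatment = sum(r["ndcg"] for r in treatment) / len(treatment)
regressed = slice_regressions(control, treatment, "cohort", "ndcg")
```

Here the global average improves while the `new` cohort regresses, which is the failure mode an average-only dashboard would hide.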
Multi-objective optimization is common at Meta scale. Ranking often combines predicted click, dwell, conversion, hide, report, freshness, distance, and diversity:
$\text{score} = w_1 \, p(\text{click}) + w_2 \, p(\text{save}) - w_3 \, p(\text{hide}) - w_4 \cdot \text{cost}$
The key is to explain how weights are tuned offline, validated online, and constrained by guardrails.
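The blended score above might be sketched as follows. The weight values and the two candidates are made up for illustration; in practice the weights would come from offline tuning and online validation as described.

```python
def blended_score(preds, weights):
    """Linear blend: rewards for click and save, penalties for hide and cost."""
    return (weights["click"] * preds["p_click"]
            + weights["save"] * preds["p_save"]
            - weights["hide"] * preds["p_hide"]
            - weights["cost"] * preds["cost"])

weights = {"click": 1.0, "save": 2.0, "hide": 5.0, "cost": 0.1}  # hypothetical values
candidates = {
    "clickbait": {"p_click": 0.30, "p_save": 0.01, "p_hide": 0.05, "cost": 0.0},
    "useful": {"p_click": 0.20, "p_save": 0.05, "p_hide": 0.005, "cost": 0.0},
}

by_click = sorted(candidates, key=lambda c: candidates[c]["p_click"], reverse=True)
by_blend = sorted(candidates, key=lambda c: blended_score(candidates[c], weights),
                  reverse=True)
```

Ranking by click probability alone puts the clickbait item first, while the blend demotes it because of its higher predicted hide rate. A blend like this only makes sense when the component predictions are calibrated.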
Generative model evaluation needs both automatic and human/safety metrics. For image or multimodal generation, use prompt adherence, aesthetic quality, diversity, toxicity, policy violation rate, memorization checks, latency, and cost per generation. Automatic metrics like `FID`, `CLIPScore`, or embedding similarity are useful but can be gamed and should not replace human eval.
Post-launch monitoring should watch data drift, prediction drift, label drift, calibration drift, serving latency, feature freshness, and business guardrails. Use population stability index, embedding distribution shifts, calibration-by-time, and alerting on large deviations. Retraining should be triggered by measured degradation, not just a fixed schedule.
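One way to compute a population stability index for a single numeric feature is sketched below, assuming equal-width bins derived from the reference sample and a small floor to avoid taking the log of zero. The bin count, the floor `eps`, and the sample data are assumptions for the example.

```python
import math

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            counts[max(0, min(b, n_bins - 1))] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]   # training-time feature values
shifted = [x + 0.3 for x in reference]      # serving-time values after drift

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but alert thresholds should be set from the feature's own history rather than taken as fixed.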

Worked example

For a prompt such as “Design Nearby and Notification Ranking,” a strong candidate would start by clarifying the surface: “Are we ranking nearby friend/activity suggestions, place notifications, or general push notifications, and is the objective opens, downstream engagement, or long-term notification health?” They would declare assumptions: candidates are generated elsewhere, the ranker scores a few hundred items per user request, and the system must respect latency and notification fatigue. The answer can be organized around four pillars: offline evaluation, online experiment design, guardrails, and monitoring.

For offline evaluation, they would propose `NDCG@K` or `precision@K` for top-ranked relevance, `log loss` and calibration for send thresholds, and slice metrics by geography, user activity level, and cold-start locations. For online testing, they would define a primary metric such as qualified notification opens or downstream sessions, with guardrails like opt-out rate, hides, reports, duplicate notifications, and `p99` ranking latency. A specific tradeoff to flag is exploration versus user trust: adding randomized exploration improves counterfactual learning, but too much random notification traffic can increase disables, so exploration should be small, controlled, and excluded or propensity-corrected in evaluation. They should also call out offline/online mismatch: location freshness, time-of-day features, and notification cooldown features must be computed consistently in training and serving. A good close would be: “If I had more time, I would add long-term holdouts to measure notification fatigue, plus calibration monitoring so the send threshold remains stable after retraining.”

A second angle

For a prompt such as “Design image and multimodal generation systems,” the same evaluation discipline applies, but the metrics shift from ranking relevance to generation quality, safety, and cost. Offline metrics like `FID`, `CLIPScore`, prompt adherence classifiers, and human preference ratings help compare model checkpoints, but they do not fully predict user satisfaction or policy risk. Online experimentation would measure generation completion rate, regeneration rate, explicit negative feedback, share/save rate, safety violation rate, latency, and GPU cost per successful generation. The harder constraint is that rare but severe failures matter: a model with better average aesthetic quality may still be unlaunchable if policy violation rate or memorization risk increases. A strong MLE frames evaluation as a layered gate: automated eval, red-team/safety eval, shadow or limited rollout, then controlled A/B testing with strict guardrails.

Common pitfalls

Pitfall: Optimizing only for `CTR` or opens.

For ranking and notification systems, `CTR` is tempting because it is dense and easy to measure, but it can reward clickbait, over-notification, or low-quality engagement. A better answer pairs an engagement metric with negative feedback, long-term retention, integrity metrics, and calibration checks.

Pitfall: Treating offline metric lift as proof of launch readiness.

An offline `NDCG@10` gain can disappear online because logs are biased by the old ranker, features differ between training and serving, or the new model changes the candidate distribution. Strong candidates explicitly discuss counterfactual bias, shadow evaluation, ramped A/B tests, and slice-level regressions.

Pitfall: Communicating a metric list without a decision framework.

Interviewers do not want a catalog of every metric you know. They want to hear which metric is primary, which metrics are guardrails, what launch threshold you would use, and what you would do if metrics conflict, for example when engagement rises but opt-outs also rise.
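
A decision framework of this kind can be written down as a simple rule, sketched here under the assumption that guardrails are hard limits on metric deltas where larger is worse. The function, metric names, and limits are hypothetical.

```python
def launch_decision(primary_lift, primary_significant, guardrail_deltas, guardrail_limits):
    """Return ("block" | "launch" | "iterate", list of breached guardrails)."""
    # Guardrails are checked first: a breach blocks launch regardless of lift.
    breaches = [name for name, delta in guardrail_deltas.items()
                if delta > guardrail_limits[name]]
    if breaches:
        return "block", breaches
    if primary_significant and primary_lift > 0:
        return "launch", []
    return "iterate", []

# Engagement rises, but the opt-out rate rises past its limit.
decision, breached = launch_decision(
    primary_lift=0.02,
    primary_significant=True,
    guardrail_deltas={"opt_out_rate": 0.004, "p99_latency_ms": 3.0},
    guardrail_limits={"opt_out_rate": 0.001, "p99_latency_ms": 10.0},
)
```

Stating the rule before the experiment starts is what keeps a positive primary metric from overriding a breached guardrail after the fact.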

Connections

This topic often leads into feature store design, online serving latency, model monitoring, retraining pipelines, and candidate generation versus ranking tradeoffs. Interviewers may also pivot to multi-task learning, counterfactual learning-to-rank, or safe rollout mechanisms such as shadow mode, canaries, and gradual traffic ramps.
