PracHub

Choose and compute recommender evaluation metrics

Last updated: Mar 29, 2026

Quick Overview

This question evaluates competency in offline recommender-system evaluation, propensity-weighted policy comparison, classification metric interpretation, probability calibration, model selection trade-offs, and handling data leakage and network interference.

Company: Meta

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: HR Screen

Oct 13, 2025, 9:49 PM

Restaurant Recommender: Offline Evaluation and Modeling

Context: You are scoring p(y=1|x) with logistic regression to predict whether a user will engage with a recommended restaurant. Online A/B tests and surveys are not available yet. You need to compare two candidate models, M0 and M1, and set practical decision rules.

1) Offline Evaluation Design (No A/B)

Design a defensible offline protocol to compare M0 and M1 using historical logs collected under a prior logging policy with known propensities, so that inverse propensity scoring (IPS) can be applied. Specify:

  • The exact metric(s) to compute (e.g., IPS-weighted policy value, PR-AUC, Precision@K, calibrated Brier score).
  • How you will avoid leakage.
  • How you will choose K.
  • One pitfall of IPS when propensities are small and one mitigation.
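
As one possible illustration of the IPS-weighted policy value and the small-propensity pitfall listed above, the sketch below computes plain IPS, weight-clipped IPS, and self-normalized IPS (SNIPS) for a deterministic target policy. The `LoggedEvent` fields, the `always_thai` policy, the toy logs, and the cap `max_weight=20.0` are assumptions made for illustration and are not part of the question.

```python
from typing import Callable, Dict, NamedTuple, Sequence


class LoggedEvent(NamedTuple):
    """One logged recommendation (hypothetical schema)."""
    context: dict      # features available at decision time
    action: str        # item the logging policy showed
    reward: float      # 1.0 if the user engaged, else 0.0
    propensity: float  # probability the logging policy showed `action`


def ips_estimates(
    logs: Sequence[LoggedEvent],
    target_policy: Callable[[dict], str],
    max_weight: float = 20.0,
) -> Dict[str, float]:
    """Estimate the target policy's value from logs of another policy."""
    weights, rewards = [], []
    for ev in logs:
        # Importance weight is non-zero only when the target policy
        # would have shown the same item that was actually logged.
        match = 1.0 if target_policy(ev.context) == ev.action else 0.0
        weights.append(match / ev.propensity)
        rewards.append(ev.reward)

    n = len(logs)
    ips = sum(w * r for w, r in zip(weights, rewards)) / n
    # Clipping caps each weight: lower variance, some added bias.
    clipped = sum(min(w, max_weight) * r for w, r in zip(weights, rewards)) / n
    # SNIPS divides by the total weight instead of n.
    total_w = sum(weights)
    snips = (
        sum(w * r for w, r in zip(weights, rewards)) / total_w
        if total_w > 0 else float("nan")
    )
    return {"ips": ips, "clipped_ips": clipped, "snips": snips}


def always_thai(context: dict) -> str:
    """Toy target policy that always recommends the same item."""
    return "thai"


toy_logs = [
    LoggedEvent({"user": "u1"}, "thai", 1.0, 0.5),
    LoggedEvent({"user": "u2"}, "pizza", 0.0, 0.5),
    LoggedEvent({"user": "u3"}, "thai", 1.0, 0.01),  # rarely logged action
    LoggedEvent({"user": "u4"}, "sushi", 1.0, 0.25),
]

estimates = ips_estimates(toy_logs, always_thai)
print(estimates)  # ips = 25.5, clipped_ips = 5.5, snips = 1.0
```

In this toy log, the single event with propensity 0.01 carries a weight of 100 and pushes the plain IPS estimate to 25.5 even though rewards never exceed 1. Clipping reduces that variance at the cost of bias, and SNIPS keeps the estimate inside the reward range.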

2) Thresholded Metrics (Confusion Matrix)

On a holdout set of 1,000 recommendations at threshold 0.7, you observe:

  • TP = 120, FP = 30, TN = 820, FN = 30

Compute: precision, recall, specificity, F1, accuracy. Explain why accuracy can be misleading here and which metric aligns best if the goal is: "every shown item should be relevant."
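
The arithmetic for this part can be checked with a short sketch. The function name `confusion_metrics` and the added `majority_baseline` entry (the accuracy of always predicting the more common class) are illustrative choices, not something the question prescribes.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard thresholded metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)        # of shown items, share that were relevant
    recall = tp / (tp + fn)           # of relevant items, share that were shown
    specificity = tn / (tn + fp)      # of irrelevant items, share held back
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    # Accuracy of a trivial model that always predicts the majority class.
    majority_baseline = max(tp + fn, tn + fp) / total
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
        "majority_baseline": majority_baseline,
    }


metrics = confusion_metrics(tp=120, fp=30, tn=820, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# precision 0.8000, recall 0.8000, specificity 0.9647,
# f1 0.8000, accuracy 0.9400, majority_baseline 0.8500
```

Only 150 of the 1,000 examples are positive, so a model that never recommends anything already reaches 0.85 accuracy. That is why 0.94 looks stronger than it is, and why precision is the metric that speaks directly to "every shown item should be relevant."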

3) Calibration

Describe how you would check and fix probability calibration (e.g., reliability diagrams, Platt scaling vs. isotonic regression). Why does good calibration matter when setting a rule like "only show if score ≥ τ"?
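
A reliability diagram is a plot of the per-bin numbers computed below, so the calibration check can be sketched without any plotting library. The equal-width binning, the helper names, and the toy scores are assumptions for illustration. Expected calibration error (ECE) is one common summary of the diagram, not the only one.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per-bin (bin index, count, mean predicted prob, observed positive rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip empty bins rather than report 0/0
        mean_p = sum(p for p, _ in items) / len(items)
        frac_pos = sum(y for _, y in items) / len(items)
        rows.append((idx, len(items), mean_p, frac_pos))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between predicted and observed rates."""
    n = sum(count for _, count, _, _ in rows)
    return sum(count / n * abs(mean_p - frac_pos)
               for _, count, mean_p, frac_pos in rows)


# Toy holdout scores: the model is over-confident in its top bin.
toy_probs = [0.1, 0.1, 0.3, 0.3, 0.9, 0.9, 0.9, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1, 0, 0]

rows = reliability_table(toy_probs, toy_labels)
ece = expected_calibration_error(rows)
for idx, count, mean_p, frac_pos in rows:
    print(f"bin {idx}: n={count} predicted={mean_p:.2f} observed={frac_pos:.2f}")
print(f"ECE = {ece:.3f}")  # 0.275 on this toy data
```

In the toy data, items scored around 0.9 engage only half the time, so a rule such as "only show if score ≥ 0.7" would deliver far lower precision than the scores suggest. To fix this, one option is scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` (Platt scaling) or `method="isotonic"`, fit on a held-out calibration split. Isotonic regression is more flexible but generally needs more data to avoid overfitting.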

4) Model Choice

Justify logistic regression over more complex models. Name two failure modes (e.g., feature multicollinearity, class imbalance) and concrete fixes.
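
For the two example failure modes, common fixes are class weighting (for imbalance) and an L2 penalty (to stabilize coefficients when features are collinear). The sketch below shows only how these two terms enter the training objective; it does not fit a model. The function names, the 15% positive rate (borrowed from the confusion matrix in part 2), the constant predictions, and the coefficient values are assumptions for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def penalized_weighted_log_loss(probs, labels, coefs, class_weights, l2=0.1):
    """Class-weighted log loss plus an L2 (ridge) penalty on the coefficients."""
    eps = 1e-12  # guard against log(0)
    weighted_nll, total_weight = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        w = class_weights[y]
        weighted_nll += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        total_weight += w
    # The penalty grows with coefficient size, discouraging the large
    # offsetting coefficients that collinear features tend to produce.
    penalty = l2 * sum(c * c for c in coefs)
    return weighted_nll / total_weight + penalty


# Hypothetical setup: 150 positives out of 1,000, constant 0.15 predictions.
toy_labels = [1] * 150 + [0] * 850
toy_probs = [0.15] * 1000
toy_coefs = [0.5, -0.5]

weights = balanced_class_weights(toy_labels)
loss = penalized_weighted_log_loss(toy_probs, toy_labels, toy_coefs, weights)
print(weights)  # positives weigh about 3.33, negatives about 0.59
print(round(loss, 4))
```

With balanced weights, the 150 positives contribute as much to the loss as the 850 negatives, so a model cannot score well by ignoring the minority class. Note that reweighting shifts the predicted probabilities, so calibration should be rechecked afterwards.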

5) Network Effects and Leakage

If friend-activity features introduce interference, what offline split strategy reduces leakage (e.g., time-based, user-disjoint, or graph-clustered splits)? State the trade-offs.
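
A graph-clustered split can be sketched as follows: group users into connected components of the friendship graph, then assign whole components to train or test so that no friendship edge crosses the boundary. The user IDs, the edge list, `test_fraction`, and the smallest-components-first assignment rule are assumptions for illustration. A real social graph often has one giant component and would need a community-detection step instead of plain connected components.

```python
def connected_components(users, friendships):
    """Group users into connected components with a small union-find."""
    parent = {u: u for u in users}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in friendships:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for u in users:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())


def cluster_split(users, friendships, test_fraction=0.3):
    """Assign whole components to test until the target size is reached."""
    target = test_fraction * len(users)
    train, test = set(), set()
    # Smallest components first, so the test set can get close to the target.
    for comp in sorted(connected_components(users, friendships), key=len):
        if len(test) + len(comp) <= target:
            test.update(comp)
        else:
            train.update(comp)
    return train, test


toy_users = ["a", "b", "c", "d", "e", "f", "g"]
toy_friendships = [("a", "b"), ("b", "c"), ("d", "e")]

train_users, test_users = cluster_split(toy_users, toy_friendships)
print(sorted(train_users), sorted(test_users))
# ['a', 'b', 'c', 'd', 'e'] ['f', 'g']
```

The toy result also shows the trade-off. The split is leakage-free across friendships, but the test set ends up containing only isolated users, who may behave differently from well-connected ones. A time-based cutoff is simpler and mirrors deployment, but friends' later activity can still leak through shared features unless those features are computed strictly from data before the cutoff.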
