This question evaluates competency in offline recommender-system evaluation, propensity-weighted policy comparison, classification metric interpretation, probability calibration, model selection trade-offs, and handling data leakage and network interference.

Context: You are fitting a logistic regression that scores p(y=1|x), the probability that a user will engage with a recommended restaurant. Online A/B tests and surveys are not yet available, so you must compare two candidate models, M0 and M1, offline and set practical decision rules.
Design a defensible offline protocol to compare M0 and M1 using historical logs from a prior policy with known propensities (inverse propensity scoring), and specify the key components of that protocol.
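As a sketch of the estimator such a protocol would rest on, the snippet below computes clipped IPS and self-normalized IPS (SNIPS) values for each candidate when it is treated as a deterministic "show if score ≥ 0.7" policy. The helper name ips_value, the clipping constant, and the synthetic arrays are assumptions for illustration, not part of the exercise.

```python
import numpy as np

def ips_value(rewards, logged_propensities, new_policy_probs, clip=10.0):
    """Clipped IPS and self-normalized IPS (SNIPS) estimates of a policy's value.

    rewards: observed engagement (0/1) for each logged recommendation
    logged_propensities: probability the logging policy assigned to showing that item
    new_policy_probs: probability the candidate policy assigns to the same action
    clip: cap on importance weights (trades a little bias for much lower variance)
    """
    w = np.minimum(new_policy_probs / logged_propensities, clip)
    ips = np.mean(w * rewards)
    snips = np.sum(w * rewards) / np.sum(w)  # self-normalized variant, lower variance
    return ips, snips

# Illustrative comparison: treat each model as a deterministic "show if score >= 0.7" policy.
# All arrays below are synthetic placeholders for the logged data.
rng = np.random.default_rng(0)
n = 1000
logged_propensities = rng.uniform(0.1, 0.9, n)
rewards = rng.binomial(1, 0.3, n)
scores_m0, scores_m1 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)

for name, scores in [("M0", scores_m0), ("M1", scores_m1)]:
    pi_new = (scores >= 0.7).astype(float)  # 1 if the candidate would show the logged item, else 0
    print(name, ips_value(rewards, logged_propensities, pi_new))
```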
On a holdout set of 1,000 recommendations at a score threshold of 0.7, you observe the following confusion-matrix counts:
Compute precision, recall, specificity, F1, and accuracy. Explain why accuracy can be misleading here, and state which metric aligns best with the goal "every shown item should be relevant."
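Because the confusion-matrix counts are not reproduced above, a small helper like the hypothetical threshold_metrics below shows how the five metrics fall out of TP/FP/TN/FN; the example counts are placeholders, not the exercise's data, and merely illustrate how accuracy gets dominated by the majority class.

```python
def threshold_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts at a fixed threshold."""
    precision   = tp / (tp + fp) if (tp + fp) else 0.0  # of items shown, fraction relevant
    recall      = tp / (tp + fn) if (tp + fn) else 0.0  # of relevant items, fraction shown
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # of irrelevant items, fraction held back
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return dict(precision=precision, recall=recall,
                specificity=specificity, f1=f1, accuracy=accuracy)

# Placeholder counts (NOT the exercise's data): with ~90% negatives, accuracy is
# propped up by true negatives, while precision is the metric that matches
# "every shown item should be relevant."
print(threshold_metrics(tp=60, fp=40, tn=860, fn=40))
```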
Describe how you would check and fix probability calibration (e.g., reliability diagrams, Platt scaling vs. isotonic regression). Why does good calibration matter when setting a rule like "only show if score ≥ τ"?
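A minimal calibration check along these lines might look like the sketch below, using scikit-learn's calibration_curve for the reliability diagram and CalibratedClassifierCV for Platt scaling (method="sigmoid") versus isotonic regression; the synthetic dataset is a stand-in for the real logs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Synthetic stand-in for the logged features and engagement labels.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Reliability diagram data: observed frequency vs. mean predicted probability per bin.
prob_true, prob_pred = calibration_curve(y_te, base.predict_proba(X_te)[:, 1], n_bins=10)
print(np.c_[prob_pred, prob_true])  # rows close to the diagonal indicate good calibration

# Platt scaling (parametric sigmoid) vs. isotonic regression (non-parametric, data-hungry).
for method in ("sigmoid", "isotonic"):
    cal = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method=method, cv=5)
    cal.fit(X_tr, y_tr)
    print(method, "Brier score:", brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1]))
```

Well-calibrated scores matter for a rule like "show if score ≥ τ" because τ is then a real probability of engagement, not an arbitrary rank cutoff.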
Justify choosing logistic regression over more complex models. Name two failure modes (e.g., feature multicollinearity, class imbalance) and give concrete fixes for each.
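One possible illustration of both named failure modes and their fixes, assuming scikit-learn and a synthetic imbalanced dataset with an engineered near-duplicate feature; the vif helper is for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9], random_state=0)
X = np.c_[X, X[:, 0] + 0.01 * rng.normal(size=len(X))]  # engineered near-duplicate feature

# Failure mode 1 -- multicollinearity: variance inflation factors (VIF >~ 10 is a red flag).
def vif(X):
    corr = np.corrcoef((X - X.mean(0)) / X.std(0), rowvar=False)
    return np.diag(np.linalg.inv(corr))

print("max VIF:", vif(X).max())  # the duplicated feature inflates this dramatically

# Fixes: drop/merge redundant features, or lean on L2 regularization (C controls strength).
# Failure mode 2 -- class imbalance (~9:1 here): class_weight="balanced" re-weights the loss.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
pred = model.predict(X)
print("positive-class precision/recall:", precision_score(y, pred), recall_score(y, pred))
```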
If friend-activity features introduce interference, what offline split strategy reduces leakage (e.g., time-based, user-disjoint, or graph-clustered splits)? State the trade-offs.
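The three split strategies can be prototyped on a hypothetical log table (user_id, timestamp) plus a friendship edge list; the field names and the group-based splitting via GroupShuffleSplit are illustrative, and on a real, densely connected social graph you would likely cluster with community detection rather than connected components.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical tables: one row per logged recommendation, plus a friendship edge list.
rng = np.random.default_rng(0)
n_users = 1000
logs = pd.DataFrame({"user_id": rng.integers(0, n_users, 5000),
                     "timestamp": rng.integers(0, 10_000, 5000)})
edges = pd.DataFrame({"u": rng.integers(0, n_users, 400),
                      "v": rng.integers(0, n_users, 400)})

# 1) Time-based split: train on the past, evaluate on the future (no temporal leakage,
#    but the test period may drift in distribution).
cutoff = logs["timestamp"].quantile(0.8)
train_time, test_time = logs[logs["timestamp"] <= cutoff], logs[logs["timestamp"] > cutoff]

# 2) User-disjoint split: no user appears on both sides (tests generalization to new users).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
tr_idx, te_idx = next(gss.split(logs, groups=logs["user_id"]))

# 3) Graph-clustered split: keep each friend component in a single fold so friend-activity
#    features for test users cannot leak training users' engagement.
adj = coo_matrix((np.ones(len(edges)), (edges["u"], edges["v"])), shape=(n_users, n_users))
_, component = connected_components(adj, directed=False)
tr_g, te_g = next(gss.split(logs, groups=component[logs["user_id"]]))
print(len(tr_idx), len(te_idx), len(tr_g), len(te_g))
```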