Restaurant Recommender: Offline Evaluation and Modeling
Context: You are scoring p(y=1|x) with logistic regression to predict whether a user will engage with a recommended restaurant. Online A/B tests and surveys are not yet available. You need to compare two candidate models, M0 and M1, and set practical decision rules.
1) Offline Evaluation Design (No A/B)
Design a defensible offline protocol to compare M0 and M1 using historical logs collected under a prior policy with known propensities, so that inverse propensity scoring (IPS) can be applied. Specify:
- The exact metric(s) to compute (e.g., IPS-weighted policy value, PR-AUC, Precision@K, calibrated Brier score).
- How you will avoid leakage.
- How you will choose K.
- One pitfall of IPS when propensities are small and one mitigation.
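As a point of reference for the IPS items above, here is a minimal sketch of three common estimators: plain IPS, weight-clipped IPS, and self-normalized IPS (SNIPS). It assumes each candidate model is turned into a policy that assigns a probability to the logged action. The function name `ips_estimates`, the `clip` default, and the toy numbers are illustrative assumptions, not part of the original prompt.

```python
from typing import Sequence, Tuple


def ips_estimates(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    logging_probs: Sequence[float],
    clip: float = 10.0,
) -> Tuple[float, float, float]:
    """Return (plain IPS, clipped IPS, self-normalized IPS) value estimates.

    rewards[i]       observed engagement (0 or 1) for logged recommendation i
    target_probs[i]  probability the candidate policy would show the logged item
    logging_probs[i] propensity of the logged item under the prior policy
    clip             hypothetical cap on importance weights (a tuning choice)
    """
    n = len(rewards)
    if n == 0 or not (n == len(target_probs) == len(logging_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(p <= 0 for p in logging_probs):
        raise ValueError("logging propensities must be positive")

    # Importance weight: how much more (or less) often the candidate policy
    # would have taken the logged action than the prior policy did.
    weights = [t / p for t, p in zip(target_probs, logging_probs)]

    ips = sum(w * r for w, r in zip(weights, rewards)) / n
    # Clipping trades a small bias for a large reduction in variance.
    clipped = sum(min(w, clip) * r for w, r in zip(weights, rewards)) / n
    # Self-normalization divides by the weight mass instead of n.
    weight_sum = sum(weights)
    snips = (
        sum(w * r for w, r in zip(weights, rewards)) / weight_sum
        if weight_sum > 0
        else float("nan")
    )
    return ips, clipped, snips


if __name__ == "__main__":
    # Toy numbers, not real data: the third row has a propensity of 0.01.
    rewards = [1, 0, 1, 0]
    target_probs = [0.5, 0.5, 0.2, 0.8]
    logging_probs = [0.5, 0.25, 0.01, 0.8]
    print(ips_estimates(rewards, target_probs, logging_probs))
```

In the toy data, the single row with propensity 0.01 carries a weight of about 20 and dominates the plain IPS estimate, which is the small-propensity pitfall in miniature. Clipping and self-normalization are two of several possible mitigations.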
2) Thresholded Metrics (Confusion Matrix)
On a holdout set of 1,000 recommendations at threshold 0.7, you observe:
- TP = 120, FP = 30, TN = 820, FN = 30
Compute: precision, recall, specificity, F1, accuracy. Explain why accuracy can be misleading here and which metric aligns best if the goal is: "every shown item should be relevant."
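The requested quantities follow directly from the four counts. A short sketch of the arithmetic, using the standard definitions (the function name `confusion_metrics` is an illustrative choice):

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard thresholded metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)      # of items shown, the fraction that were relevant
    recall = tp / (tp + fn)         # of relevant items, the fraction that were shown
    specificity = tn / (tn + fp)    # of irrelevant items, the fraction held back
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    metrics = confusion_metrics(tp=120, fp=30, tn=820, fn=30)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    # precision: 0.800
    # recall: 0.800
    # specificity: 0.965
    # f1: 0.800
    # accuracy: 0.940
```

Only 150 of the 1,000 examples are positive, so a model that never recommends anything would still reach 850/1000 = 0.85 accuracy. That gap between 0.94 accuracy and 0.80 precision is why accuracy alone says little here.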
3) Calibration
Describe how you would check and fix probability calibration (e.g., reliability diagrams, Platt scaling vs. isotonic regression). Why does good calibration matter when setting a rule like "only show if score ≥ τ"?
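One way to make the calibration check concrete is to tabulate the data behind a reliability diagram and summarize it as an expected calibration error (ECE). The sketch below uses equal-width bins. The function names, the bin count, and the choice of ECE as the summary statistic are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def reliability_table(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a score of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    rows = []
    for members in bins:
        if not members:
            continue
        count = len(members)
        mean_pred = sum(p for _, p in members) / count
        frac_pos = sum(y for y, _ in members) / count
        rows.append((mean_pred, frac_pos, count))
    return rows


def expected_calibration_error(rows: Sequence[Tuple[float, float, int]]) -> float:
    """Count-weighted average gap between predicted and observed rates."""
    total = sum(count for _, _, count in rows)
    return sum(count / total * abs(mp - fp) for mp, fp, count in rows)


if __name__ == "__main__":
    # Toy scores, not real data.
    y_true = [0, 0, 1, 1, 1, 0]
    y_prob = [0.1, 0.2, 0.8, 0.9, 0.7, 0.75]
    table = reliability_table(y_true, y_prob, n_bins=5)
    for mean_pred, frac_pos, count in table:
        print(f"pred={mean_pred:.2f} observed={frac_pos:.2f} n={count}")
    print(f"ECE={expected_calibration_error(table):.3f}")
```

Plotting observed rate against mean prediction per bin gives the reliability diagram; a well-calibrated model sits near the diagonal. For the fix, scikit-learn's `CalibratedClassifierCV` supports both `method="sigmoid"` (Platt scaling) and `method="isotonic"`, fitted on data held out from model training.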
4) Model Choice
Justify logistic regression over more complex models. Name two failure modes (e.g., feature multicollinearity, class imbalance) and concrete fixes.
5) Network Effects and Leakage
If friend-activity features introduce interference, what offline split strategy reduces leakage (e.g., time-based, user-disjoint, or graph-clustered splits)? State the trade-offs.
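Two of the split strategies named above can be sketched in a few lines. The example assumes each log row is a dict with hypothetical `user` and `ts` fields and that the friend graph is given as a list of user pairs. The graph-clustered variant here uses connected components as clusters, which is the simplest possible choice and only one of several.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def time_split(rows: Iterable[Row], cutoff) -> Tuple[List[Row], List[Row]]:
    """Train on rows strictly before the cutoff, test on rows at or after it."""
    rows = list(rows)
    train = [r for r in rows if r["ts"] < cutoff]
    test = [r for r in rows if r["ts"] >= cutoff]
    return train, test


def connected_components(users: Iterable[str], edges: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each user to a representative of their friend-graph component (union-find)."""
    parent = {u: u for u in users}

    def find(u: str) -> str:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        parent[find(a)] = find(b)
    return {u: find(u) for u in parent}


def graph_cluster_split(
    rows: Iterable[Row], friend_edges: Iterable[Tuple[str, str]], test_fraction: float = 0.2
) -> Tuple[List[Row], List[Row]]:
    """Send whole friend-graph components to one side so no friendship crosses the split."""
    rows = list(rows)
    friend_edges = list(friend_edges)
    users = {r["user"] for r in rows} | {u for edge in friend_edges for u in edge}
    component = connected_components(users, friend_edges)

    def in_test(root: str) -> bool:
        # Hash the component representative into a bucket in [0, 100).
        digest = hashlib.sha256(str(root).encode()).hexdigest()
        return int(digest, 16) % 100 < test_fraction * 100

    train = [r for r in rows if not in_test(component[r["user"]])]
    test = [r for r in rows if in_test(component[r["user"]])]
    return train, test


if __name__ == "__main__":
    # Toy rows, not real data.
    logs = [{"user": u, "ts": t} for t, u in enumerate(["a", "b", "c", "d", "e", "a"])]
    friends = [("a", "b"), ("b", "c"), ("d", "e")]
    train, test = graph_cluster_split(logs, friends, test_fraction=0.5)
    print(len(train), len(test))
```

A caveat on the design: real friend graphs often contain one very large component, in which case component-level assignment degenerates and a community-detection step would be needed to form smaller clusters, at the cost of some edges crossing the split. The time-based split keeps the evaluation faithful to deployment order but does not by itself remove interference between friends.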