This question evaluates competency in offline recommender-system evaluation, propensity-weighted policy comparison, classification metric interpretation, probability calibration, model selection trade-offs, and handling data leakage and network interference.

Context: You are fitting a logistic regression that scores p(y=1|x), the probability that a user will engage with a recommended restaurant. Online A/B tests and surveys are not yet available, so you must compare two candidate models, M0 and M1, offline and set practical decision rules.
Design a defensible offline protocol to compare M0 and M1 using historical logs from a prior policy with known propensities (inverse propensity scoring), and specify the key components of that protocol.
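As a sketch of the estimator such a protocol would rest on, the snippet below computes clipped IPS and self-normalized IPS (SNIPS) values for each candidate when it is treated as a deterministic "show if score ≥ 0.7" policy. The helper name ips_value, the clipping constant, and the synthetic arrays are assumptions for illustration, not part of the exercise.

```python
import numpy as np

def ips_value(rewards, logged_propensities, new_policy_probs, clip=10.0):
    """Clipped IPS and self-normalized IPS (SNIPS) estimates of a policy's value.

    rewards: observed engagement (0/1) for each logged recommendation
    logged_propensities: probability the logging policy assigned to showing that item
    new_policy_probs: probability the candidate policy assigns to the same action
    clip: cap on importance weights (trades a little bias for much lower variance)
    """
    w = np.minimum(new_policy_probs / logged_propensities, clip)
    ips = np.mean(w * rewards)
    snips = np.sum(w * rewards) / np.sum(w)  # self-normalized variant, lower variance
    return ips, snips

# Illustrative comparison: treat each model as a deterministic "show if score >= 0.7" policy.
# All arrays below are synthetic placeholders for the logged data.
rng = np.random.default_rng(0)
n = 1000
logged_propensities = rng.uniform(0.1, 0.9, n)
rewards = rng.binomial(1, 0.3, n)
scores_m0, scores_m1 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)

for name, scores in [("M0", scores_m0), ("M1", scores_m1)]:
    pi_new = (scores >= 0.7).astype(float)  # 1 if the candidate would show the logged item, else 0
    print(name, ips_value(rewards, logged_propensities, pi_new))
```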
On a holdout set of 1,000 recommendations at a score threshold of 0.7, you observe the following confusion-matrix counts:
Compute precision, recall, specificity, F1, and accuracy. Explain why accuracy can be misleading here, and state which metric aligns best with the goal "every shown item should be relevant."
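Because the confusion-matrix counts are not reproduced above, a small helper like the hypothetical threshold_metrics below shows how the five metrics fall out of TP/FP/TN/FN; the example counts are placeholders, not the exercise's data, and merely illustrate how accuracy gets dominated by the majority class.

```python
def threshold_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts at a fixed threshold."""
    precision   = tp / (tp + fp) if (tp + fp) else 0.0  # of items shown, fraction relevant
    recall      = tp / (tp + fn) if (tp + fn) else 0.0  # of relevant items, fraction shown
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # of irrelevant items, fraction held back
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return dict(precision=precision, recall=recall,
                specificity=specificity, f1=f1, accuracy=accuracy)

# Placeholder counts (NOT the exercise's data): with ~90% negatives, accuracy is
# propped up by true negatives, while precision is the metric that matches
# "every shown item should be relevant."
print(threshold_metrics(tp=60, fp=40, tn=860, fn=40))
```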
Describe how you would check and fix probability calibration (e.g., reliability diagrams, Platt scaling vs. isotonic regression). Why does good calibration matter when setting a rule like "only show if score ≥ τ"?
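A minimal calibration check along these lines might look like the sketch below, using scikit-learn's calibration_curve for the reliability diagram and CalibratedClassifierCV for Platt scaling (method="sigmoid") versus isotonic regression; the synthetic dataset is a stand-in for the real logs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Synthetic stand-in for the logged features and engagement labels.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Reliability diagram data: observed frequency vs. mean predicted probability per bin.
prob_true, prob_pred = calibration_curve(y_te, base.predict_proba(X_te)[:, 1], n_bins=10)
print(np.c_[prob_pred, prob_true])  # rows close to the diagonal indicate good calibration

# Platt scaling (parametric sigmoid) vs. isotonic regression (non-parametric, data-hungry).
for method in ("sigmoid", "isotonic"):
    cal = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method=method, cv=5)
    cal.fit(X_tr, y_tr)
    print(method, "Brier score:", brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1]))
```

Well-calibrated scores matter for a rule like "show if score ≥ τ" because τ is then a real probability of engagement, not an arbitrary rank cutoff.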
Justify choosing logistic regression over more complex models. Name two failure modes (e.g., feature multicollinearity, class imbalance) and give concrete fixes for each.
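One possible illustration of both named failure modes and their fixes, assuming scikit-learn and a synthetic imbalanced dataset with an engineered near-duplicate feature; the vif helper is for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9], random_state=0)
X = np.c_[X, X[:, 0] + 0.01 * rng.normal(size=len(X))]  # engineered near-duplicate feature

# Failure mode 1 -- multicollinearity: variance inflation factors (VIF >~ 10 is a red flag).
def vif(X):
    corr = np.corrcoef((X - X.mean(0)) / X.std(0), rowvar=False)
    return np.diag(np.linalg.inv(corr))

print("max VIF:", vif(X).max())  # the duplicated feature inflates this dramatically

# Fixes: drop/merge redundant features, or lean on L2 regularization (C controls strength).
# Failure mode 2 -- class imbalance (~9:1 here): class_weight="balanced" re-weights the loss.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
pred = model.predict(X)
print("positive-class precision/recall:", precision_score(y, pred), recall_score(y, pred))
```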
If friend-activity features introduce interference, what offline split strategy reduces leakage (e.g., time-based, user-disjoint, or graph-clustered splits)? State the trade-offs.
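The three split strategies can be prototyped on a hypothetical log table (user_id, timestamp) plus a friendship edge list; the field names and the group-based splitting via GroupShuffleSplit are illustrative, and on a real, densely connected social graph you would likely cluster with community detection rather than connected components.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical tables: one row per logged recommendation, plus a friendship edge list.
rng = np.random.default_rng(0)
n_users = 1000
logs = pd.DataFrame({"user_id": rng.integers(0, n_users, 5000),
                     "timestamp": rng.integers(0, 10_000, 5000)})
edges = pd.DataFrame({"u": rng.integers(0, n_users, 400),
                      "v": rng.integers(0, n_users, 400)})

# 1) Time-based split: train on the past, evaluate on the future (no temporal leakage,
#    but the test period may drift in distribution).
cutoff = logs["timestamp"].quantile(0.8)
train_time, test_time = logs[logs["timestamp"] <= cutoff], logs[logs["timestamp"] > cutoff]

# 2) User-disjoint split: no user appears on both sides (tests generalization to new users).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
tr_idx, te_idx = next(gss.split(logs, groups=logs["user_id"]))

# 3) Graph-clustered split: keep each friend component in a single fold so friend-activity
#    features for test users cannot leak training users' engagement.
adj = coo_matrix((np.ones(len(edges)), (edges["u"], edges["v"])), shape=(n_users, n_users))
_, component = connected_components(adj, directed=False)
tr_g, te_g = next(gss.split(logs, groups=component[logs["user_id"]]))
print(len(tr_idx), len(te_idx), len(tr_g), len(te_g))
```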