Offline Evaluation of a New Recommendation Feature
Scenario
You need to estimate the business value of a new recommendation-system feature using only historical data, before any live deployment.
Task
Describe how you would evaluate whether releasing this feature is a good or bad idea without running an A/B test. Specify:
- The analyses you would run and in what order
- The metrics you would use
- The assumptions required for each method
- How you would validate the conclusions
Constraints
- Use only historical logs.
- Do not propose live randomized experiments.
Hints
- Offline log replay / counterfactual (off-policy) evaluation
- Inverse propensity scoring (IPS), self-normalized IPS (SNIPS), doubly robust (DR)
- Uplift or propensity modeling
- Simulation with a user-response model
- Historical hold-out metrics (e.g., NDCG, recall) and their calibration to online outcomes
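To make the IPS, SNIPS, and DR hint concrete, here is a minimal sketch of the three estimators in Python with NumPy. It rests on assumptions that are not stated in the prompt: each logged impression records a reward (for example, a click), the probability with which the logging policy showed the logged item, and the probability the new policy would assign to that same item. The DR variant also assumes a separately fitted reward model. All function and argument names are hypothetical and chosen for illustration only.

```python
import numpy as np


def ips_estimate(rewards, logging_probs, target_probs, clip=None):
    """Inverse propensity scoring: reweight each logged reward by
    target_prob / logging_prob. Optional clipping caps the weights,
    trading some bias for lower variance."""
    weights = target_probs / logging_probs
    if clip is not None:
        weights = np.minimum(weights, clip)
    return float(np.mean(weights * rewards))


def snips_estimate(rewards, logging_probs, target_probs):
    """Self-normalized IPS: divide by the sum of the weights instead of
    the sample size, which usually reduces variance at the cost of a
    small bias."""
    weights = target_probs / logging_probs
    return float(np.sum(weights * rewards) / np.sum(weights))


def dr_estimate(rewards, logging_probs, target_probs, q_logged, q_target):
    """Doubly robust: start from the reward model's prediction under the
    target policy and add an importance-weighted correction on the
    residual of the logged action.

    q_logged: model-predicted reward for the action actually logged.
    q_target: model-predicted expected reward under the target policy
              (averaged over the target policy's action distribution).
    """
    weights = target_probs / logging_probs
    return float(np.mean(q_target + weights * (rewards - q_logged)))


# Toy logs (fabricated purely for illustration, not real data).
rewards = np.array([1.0, 0.0, 1.0, 0.0])
logging_probs = np.array([0.5, 0.5, 0.25, 0.25])
target_probs = np.array([0.75, 0.25, 0.5, 0.5])

print("IPS:  ", ips_estimate(rewards, logging_probs, target_probs))
print("SNIPS:", snips_estimate(rewards, logging_probs, target_probs))
```

All three estimators need the logging policy to have given nonzero probability to every item the new policy might show (overlap), and they need the logged propensities to be accurate. If the logs contain no propensities, they would have to be estimated with a propensity model, which adds another source of error. Comparing IPS against SNIPS and DR on the same logs, and checking how much the estimate moves under weight clipping, is one way to see how much it depends on a few large weights.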