Churn Prediction Model: Leakage, Validation, KPIs, Interpretation, Monitoring
Context: You inherit a weekly-scored model that predicts whether a user will place an order in the next 28 days. Some features were built from logs in ways that leak information from the post-prediction label window. Address the following tasks.
(a) Leakage identification and repair
- Identify at least five concrete leakage sources that can occur in a logs-based feature set (e.g., features derived from orders within the label window, coupon_applied in the next 28 days, post-treatment delivery ETA, label-influenced support contacts).
- For each, rewrite the feature so it is computable at prediction time with a strict event-time cutoff.
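As an illustration of a strict event-time cutoff, the sketch below recomputes an order-count feature using only events that happened before the prediction timestamp. The column names user_id and event_time, the function name, and the sample data are assumptions made for the example; they do not come from the inherited pipeline.

```python
import pandas as pd

def orders_before_cutoff(orders: pd.DataFrame, cutoff: pd.Timestamp,
                         window_days: int = 7) -> pd.Series:
    """Count each user's orders in the half-open window [cutoff - window_days, cutoff)."""
    start = cutoff - pd.Timedelta(days=window_days)
    # Strict filter: an event at or after the cutoff is never visible to the feature.
    visible = orders[(orders["event_time"] >= start) & (orders["event_time"] < cutoff)]
    return visible.groupby("user_id").size().rename(f"past_{window_days}d_orders")

# Hypothetical order log (assumed schema: one row per order).
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-02", "2024-01-06", "2024-01-09", "2024-01-07", "2024-01-08"]
    ),
})
cutoff = pd.Timestamp("2024-01-08")
features = orders_before_cutoff(orders, cutoff)
```

Here user 1's order on 2024-01-09 and user 2's order at the cutoff itself are excluded. Users with no visible events are absent from the result and would need to be reindexed against the scoring cohort with a zero fill. The same pattern applies to the other leaky features: filter on event time first, then aggregate.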
(b) Time-based cross-validation (rolling origin)
- Propose a rolling origin cross-validation scheme and define train/validation/test splits that ensure no future leakage.
- Specify how to handle users entering/leaving the cohort across time and how to handle cold-start users.
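One possible way to lay out the folds is sketched below. It covers only the date arithmetic: a training snapshot is allowed into a fold only if its 28-day label window has fully closed by the validation cutoff. The function name, the fixed-length training window, and the default of eight weekly snapshots are assumptions for illustration; an expanding window would be an equally valid choice.

```python
from datetime import date, timedelta

def rolling_origin_folds(first_val_cutoff, n_folds, step_days=7,
                         label_days=28, n_train_cutoffs=8):
    """Return a list of (train_cutoffs, validation_cutoff) pairs.

    Every training cutoff satisfies train_cutoff + label_days <= validation_cutoff,
    so no training label is observed after the validation prediction time.
    """
    folds = []
    for k in range(n_folds):
        val_cutoff = first_val_cutoff + timedelta(days=k * step_days)
        # Most recent snapshot whose label window has closed by the validation cutoff.
        latest_train = val_cutoff - timedelta(days=label_days)
        train = sorted(
            latest_train - timedelta(days=i * step_days)
            for i in range(n_train_cutoffs)
        )
        folds.append((train, val_cutoff))
    return folds

# Hypothetical start date; each fold moves the origin forward by one scoring week.
folds = rolling_origin_folds(date(2024, 6, 3), n_folds=3)
```

A final test period would sit after the last validation cutoff, separated by the same 28-day gap. Cohort membership (and any cold-start flag) would be decided per cutoff from information available before that cutoff, so a user who joins later never appears in an earlier snapshot.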
(c) KPIs, thresholding, and calibration
- Offline metrics are AUC = 0.79 and PR-AUC = 0.23. Online, targeting the top decile increases conversion by 1.5% but raises cancellations by 0.3pp.
- Define business KPIs and a cost-sensitive objective to tune the decision threshold.
- Include probability calibration (Platt scaling or isotonic regression) and explain how you would check for calibration drift in production.
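The two ideas above can be sketched as follows: a threshold sweep that maximizes a net-value objective on validation data, and a binned calibration gap that can be recomputed on each matured scoring week to watch for drift. The gain and cost figures, function names, and sample arrays are hypothetical. Fitting the calibrator itself (for example with scikit-learn's CalibratedClassifierCV, using method="sigmoid" for Platt scaling or method="isotonic") is assumed to happen separately.

```python
import numpy as np

def best_threshold(y_true, p_hat, gain_tp, cost_fp, thresholds):
    """Pick the threshold that maximizes gain_tp * TP - cost_fp * FP."""
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat, dtype=float)
    best_t, best_value = None, -np.inf
    for t in thresholds:
        targeted = p_hat >= t
        tp = np.sum(targeted & (y_true == 1))
        fp = np.sum(targeted & (y_true == 0))
        value = float(gain_tp * tp - cost_fp * fp)
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Weighted mean |observed rate - mean predicted probability| over equal-width bins."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Interior edges only, so digitize returns bin indices 0 .. n_bins - 1.
    interior = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_idx = np.digitize(p_hat, interior)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_hat[mask].mean())
    return float(ece)

# Hypothetical validation labels and calibrated scores.
y = [1, 0, 1, 0, 0, 1]
p = [0.9, 0.8, 0.7, 0.4, 0.2, 0.6]
t_star, net_value = best_threshold(y, p, gain_tp=5.0, cost_fp=2.0,
                                   thresholds=[0.5, 0.75, 0.95])
ece = expected_calibration_error(y, p, n_bins=2)
```

In practice the gain and cost terms would be derived from the business KPIs, for instance incremental margin per converted order against the cost of a contact plus the expected cost of an induced cancellation. A sustained rise in the calibration gap on matured labels is one possible signal to refit the calibrator.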
(d) Interpretation
- A standardized feature past_7d_orders has a logistic regression coefficient of 0.40 and a baseline log-odds of churn of −1.50.
- Compute the odds ratio for a +1 SD increase and the resulting change in churn probability from the baseline.
- Explain the limitations of such ceteris paribus interpretations in correlated feature settings.
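The arithmetic for the stated coefficient and baseline can be checked with a few lines. This is a worked calculation using only the two numbers given above; the variable names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

baseline_log_odds = -1.50
coef = 0.40  # effect of a +1 SD increase in the standardized feature

odds_ratio = math.exp(coef)                 # about 1.49
p_base = sigmoid(baseline_log_odds)         # about 0.182
p_new = sigmoid(baseline_log_odds + coef)   # about 0.250
delta = p_new - p_base                      # about +0.067, i.e. roughly 6.7 pp

print(f"odds ratio={odds_ratio:.3f}, p_base={p_base:.3f}, "
      f"p_new={p_new:.3f}, delta={delta:.3f}")
```

The odds ratio is constant, but the probability change is not: the same +0.40 shift in log-odds moves the probability by a different amount from a different baseline. The calculation also holds every other feature fixed, which is rarely achievable when past_7d_orders moves together with correlated features.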
(e) Monitoring and retraining
- Outline a monitoring plan (data quality, feature distributions, PSI, label delay, prediction drift) and a retraining policy tied to performance and covariate shift triggers.
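For the PSI component of the plan, a minimal sketch is shown below. It bins a reference sample into quantile buckets and compares the bucket shares of a current sample. The function name, the ten-bucket default, the epsilon floor, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Quantile edges from the reference sample; using interior edges only
    # leaves the first and last bins open-ended.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_share = np.bincount(np.digitize(reference, interior), minlength=n_bins) / reference.size
    cur_share = np.bincount(np.digitize(current, interior), minlength=n_bins) / current.size
    # Floor the shares so empty bins do not produce log(0) or division by zero.
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))

# Synthetic example: an unchanged sample versus one shifted by one standard deviation.
rng = np.random.default_rng(0)
reference = rng.normal(size=5000)
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, reference + 1.0)
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees, so any retraining trigger would be tuned against observed performance once labels mature after the 28-day delay.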