Churn Prediction on Messy Subscription Data
Context
You are building a binary churn-prediction model for a subscription product. Historical customer-level data covers usage/activity, billing/payments, support interactions, demographics, and plan details. The data is messy: many fields have missing values, the classes are imbalanced (churn is rarer than non-churn), and features are time-dependent. The goal is to predict whether a customer will churn in the next period (e.g., the next 30 days) using only information available up to a cutoff date.
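The cutoff can be made concrete by splitting each customer's history into a feature window (events on or before the cutoff) and a label window (the period right after it). The following is a minimal sketch under stated assumptions: the event layout, the `cancel_date` field, the 30-day horizon, and the helper name `build_example` are all hypothetical and not part of the dataset described here.

```python
from datetime import date, timedelta

HORIZON_DAYS = 30  # assumed prediction window ("next 30 days")


def build_example(events, cancel_date, cutoff, horizon_days=HORIZON_DAYS):
    """Return (features, label) for one customer at a given cutoff date.

    events: list of (event_date, event_type) tuples (hypothetical layout).
    cancel_date: date the customer cancelled or failed to renew, or None.
    """
    # Features may only use events observed on or before the cutoff.
    visible = [e for e in events if e[0] <= cutoff]
    last_seen = max((d for d, _ in visible), default=None)
    features = {
        "n_events": len(visible),
        # None marks "no activity observed", which may itself be informative.
        "days_since_last_event": (cutoff - last_seen).days if last_seen else None,
    }
    # The label looks strictly after the cutoff, up to the horizon.
    window_end = cutoff + timedelta(days=horizon_days)
    label = int(cancel_date is not None and cutoff < cancel_date <= window_end)
    return features, label


cutoff = date(2024, 3, 31)
events = [
    (date(2024, 3, 1), "login"),
    (date(2024, 3, 20), "login"),
    (date(2024, 4, 5), "support_ticket"),  # after the cutoff: must not leak into features
]
features, label = build_example(events, cancel_date=date(2024, 4, 15), cutoff=cutoff)
print(features, label)
```

The sketch does not handle customers who had already cancelled before the cutoff; in practice those rows would usually be excluded from the training set for that cutoff.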
Assumptions:
- Binary target: churn = 1 if a customer cancels or fails to renew in the next period; 0 otherwise.
- Temporal validation is required (train on earlier periods, validate on later periods).
- Some missingness is likely not at random (e.g., missing usage could reflect inactivity).
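The last two assumptions can be illustrated together: split by period first, then learn imputation values from the training rows only and keep a flag that records where a value was missing. This is a rough sketch, not a prescribed solution; the row layout, the column name `logins`, and the helper names are assumptions made for illustration, and median imputation is only one of several reasonable choices.

```python
from statistics import median


def temporal_split(rows, split_period):
    """Train on periods strictly before split_period, validate on the rest."""
    train = [r for r in rows if r["period"] < split_period]
    valid = [r for r in rows if r["period"] >= split_period]
    return train, valid


def fit_medians(rows, columns):
    """Per-column median of the observed (non-None) values."""
    return {c: median(r[c] for r in rows if r[c] is not None) for c in columns}


def impute_with_flags(rows, fill_values):
    """Return copies of rows with a `<col>_missing` flag and gaps filled."""
    out = []
    for r in rows:
        new = dict(r)
        for col, fill in fill_values.items():
            # Keep the fact that the value was missing as its own feature.
            new[f"{col}_missing"] = int(r[col] is None)
            if r[col] is None:
                new[col] = fill
        out.append(new)
    return out


# Toy rows; "period" uses YYYY-MM strings, which sort chronologically.
rows = [
    {"period": "2024-01", "logins": 12, "churn": 0},
    {"period": "2024-01", "logins": None, "churn": 1},
    {"period": "2024-02", "logins": 4, "churn": 0},
    {"period": "2024-02", "logins": 8, "churn": 0},
    {"period": "2024-03", "logins": None, "churn": 1},
    {"period": "2024-03", "logins": 20, "churn": 0},
]

train, valid = temporal_split(rows, split_period="2024-03")
medians = fit_medians(train, ["logins"])  # learned from training rows only
train_x = impute_with_flags(train, medians)
valid_x = impute_with_flags(valid, medians)
print(medians)
print(valid_x[0])
```

Fitting the medians on the training periods and reusing them on the validation periods avoids leaking later information into earlier rows. The `logins_missing` flag lets a model pick up on missingness that is not at random.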
Tasks
- How would you handle missing values in the training data and justify your approach?
- Given this churn-prediction problem, which ML algorithm would you choose and why?
- Explain how Random Forest works, including voting, feature bagging, and depth control.
- Define overfitting vs. underfitting and describe techniques to detect and mitigate each.
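As a reference point for the Random Forest task, the three named mechanisms can be sketched in a few dozen lines of plain Python: each tree is trained on a bootstrap sample, each split considers only a random subset of features (feature bagging), growth stops at `max_depth` (depth control), and the forest predicts by majority vote. This is a toy illustration under assumptions, not how production libraries implement it; the function names, the two features (logins, support tickets), and the data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(X, y, depth, max_depth, n_feats, rng):
    """Grow one tree; each split considers only a random subset of features."""
    if depth >= max_depth or len(set(y)) == 1:
        return majority(y)  # leaf: predict the majority class
    best = None
    for f in rng.sample(range(len(X[0])), n_feats):  # feature bagging
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority(y)
    _, f, t, left, right = best
    grow = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                  depth + 1, max_depth, n_feats, rng)
    return (f, t, grow(left), grow(right))


def predict_tree(node, row):
    # Internal nodes are (feature, threshold, left, right); leaves are labels.
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, max_depth=3, n_feats=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 0, max_depth, n_feats, rng))
    return forest


def predict_forest(forest, row):
    return majority([predict_tree(tree, row) for tree in forest])  # majority vote


# Toy data: [logins, support_tickets]; churners log in rarely and file more tickets.
X = [[1, 5], [2, 4], [3, 6], [2, 7], [10, 1], [12, 0], [9, 2], [11, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(X, y)
print(predict_forest(forest, [1, 7]), predict_forest(forest, [12, 0]))
```

Setting `max_depth=0` reduces every tree to a single leaf, so the forest returns the same class for every customer, which is a simple picture of underfitting. Very deep trees on noisy data push in the opposite direction, and comparing scores on the training periods against scores on later validation periods is a common way to see which regime a model is in.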