This question evaluates a data scientist's ability to work with messy, time-dependent subscription data: choosing missing-value strategies, handling class imbalance, validating temporally, engineering features, selecting models, and understanding ensemble methods and model generalization (Random Forest internals, overfitting and underfitting).
You are building a binary churn-prediction model for a subscription product. Historical customer-level data covers usage/activity, billing/payments, support interactions, demographics, and plan details. The data is messy: many fields have missing values, the classes are imbalanced (churn is rarer than non-churn), and features are time-dependent. The goal is to predict whether a customer will churn in the next period (e.g., the next 30 days) using only information available up to a cutoff date.
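The cutoff-date constraint is the part of this setup that is easiest to get wrong, so a small sketch may help. It builds one row per customer at a given cutoff, computing features only from events on or before the cutoff and the label only from the window after it. The record layout (`signup_date`, `churn_date`, event `date`), the function name `build_snapshot`, and the two features are hypothetical choices for illustration, not part of the problem statement.

```python
from datetime import date, timedelta


def build_snapshot(customers, events, cutoff, horizon_days=30):
    """Return one row per customer still subscribed at `cutoff`.

    Features use only events dated on or before `cutoff`; the label
    looks only at the window (cutoff, cutoff + horizon_days].
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    lookback_start = cutoff - timedelta(days=30)
    rows = []
    for cust in customers:
        churn_date = cust["churn_date"]  # None means still subscribed
        # Skip customers who had not signed up yet or had already churned.
        if cust["signup_date"] > cutoff:
            continue
        if churn_date is not None and churn_date <= cutoff:
            continue
        # Events after the cutoff are dropped here to avoid leakage.
        past = [e["date"] for e in events
                if e["customer_id"] == cust["id"] and e["date"] <= cutoff]
        last_seen = max(past, default=None)
        rows.append({
            "customer_id": cust["id"],
            "events_last_30d": sum(d > lookback_start for d in past),
            # Left as None when there is no activity; the flag below
            # keeps that missingness visible to the model.
            "days_since_last_event":
                (cutoff - last_seen).days if last_seen else None,
            "never_active": last_seen is None,
            "churned_next_period":
                int(churn_date is not None and churn_date <= horizon_end),
        })
    return rows


# Made-up toy data, for illustration only.
customers = [
    {"id": 1, "signup_date": date(2023, 1, 1), "churn_date": date(2023, 3, 20)},
    {"id": 2, "signup_date": date(2023, 1, 15), "churn_date": None},
    {"id": 3, "signup_date": date(2023, 2, 10), "churn_date": date(2023, 2, 20)},
    {"id": 4, "signup_date": date(2023, 2, 25), "churn_date": None},
]
events = [
    {"customer_id": 1, "date": date(2023, 2, 10)},
    {"customer_id": 1, "date": date(2023, 2, 25)},
    {"customer_id": 1, "date": date(2023, 3, 10)},  # after cutoff, ignored
    {"customer_id": 2, "date": date(2023, 1, 20)},
    {"customer_id": 2, "date": date(2023, 2, 28)},
]
rows = build_snapshot(customers, events, cutoff=date(2023, 3, 1))
```

Calling the same function with an earlier cutoff for training rows and a later cutoff for validation rows gives a time-ordered split, which fits this setting better than a random split because the validation labels then lie strictly in the training data's future.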
Assumptions: