CLV_90 Prediction Pipeline under Zero-Inflation, Heavy Tails, and Multicollinearity
Context
You need to predict 90-day customer value (CLV_90) at the user level. Targets have many zeros and a heavy right tail. Features are available up to a fixed cutoff date per user and include:
- Continuous channel spends: Spend_SEM, Spend_Social, Spend_Display
- RFM features (recency, frequency, monetary)
- Device, region, tenure
- Dozens of sparse, high-cardinality campaign IDs
Known issues:
- Strong multicollinearity among spend channels
- Heteroskedastic errors
- Nonlinear effects
Task
Design an end-to-end regression pipeline that:
- Chooses and justifies an appropriate loss/distribution (e.g., log-link GLM, Tweedie, zero-inflated/hurdle, quantile regression, or gradient-boosted regression).
- Performs feature processing: log1p transforms, standardization, rare-category bucketing, and target encoding for high-cardinality features using K-fold out-of-fold encoding to avoid leakage.
- Handles multicollinearity and feature selection with ridge/lasso/elastic net. Write the elastic net objective with α and λ, and explain when each penalty dominates.
- Sets up time-based nested cross-validation that prevents leakage from future signals and campaign overlap.
- Diagnoses model assumptions and mitigations: heteroskedasticity (e.g., robust SE, variance-stabilizing transforms, Tweedie mean–variance), nonlinearity (splines/interactions), and influential outliers (Huber/Tukey loss).
- Provides calibrated 95% prediction intervals for CLV_90 (e.g., conformal prediction or bootstrap) and compares them to OLS analytic intervals.
- Interprets effects: standardized coefficients for a regularized linear model versus SHAP values for a tree model.
- Quantifies and mitigates data leakage risks (e.g., target leakage via post-cutoff features, look-ahead in target encoding).
- Compares performance and interpretability trade-offs between a regularized linear model and a gradient-boosting model, and specifies metrics robust to heavy tails (e.g., MAE, quantile loss, MAPE with epsilon).
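On the loss/distribution bullet: a compound Poisson–gamma Tweedie family is a natural candidate here because its mean–variance relation matches zero-inflated, heavy-tailed value data. As a reminder (standard Tweedie facts, not specific to this problem):

```latex
\operatorname{Var}(Y) = \phi\,\mu^{p}, \qquad 1 < p < 2,
```

where φ is the dispersion and p the power parameter. For p ∈ (1, 2) the distribution places a point mass at zero (never-purchasing users) plus a continuous density on positive values; p = 1 recovers (overdispersed) Poisson and p = 2 the gamma distribution, so p can be tuned on the validation folds.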
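The elastic net objective requested above, written in the common glmnet-style parameterization with mixing parameter α ∈ [0, 1] and overall strength λ:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\!\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

As α → 1 the L1 term dominates: coefficients are driven exactly to zero (feature selection), but selection among the highly correlated spend channels is unstable. As α → 0 the L2 term dominates: correlated coefficients are shrunk toward each other, which stabilizes estimates under multicollinearity but selects nothing. Intermediate α tends to keep groups of correlated channels in the model together while still zeroing weak campaign encodings.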
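A sketch of the K-fold out-of-fold target encoding mentioned above. The function name `oof_target_encode`, the smoothing scheme, and the random fold assignment are illustrative choices, not a prescribed implementation; a production pipeline would assign folds by time, consistent with the time-based CV requirement.

```python
import numpy as np

def oof_target_encode(categories, y, n_splits=5, smoothing=10.0, seed=0):
    """K-fold out-of-fold target encoding: each row's encoding uses target
    statistics from the OTHER folds only, so no row sees its own target.
    Rare categories are smoothed toward the global mean."""
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    n = len(y)
    global_mean = y.mean()
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=n)  # illustrative: random folds
    encoded = np.full(n, global_mean)
    for k in range(n_splits):
        train = fold != k
        # per-category (sum, count) computed on out-of-fold rows only
        stats = {}
        for c, t in zip(categories[train], y[train]):
            s, cnt = stats.get(c, (0.0, 0))
            stats[c] = (s + t, cnt + 1)
        for i in np.where(fold == k)[0]:
            s, cnt = stats.get(categories[i], (0.0, 0))
            # smoothed mean: shrinks toward global_mean when cnt is small
            encoded[i] = (s + smoothing * global_mean) / (cnt + smoothing)
    return encoded
```

For scoring new users, a final encoder fit on all training rows (same smoothing) replaces the fold machinery; the out-of-fold version exists only to produce leakage-free training features.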
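For the calibrated 95% intervals, split conformal prediction is the simplest baseline to compare against OLS analytic intervals: it needs only absolute residuals on a held-out calibration set and gives finite-sample marginal coverage under exchangeability, with no homoskedasticity or Gaussianity assumption. A minimal sketch (the function name and symmetric-residual score are assumptions; CLV data would usually call for a normalized or quantile-based score instead):

```python
import numpy as np

def split_conformal_interval(calib_residuals, preds_new, alpha=0.05):
    """Split conformal intervals: q is the ceil((n+1)(1-alpha))-th order
    statistic of |calibration residuals|; intervals are pred +/- q and have
    ~(1 - alpha) marginal coverage if calibration and test rows are
    exchangeable."""
    r = np.sort(np.abs(np.asarray(calib_residuals)))
    n = len(r)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1  # 0-indexed order statistic
    q = r[min(k, n - 1)]
    preds_new = np.asarray(preds_new)
    return preds_new - q, preds_new + q
```

Note the contrast with OLS analytic intervals, which are valid only under correct specification and homoskedastic Gaussian errors; with the heavy-tailed, heteroskedastic CLV_90 target, the conformal width is typically wider but its coverage is honest.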