Design and diagnose a regression pipeline
Company: Voleon Group
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must predict 90-day customer value CLV_90 (many zeros, heavy right tail) at the user level from features collected up to a cutoff date: continuous spend by marketing channel (Spend_SEM, Spend_Social, Spend_Display), recency/frequency/monetary (RFM) metrics, device, region, tenure, and dozens of sparse, high-cardinality campaign IDs. Known issues: strong multicollinearity among spend channels, heteroskedastic errors, and nonlinear effects. Design a regression pipeline that:

1. Chooses an appropriate loss/distribution (e.g., log-link GLM, Tweedie, zero-inflated, quantile regression, or gradient-boosted regression) and justifies the choice.
2. Performs feature processing: log1p transforms, standardization, rare-category bucketing, and target encoding for high-cardinality features with K-fold out-of-fold encoding to avoid leakage.
3. Handles multicollinearity and feature selection (ridge/lasso/elastic net): write the elastic net objective with α and λ, and explain when each penalty dominates.
4. Sets up time-based nested cross-validation that prevents leakage from future signals and campaign overlap.
5. Diagnoses model assumptions: detect and address heteroskedasticity (e.g., via White-robust standard errors, variance-stabilizing transforms, or modeling the mean-variance relationship in a Tweedie GLM), nonlinearity (splines/interactions), and influential outliers (Huber/Tukey loss).
6. Provides calibrated 95% prediction intervals for CLV_90 (e.g., conformal prediction or bootstrap) and compares them to OLS analytic intervals.
7. Interprets effects: computes and interprets standardized coefficients for a regularized linear model versus SHAP values for a tree model.
8. Quantifies and mitigates data leakage risks (e.g., target leakage via post-cutoff features, look-ahead in target encoding).
9. Contrasts performance and interpretability trade-offs between a regularized linear model and a gradient-boosting model, specifying metrics robust to heavy tails (e.g., MAE, quantile loss, MAPE with an epsilon floor).
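For (3), in the glmnet/scikit-learn style parameterization, the elastic net objective can be written as:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2
+ \lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

Here λ sets the overall penalty strength; as α → 1 the L1 term dominates (lasso: sparse solutions, implicit feature selection, but it tends to pick one of several correlated spend channels arbitrarily), and as α → 0 the L2 term dominates (ridge: no exact zeros, but correlated coefficients are shrunk together, which is the stabilizing behavior wanted for the collinear spend features).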
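A minimal sketch of the out-of-fold target encoding asked for in (2). The helper name `oof_target_encode` and the additive-smoothing scheme (shrinking rare categories toward the fold mean) are illustrative assumptions, not part of the question:

```python
import numpy as np
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding: each row's encoding is computed from
    folds that exclude that row, so the encoding never sees its own target."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    encoded = np.full(len(target), target.mean())
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(target):
        # per-category sums/counts computed on the training fold only
        sums, counts = {}, {}
        for c, y in zip(categories[train_idx], target[train_idx]):
            sums[c] = sums.get(c, 0.0) + y
            counts[c] = counts.get(c, 0) + 1
        fold_mean = target[train_idx].mean()
        for i in val_idx:
            c = categories[i]
            n = counts.get(c, 0)
            # smoothed mean shrinks rare campaign IDs toward the fold mean
            encoded[i] = (sums.get(c, 0.0) + smoothing * fold_mean) / (n + smoothing)
    return encoded
```

For time-ordered CLV data, the same idea would use time-based rather than shuffled folds, so each row is encoded only from earlier data.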
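For (6), a split-conformal sketch on synthetic zero-inflated data. The ridge model, the log1p working scale, the data generator, and the train/calibration/test cut points are all illustrative assumptions; exchangeability is only approximate for time-ordered CLV data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
n = 3000
X = rng.normal(size=(n, 4))
# zero-inflated, heavy-tailed synthetic target (stand-in for CLV_90)
y = np.where(rng.random(n) < 0.6, 0.0,
             np.exp(1.0 + X[:, 0] + 0.5 * rng.standard_normal(n)))

# time-ordered split: train -> calibration -> test (no shuffling, mimicking a cutoff)
tr, cal, te = slice(0, 2000), slice(2000, 2500), slice(2500, 3000)
model = Ridge(alpha=1.0).fit(X[tr], np.log1p(y[tr]))

# conformal scores: absolute residuals on the calibration set (log1p scale)
scores = np.abs(np.log1p(y[cal]) - model.predict(X[cal]))
n_cal = len(scores)
# finite-sample corrected quantile for 95% marginal coverage
q = np.quantile(scores, min(1.0, np.ceil((n_cal + 1) * 0.95) / n_cal))

pred = model.predict(X[te])
# back-transform the interval to the CLV scale; clip at 0 since CLV_90 >= 0
lower = np.expm1(pred - q).clip(min=0.0)
upper = np.expm1(pred + q).clip(min=0.0)
coverage = np.mean((y[te] >= lower) & (y[te] <= upper))
```

Unlike OLS analytic intervals, which assume homoskedastic Gaussian errors, this construction is distribution-free: it only needs a held-out calibration set, which matters here because the target is zero-inflated and heavy-tailed.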
Quick Answer: This question evaluates a data scientist's ability to design and diagnose an end-to-end regression pipeline for a zero-inflated, heavy-tailed target. It probes feature engineering for high-cardinality variables, multicollinearity handling, the choice of loss/distribution and regularization, time-aware validation, uncertainty quantification, interpretability, and leakage mitigation. It is commonly asked in the Machine Learning domain to test practical skill alongside conceptual understanding of regression modeling, distributional assumptions, cross-validation strategy, and performance-versus-interpretability trade-offs.
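The time-aware nested cross-validation mentioned above can be sketched with scikit-learn's TimeSeriesSplit; the hyperparameter grid, the synthetic data, and the assumption that rows are sorted by cutoff date are all illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = np.exp(X[:, 0] + 0.3 * rng.standard_normal(600))  # rows assumed sorted by date

outer = TimeSeriesSplit(n_splits=4)          # outer loop: honest test performance
outer_scores = []
for fit_idx, test_idx in outer.split(X):
    inner = TimeSeriesSplit(n_splits=3)      # inner loop: hyperparameter tuning
    best_alpha, best_err = None, np.inf
    for alpha in (0.01, 0.1, 1.0):
        errs = []
        for tr_pos, val_pos in inner.split(fit_idx):
            tr, val = fit_idx[tr_pos], fit_idx[val_pos]
            m = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=5000)
            m.fit(X[tr], np.log1p(y[tr]))
            # MAE is robust to the heavy right tail
            errs.append(np.mean(np.abs(np.log1p(y[val]) - m.predict(X[val]))))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, np.mean(errs)
    m = ElasticNet(alpha=best_alpha, l1_ratio=0.5, max_iter=5000)
    m.fit(X[fit_idx], np.log1p(y[fit_idx]))
    outer_scores.append(np.mean(np.abs(np.log1p(y[test_idx]) - m.predict(X[test_idx]))))
```

Because every split is forward-in-time (train always precedes validation and test), neither tuning nor evaluation sees post-cutoff signal; any out-of-fold target encoding would be refit inside each training window for the same reason.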