Design and diagnose a regression pipeline
Company: Voleon Group
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You must predict 90-day customer value CLV_90 (many zeros, heavy right tail) at the user level from features collected up to a cutoff date: continuous spend by marketing channel (Spend_SEM, Spend_Social, Spend_Display), recency/frequency/monetary (RFM) metrics, device, region, tenure, and dozens of sparse, high-cardinality campaign IDs. Known issues: strong multicollinearity among spend channels, heteroskedastic errors, and nonlinear effects. Design a regression pipeline that:

1. Chooses an appropriate loss/distribution (e.g., log-link GLM, Tweedie, zero-inflated, quantile regression, or gradient-boosted regression) and justifies the choice.
2. Performs feature processing: log1p transforms, standardization, rare-category bucketing, and target encoding for high-cardinality features with K-fold out-of-fold encoding to avoid leakage.
3. Handles multicollinearity and feature selection (ridge/lasso/elastic net): write the elastic net objective with α and λ, and explain when each penalty dominates.
4. Sets up time-based nested cross-validation that prevents leakage from future signals and campaign overlap.
5. Diagnoses model assumptions: detect and address heteroskedasticity (e.g., via White-robust standard errors, variance-stabilizing transforms, or modeling the mean-variance relationship in a Tweedie GLM), nonlinearity (splines/interactions), and influential outliers (Huber/Tukey loss).
6. Provides calibrated 95% prediction intervals for CLV_90 (e.g., conformal prediction or bootstrap) and compares them to OLS analytic intervals.
7. Interprets effects: computes and interprets standardized coefficients for a regularized linear model versus SHAP values for a tree model.
8. Quantifies and mitigates data leakage risks (e.g., target leakage via post-cutoff features, look-ahead in target encoding).
9. Contrasts performance and interpretability trade-offs between a regularized linear model and a gradient-boosting model, specifying metrics robust to heavy tails (e.g., MAE, quantile loss, MAPE with an epsilon floor).
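For (3), in the glmnet/scikit-learn style parameterization, the elastic net objective can be written as:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top \beta\right)^2
+ \lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

Here λ sets the overall penalty strength; as α → 1 the L1 term dominates (lasso: sparse solutions, implicit feature selection, but it tends to pick one of several correlated spend channels arbitrarily), and as α → 0 the L2 term dominates (ridge: no exact zeros, but correlated coefficients are shrunk together, which is the stabilizing behavior wanted for the collinear spend features).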
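A minimal sketch of the out-of-fold target encoding asked for in (2). The helper name `oof_target_encode` and the additive-smoothing scheme (shrinking rare categories toward the fold mean) are illustrative assumptions, not part of the question:

```python
import numpy as np
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target encoding: each row's encoding is computed from
    folds that exclude that row, so the encoding never sees its own target."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    encoded = np.full(len(target), target.mean())
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(target):
        # per-category sums/counts computed on the training fold only
        sums, counts = {}, {}
        for c, y in zip(categories[train_idx], target[train_idx]):
            sums[c] = sums.get(c, 0.0) + y
            counts[c] = counts.get(c, 0) + 1
        fold_mean = target[train_idx].mean()
        for i in val_idx:
            c = categories[i]
            n = counts.get(c, 0)
            # smoothed mean shrinks rare campaign IDs toward the fold mean
            encoded[i] = (sums.get(c, 0.0) + smoothing * fold_mean) / (n + smoothing)
    return encoded
```

For time-ordered CLV data, the same idea would use time-based rather than shuffled folds, so each row is encoded only from earlier data.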
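For (6), a split-conformal sketch on synthetic zero-inflated data. The ridge model, the log1p working scale, the data generator, and the train/calibration/test cut points are all illustrative assumptions; exchangeability is only approximate for time-ordered CLV data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
n = 3000
X = rng.normal(size=(n, 4))
# zero-inflated, heavy-tailed synthetic target (stand-in for CLV_90)
y = np.where(rng.random(n) < 0.6, 0.0,
             np.exp(1.0 + X[:, 0] + 0.5 * rng.standard_normal(n)))

# time-ordered split: train -> calibration -> test (no shuffling, mimicking a cutoff)
tr, cal, te = slice(0, 2000), slice(2000, 2500), slice(2500, 3000)
model = Ridge(alpha=1.0).fit(X[tr], np.log1p(y[tr]))

# conformal scores: absolute residuals on the calibration set (log1p scale)
scores = np.abs(np.log1p(y[cal]) - model.predict(X[cal]))
n_cal = len(scores)
# finite-sample corrected quantile for 95% marginal coverage
q = np.quantile(scores, min(1.0, np.ceil((n_cal + 1) * 0.95) / n_cal))

pred = model.predict(X[te])
# back-transform the interval to the CLV scale; clip at 0 since CLV_90 >= 0
lower = np.expm1(pred - q).clip(min=0.0)
upper = np.expm1(pred + q).clip(min=0.0)
coverage = np.mean((y[te] >= lower) & (y[te] <= upper))
```

Unlike OLS analytic intervals, which assume homoskedastic Gaussian errors, this construction is distribution-free: it only needs a held-out calibration set, which matters here because the target is zero-inflated and heavy-tailed.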
Quick Answer: This question evaluates a data scientist's ability to design and diagnose an end-to-end regression pipeline for a zero-inflated, heavy-tailed target. It probes feature engineering for high-cardinality variables, multicollinearity handling, the choice of loss/distribution and regularization, time-aware validation, uncertainty quantification, interpretability, and leakage mitigation. It is commonly asked in the Machine Learning domain to test practical skill alongside conceptual understanding of regression modeling, distributional assumptions, cross-validation strategy, and performance-versus-interpretability trade-offs.
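The time-aware nested cross-validation mentioned above can be sketched with scikit-learn's TimeSeriesSplit; the hyperparameter grid, the synthetic data, and the assumption that rows are sorted by cutoff date are all illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = np.exp(X[:, 0] + 0.3 * rng.standard_normal(600))  # rows assumed sorted by date

outer = TimeSeriesSplit(n_splits=4)          # outer loop: honest test performance
outer_scores = []
for fit_idx, test_idx in outer.split(X):
    inner = TimeSeriesSplit(n_splits=3)      # inner loop: hyperparameter tuning
    best_alpha, best_err = None, np.inf
    for alpha in (0.01, 0.1, 1.0):
        errs = []
        for tr_pos, val_pos in inner.split(fit_idx):
            tr, val = fit_idx[tr_pos], fit_idx[val_pos]
            m = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=5000)
            m.fit(X[tr], np.log1p(y[tr]))
            # MAE is robust to the heavy right tail
            errs.append(np.mean(np.abs(np.log1p(y[val]) - m.predict(X[val]))))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, np.mean(errs)
    m = ElasticNet(alpha=best_alpha, l1_ratio=0.5, max_iter=5000)
    m.fit(X[fit_idx], np.log1p(y[fit_idx]))
    outer_scores.append(np.mean(np.abs(np.log1p(y[test_idx]) - m.predict(X[test_idx]))))
```

Because every split is forward-in-time (train always precedes validation and test), neither tuning nor evaluation sees post-cutoff signal; any out-of-fold target encoding would be refit inside each training window for the same reason.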