This question evaluates a candidate's competency in feature engineering and preprocessing for mixed-type tabular data, including handling sparse counts and heavy-tailed monetary features, missingness and zero-inflation, correlated continuous measurements, model-specific scaling needs, and the design of leak-safe pipelines and validation strategies.
You are given a tabular dataset for supervised learning with the following features:

- F1: counts, mostly small integers with many zeros
- F2: monetary amounts in dollars, heavy-tailed
- F3: binary flag
- F4, F5: highly correlated continuous measurements
- y: target

Tasks:

1) Decide exactly which features need standardization or normalization and why; specify the scaler and whether it must be fit on the training set only to avoid leakage.
2) Propose a principled approach for F1 given its many zeros and missing values: imputation options, zero-inflated modeling, or transformations; justify how you will validate the choice.
3) With F4 and F5 strongly correlated (|r| > 0.9), describe three alternative strategies to select or transform features (e.g., VIF thresholding, an L1-penalized model, PCA) and explain how to choose among them with cross-validation while preserving interpretability.
4) For three model families (regularized linear/logistic models, tree-based ensembles, and k-NN), specify exactly how your preprocessing differs and why scale and correlation matter differently for each.
5) Provide a leak-safe sklearn-style pipeline and cross-validation plan that evaluates these choices, including metrics, stratification, and how you would compare competing pipelines statistically.
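To make the expectations for task 5 concrete, here is a minimal sketch of the kind of leak-safe scaffolding a strong answer might build on, assuming scikit-learn. The feature names reuse F1-F5 from the question; the synthetic data, the specific transformers, and the choice of logistic regression are illustrative assumptions, not the required answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data mimicking the question's feature types.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "F1": rng.poisson(0.7, n).astype(float),   # sparse counts, many zeros
    "F2": np.exp(rng.normal(3, 1, n)),         # heavy-tailed dollar amounts
    "F3": rng.integers(0, 2, n),               # binary flag
    "F4": rng.normal(0, 1, n),
})
X["F5"] = X["F4"] + rng.normal(0, 0.1, n)      # |r| > 0.9 with F4
y = (X["F4"] + 0.01 * X["F2"] > 1).astype(int)

pre = ColumnTransformer([
    # log1p tames F2's heavy tail before scaling
    ("money", Pipeline([("log", FunctionTransformer(np.log1p)),
                        ("scale", StandardScaler())]), ["F2"]),
    # F1's imputer is fit inside each training fold, never on the full data
    ("counts", Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())]), ["F1"]),
    ("cont", StandardScaler(), ["F4", "F5"]),
    ("flag", "passthrough", ["F3"]),
])

pipe = Pipeline([("pre", pre),
                 ("clf", LogisticRegression(penalty="l2", max_iter=1000))])

# Because every fitted step lives inside the Pipeline, cross_val_score
# refits imputer, scalers, and model per fold: no validation-fold leakage.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```

A complete answer would extend this skeleton with the candidate's chosen F1 strategy and correlation treatment, then compare competing pipelines on the per-fold scores (e.g., with a paired statistical test) rather than on single point estimates.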