Handle missing data and outliers robustly
Company: OneMain Financial
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You are modeling customer churn with features that include: numeric spend (heavy right tail, ~2% extreme outliers), counts with many zeros, and categorical plan types; missingness is a mix of MAR and MNAR (e.g., high-spend users sometimes omit income).
1) Propose a preprocessing pipeline for both linear models and tree ensembles covering imputation (median, KNN, MICE, model-based), indicator flags, robust scaling, and outlier treatment (winsorization vs robust estimators vs isolation-based filters).
2) Explain when each choice helps or hurts and why (e.g., how winsorization affects logistic vs tree splits; leakage risks in MICE).
3) Outline how you would empirically test the pipeline’s impact on calibration and SHAP explanations without optimistic bias.
4) If ~10% of records are MNAR on a key feature, what modeling or data-collection strategies would you apply to mitigate bias?
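As a rough illustration of the leakage concern raised in the question, the sketch below fits a median imputer, a missing-value indicator, and an upper winsorization cap on training data only, then applies the frozen parameters to held-out rows. The function names (`quantile`, `fit_spend_preprocessor`, `transform_spend`), the toy spend values, and the 98th-percentile cap (chosen to mirror the ~2% outlier rate) are all assumptions made for this example, not part of the original question.

```python
import math
from statistics import median


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def fit_spend_preprocessor(train_values, lower_q=0.0, upper_q=0.98):
    """Learn imputation and winsorization parameters from TRAINING rows only."""
    observed = sorted(v for v in train_values if v is not None)
    return {
        "median": median(observed),
        "lo": quantile(observed, lower_q),
        "hi": quantile(observed, upper_q),
    }


def transform_spend(values, params):
    """Return (winsorized_value, was_missing_flag) pairs using frozen params."""
    rows = []
    for v in values:
        was_missing = 1 if v is None else 0
        filled = params["median"] if v is None else v
        clipped = min(max(filled, params["lo"]), params["hi"])
        rows.append((clipped, was_missing))
    return rows


# Hypothetical toy data: 98 ordinary values, 2 extreme ones, 5 missing.
train = [float(i) for i in range(1, 99)] + [10000.0, 20000.0] + [None] * 5
test = [None, 8000.0, 12.5]

params = fit_spend_preprocessor(train)    # fitted on the training fold only
train_rows = transform_spend(train, params)
test_rows = transform_spend(test, params)  # no test statistics are used
```

The indicator flag keeps the fact of missingness available to the model, which matters when missingness itself may be informative (the MNAR case). Capping mainly changes what a linear model sees, since a single threshold split in a tree is unaffected by monotone transformations of a feature.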
Quick Answer: This question evaluates competency in machine learning preprocessing and robustness, specifically handling missingness mechanisms (MAR vs MNAR), outlier treatment, model-specific feature handling for linear and tree-based algorithms, and empirical assessment of probability calibration and interpretability.