Design classification under missingness and imbalance
Company: CVS Health
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
You are building a binary classifier to predict 30-day readmission from claims and EHR features. Roughly 30% of values are missing, but missingness may be concentrated in specific columns or rows. The positive class rate is 6%.
Describe a concrete plan that addresses all of the following, with justifications and trade-offs:
1) Missing data: How will you diagnose the pattern (MCAR/MAR/MNAR), distinguish row-wise vs column-wise missingness, and decide when to drop rows/columns versus impute? Specify at least two imputation strategies (e.g., simple, model-based, indicator augmentation) and how you'd validate them within a pipeline without leakage.
2) Classification measurement: Define accuracy, precision, recall, specificity, F1, ROC-AUC, PR-AUC, and calibration. Give a real example where recall is the priority (and why) and another where precision is the priority (and why). Explain how threshold choice, cost matrices, or utility curves influence the selected metric.
3) Class imbalance: Propose at least three approaches (e.g., class weights, resampling such as SMOTE/ADASYN/undersampling, threshold moving, focal loss) and explain evaluation changes (PR curves, stratified CV, grouped CV by patient) to avoid optimistic bias.
4) Improving logistic regression: List specific feature-engineering and modeling steps (regularization choices and tuning, interaction terms, nonlinearity via splines, monotonic constraints where applicable, calibration methods such as Platt scaling or isotonic regression, handling collinearity, robust standardization). Describe how you would run ablations and quantify lift vs a baseline.
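As a reference for part 2, the threshold-dependent metrics can be written directly from confusion-matrix counts, and a cost matrix can be turned into a threshold choice. The following is a minimal sketch in plain Python, not a prescribed solution: the helper names (`confusion_at`, `point_metrics`, `min_cost_threshold`), the toy labels and scores, and the cost values are all made up for illustration. ROC-AUC, PR-AUC, and calibration are threshold-free and are not covered here.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int


def confusion_at(y_true, y_score, threshold):
    """Count outcomes when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def safe_div(num, den):
    """Return 0.0 instead of raising when a denominator is empty."""
    return num / den if den else 0.0


def point_metrics(c):
    """Threshold-dependent metrics from one confusion matrix."""
    precision = safe_div(c.tp, c.tp + c.fp)    # of flagged, how many are real
    recall = safe_div(c.tp, c.tp + c.fn)       # of real, how many are flagged
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(c.tn, c.tn + c.fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def min_cost_threshold(y_true, y_score, cost_fn, cost_fp):
    """Pick the observed score that minimizes total misclassification cost."""
    def cost(t):
        c = confusion_at(y_true, y_score, t)
        return cost_fn * c.fn + cost_fp * c.fp
    return min(sorted(set(y_score)), key=cost)


# Toy data, invented for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.35, 0.05, 0.15]

metrics_at_half = point_metrics(confusion_at(y_true, y_score, 0.5))
recall_first = min_cost_threshold(y_true, y_score, cost_fn=10, cost_fp=1)
precision_first = min_cost_threshold(y_true, y_score, cost_fn=1, cost_fp=10)
```

On this toy data, making a missed readmission ten times as costly as a false alarm pulls the chosen threshold down, while reversing the costs pushes it up. That is the mechanism by which a cost matrix decides whether recall or precision is effectively being prioritized.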
Be precise about how you would prevent leakage, the hyperparameter tuning protocol you would follow, and how you would present results to stakeholders.
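One way to make the leakage requirement concrete is to show that imputation statistics are learned from the training fold only and that all rows for a patient stay on the same side of each split. The sketch below uses only the Python standard library and assumes a hypothetical layout: rows are lists of floats with `None` marking a missing value, and `patients` holds one patient ID per row. It uses median imputation with missing-indicator columns, and it does not stratify by label; a real answer would likely use a library pipeline with stratified, grouped folds.

```python
import random
import statistics


def grouped_folds(patient_ids, n_splits, seed=0):
    """Yield (train, test) row indices with each patient kept in one fold."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % n_splits for i, pid in enumerate(unique)}
    for k in range(n_splits):
        train = [i for i, p in enumerate(patient_ids) if fold_of[p] != k]
        test = [i for i, p in enumerate(patient_ids) if fold_of[p] == k]
        yield train, test


def fit_median_imputer(rows):
    """Learn one median per column from the observed values in `rows`."""
    medians = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column is entirely missing in this fold.
        medians.append(statistics.median(observed) if observed else 0.0)
    return medians


def impute_with_indicators(rows, medians):
    """Fill missing values and append one 0/1 missingness flag per column."""
    out = []
    for r in rows:
        filled = [m if v is None else v for v, m in zip(r, medians)]
        flags = [1.0 if v is None else 0.0 for v in r]
        out.append(filled + flags)
    return out


# Hypothetical toy data: two features, three patients with two visits each.
X = [[1.0, None], [2.0, 5.0], [None, 7.0], [4.0, None], [10.0, 1.0], [None, 3.0]]
patients = ["a", "a", "b", "b", "c", "c"]

folds = []
for train_idx, test_idx in grouped_folds(patients, n_splits=3):
    medians = fit_median_imputer([X[i] for i in train_idx])   # train rows only
    X_train = impute_with_indicators([X[i] for i in train_idx], medians)
    X_test = impute_with_indicators([X[i] for i in test_idx], medians)
    folds.append((train_idx, test_idx, X_train, X_test))
```

The same rule extends to every fitted step (scaling, resampling, feature selection, calibration): fit inside the training fold, apply to the held-out fold, and tune hyperparameters in an inner loop so the outer folds are only used for the final estimate.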
Quick Answer: This question evaluates a data scientist's ability to diagnose and impute missing data, handle severe class imbalance, choose and interpret classification and calibration metrics, and improve logistic regression within an end-to-end predictive pipeline on claims and EHR data. It falls under the Machine Learning domain.
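To give a sense of scale for the 6% positive rate, the short sketch below computes inverse-frequency class weights using the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The function name and the 100-row label list are illustrative assumptions, not part of the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Illustrative labels matching the 6% positive rate in the prompt.
labels = [1] * 6 + [0] * 94
weights = balanced_class_weights(labels)
```

With these weights each class contributes the same total weight to the loss, so a positive example counts roughly 16 times as much as a negative one. Weighting of this kind distorts predicted probabilities, which is one reason calibration should be checked afterwards on data with the original class balance.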