30-Day Readmission Classifier: End-to-End Plan
Context: You are building a binary classifier to predict 30-day readmission using claims and EHR features. About 30% of feature values are missing, possibly concentrated in particular columns or rows. The positive rate (readmitted within 30 days) is 6%.
Describe a concrete, defensible plan that addresses the following, including justifications and trade-offs:
- Missing data
  - Diagnose MCAR/MAR/MNAR, distinguish row-wise vs column-wise missingness, and decide when to drop rows/columns versus impute.
  - Specify at least two imputation strategies (e.g., simple, model-based, indicator augmentation) and how you would validate them in a pipeline without leakage.
- Classification measurement
  - Define accuracy, precision, recall, specificity, F1, ROC-AUC, PR-AUC, and calibration.
  - Give one real example where recall is the priority (and why) and another where precision is the priority (and why).
  - Explain how threshold choice, cost matrices, or utility curves influence the selected metric.
- Class imbalance (6% positives)
  - Propose at least three approaches (e.g., class weights, resampling such as SMOTE/ADASYN/undersampling, threshold moving, focal loss) and explain evaluation changes (PR curves, stratified CV, grouped CV by patient) to avoid optimistic bias.
- Improving logistic regression
  - List specific feature-engineering and modeling steps: regularization choices and tuning, interaction terms, nonlinearity (e.g., splines), monotonic constraints where applicable, calibration methods (Platt/Isotonic), handling collinearity, robust standardization.
  - Describe ablation studies and how you would quantify lift vs a baseline.
Be precise about preventing leakage, hyperparameter tuning protocol, and how you would present results to stakeholders.