Design classification under missingness and imbalance

Q: Design classification under missingness and imbalance

This question evaluates a data scientist's competency in missing data diagnosis and imputation, handling severe class imbalance, choosing and interpreting classification and calibration metrics, and improving logistic regression as part of an end-to-end predictive pipeline on claims and EHR data, and it falls under the Machine Learning domain.

Q: How do I approach Machine Learning interview questions?

Machine Learning questions require understanding of core concepts and practice. PracHub provides solutions with explanations to help you master machine learning interviews.

Question

30-Day Readmission Classifier: End-to-End Plan

Context: You are building a binary classifier to predict 30-day readmission using claims and EHR features. About 30% of feature values are missing, possibly concentrated by column or row. The positive rate (readmitted within 30 days) is 6%.

Describe a concrete, defensible plan that addresses the following, including justifications and trade-offs:

Missing data

Diagnose MCAR/MAR/MNAR, distinguish row-wise vs column-wise missingness, and decide when to drop rows/columns versus impute.
Specify at least two imputation strategies (e.g., simple, model-based, indicator augmentation) and how you would validate them in a pipeline without leakage.

Classification measurement

Define accuracy, precision, recall, specificity, F1, ROC-AUC, PR-AUC, and calibration.
Give one real example where recall is the priority (and why) and another where precision is the priority (and why).
Explain how threshold choice, cost matrices, or utility curves influence the selected metric.

Class imbalance (6% positives)

Propose at least three approaches (e.g., class weights, resampling such as SMOTE/ADASYN/undersampling, threshold moving, focal loss) and explain evaluation changes (PR curves, stratified CV, grouped CV by patient) to avoid optimistic bias.

Improving logistic regression

List specific feature-engineering and modeling steps: regularization choices and tuning, interaction terms, nonlinearity (e.g., splines), monotonic constraints where applicable, calibration methods (Platt/Isotonic), handling collinearity, robust standardization.
Describe ablation studies and how you would quantify lift vs a baseline.
Be precise about preventing leakage, hyperparameter tuning protocol, and how you would present results to stakeholders.

Design classification under missingness and imbalance

30-Day Readmission Classifier: End-to-End Plan

Solution

Comments (0)

Design classification under missingness and imbalance

Overview

30-Day Readmission Classifier: End-to-End Plan

Solution

Comments (0)