PracHub
QuestionsPremiumLearningGuidesInterview PrepNEWCoaches
|Home/Machine Learning/CVS Health

Design classification under missingness and imbalance

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's competency in missing data diagnosis and imputation, handling severe class imbalance, choosing and interpreting classification and calibration metrics, and improving logistic regression as part of an end-to-end predictive pipeline on claims and EHR data, and it falls under the Machine Learning domain.

  • hard
  • CVS Health
  • Machine Learning
  • Data Scientist

Design classification under missingness and imbalance

Company: CVS Health

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

You are building a binary classifier to predict 30-day readmission from claims and EHR features. Roughly 30% of values are missing, but missingness may be concentrated in specific columns or rows. The positive class rate is 6%. Describe a concrete plan that addresses all of the following, with justifications and trade-offs: 1) Missing data: How will you diagnose the pattern (MCAR/MAR/MNAR), distinguish row-wise vs column-wise missingness, and decide when to drop rows/columns versus impute? Specify at least two imputation strategies (e.g., simple, model-based, indicator augmentation) and how you'd validate them within a pipeline without leakage. 2) Classification measurement: Define accuracy, precision, recall, specificity, F1, ROC-AUC, PR-AUC, and calibration. Give a real example where recall is the priority (and why) and another where precision is the priority (and why). Explain how threshold choice, cost matrices, or utility curves influence the selected metric. 3) Class imbalance: Propose at least three approaches (e.g., class weights, resampling such as SMOTE/ADASYN/undersampling, threshold moving, focal loss) and explain evaluation changes (PR curves, stratified CV, grouped CV by patient) to avoid optimistic bias. 4) Improving logistic regression: List specific feature-engineering and modeling steps (regularization choices and tuning, interaction terms, nonlinearity via splines, monotonic constraints where applicable, calibration methods like Platt/Isotonic, handling collinearity, robust standardization). Describe how you would run ablations and quantify lift vs a baseline. Be precise about preventing leakage, hyperparameter tuning protocol, and how you would present results to stakeholders.

Quick Answer: This question evaluates a data scientist's competency in missing data diagnosis and imputation, handling severe class imbalance, choosing and interpreting classification and calibration metrics, and improving logistic regression as part of an end-to-end predictive pipeline on claims and EHR data, and it falls under the Machine Learning domain.

Related Interview Questions

  • Build a leak-free sklearn churn pipeline - CVS Health (medium)
  • Handle challenges in MMM/MMX - CVS Health (hard)
  • Tune classifier and compute key metrics - CVS Health (medium)
  • Build an uplift model for targeting - CVS Health (hard)
  • Implement R² and Compare PCA With/Without Scaling - CVS Health (medium)
CVS Health logo
CVS Health
Oct 13, 2025, 9:49 PM
Data Scientist
Technical Screen
Machine Learning
2
0

30-Day Readmission Classifier: End-to-End Plan

Context: You are building a binary classifier to predict 30-day readmission using claims and EHR features. About 30% of feature values are missing, possibly concentrated by column or row. The positive rate (readmitted within 30 days) is 6%.

Describe a concrete, defensible plan that addresses the following, including justifications and trade-offs:

  1. Missing data
  • Diagnose MCAR/MAR/MNAR, distinguish row-wise vs column-wise missingness, and decide when to drop rows/columns versus impute.
  • Specify at least two imputation strategies (e.g., simple, model-based, indicator augmentation) and how you would validate them in a pipeline without leakage.
  1. Classification measurement
  • Define accuracy, precision, recall, specificity, F1, ROC-AUC, PR-AUC, and calibration.
  • Give one real example where recall is the priority (and why) and another where precision is the priority (and why).
  • Explain how threshold choice, cost matrices, or utility curves influence the selected metric.
  1. Class imbalance (6% positives)
  • Propose at least three approaches (e.g., class weights, resampling such as SMOTE/ADASYN/undersampling, threshold moving, focal loss) and explain evaluation changes (PR curves, stratified CV, grouped CV by patient) to avoid optimistic bias.
  1. Improving logistic regression
  • List specific feature-engineering and modeling steps: regularization choices and tuning, interaction terms, nonlinearity (e.g., splines), monotonic constraints where applicable, calibration methods (Platt/Isotonic), handling collinearity, robust standardization.
  • Describe ablation studies and how you would quantify lift vs a baseline.
  • Be precise about preventing leakage, hyperparameter tuning protocol, and how you would present results to stakeholders.

Solution

Show

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More Machine Learning•More CVS Health•More Data Scientist•CVS Health Data Scientist•CVS Health Machine Learning•Data Scientist Machine Learning
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.