Missing Data, Imbalance, And Data Quality
Asked of: Data Scientist
What's being tested
Candidates must show practical competence diagnosing and handling missing data, class imbalance, and data-quality issues inside a predictive workflow. Interviewers look for safe, leak-free pipeline construction, appropriate imputation/resampling choices, and correct metric selection and interpretation for skewed classes — all from a data-scientist (analysis/modeling) lens.
Patterns & templates
- Temporal / group-safe splitting: use time-based holdouts or `GroupKFold` to prevent future or cross-group leakage; the split itself costs only an O(n) data copy. For time-based splits, check that max(event_time) in each training fold is earlier than the start of its validation window.
- Scikit-learn pipeline pattern: `Pipeline` + `ColumnTransformer` + `SimpleImputer` / `StandardScaler` for a reproducible fit/transform lifecycle.
- Missingness strategy matrix: use `SimpleImputer` for MCAR, `IterativeImputer` or another model-based imputer for MAR, and flag missingness with indicator columns for MNAR.
- Imputation pragmatics: prefer the median for skewed numeric features and the mode for categorical ones; always fit imputers on the training split only (`fit` on train, `transform` on validation and test).
- Imbalance handling: work evaluation-first, choosing the metric before the remedy. Start with `class_weight='balanced'`; if you resample with `SMOTE` or undersampling, do it after the train/validation split and on training data only to avoid leakage. Reweighting and resampling distort predicted probabilities, so recalibrate with `CalibratedClassifierCV` when probabilities matter.
- Metrics for skew: report `roc_auc_score` plus `average_precision_score` (PR AUC); calibrate probabilities and report the Brier score (`brier_score_loss`) to check calibration.
- Data audit checklist: `pd.to_datetime`, `astype` casts, `isnull().sum()`, `describe()`; in SQL use `COALESCE` and explicit casts to avoid silent truncation.
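Several of the patterns above fit together in one short script. The following is a minimal sketch, not a reference implementation: the data is synthetic, and the column names (`customer_id`, `tenure`, `spend`, `plan`) and the choice of `LogisticRegression` are assumptions made for illustration. It shows imputers living inside a `Pipeline` so they are refit on each training fold, a missingness indicator on the numeric columns, class weighting, and `GroupKFold` so that no customer appears on both sides of a split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic, hypothetical data: every column name here is illustrative.
df = pd.DataFrame({
    "customer_id": rng.integers(0, 400, size=n),    # grouping key
    "tenure": rng.exponential(24.0, size=n),         # skewed numeric feature
    "spend": rng.normal(100.0, 30.0, size=n),
    "plan": rng.choice(["basic", "plus", "pro"], size=n).astype(object),
})

# Rare positive class, weakly driven by short tenure.
logit = -2.2 - 0.04 * df["tenure"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Inject missing values after the label is generated.
df.loc[rng.random(n) < 0.15, "tenure"] = np.nan
df.loc[rng.random(n) < 0.10, "plan"] = np.nan

numeric = ["tenure", "spend"]
categorical = ["plan"]

preprocess = ColumnTransformer([
    # Median imputation plus an indicator column for imputed values.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric),
    # Mode imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Group-safe CV: all rows for a customer stay in the same fold.
cv = GroupKFold(n_splits=5)
scores = cross_validate(
    model, df[numeric + categorical], y,
    groups=df["customer_id"], cv=cv,
    scoring=["roc_auc", "average_precision"],
)

print("ROC AUC per fold:", scores["test_roc_auc"].round(3))
print("PR AUC per fold: ", scores["test_average_precision"].round(3))
```

Because the preprocessing is part of the estimator passed to `cross_validate`, the medians, modes, and scaling statistics are computed from each training fold alone. Fitting the imputer on the full frame before cross-validation would reintroduce the pooled-statistics leak.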
Common pitfalls
Pitfall: Imputing using test+train pooled statistics — produces optimistic performance and feature leakage.
Pitfall: Relying solely on ROC AUC for 1% positive-rate problems — PR AUC and calibration matter more.
Pitfall: Running `SMOTE` or feature engineering before splitting, so that synthetic samples built from validation rows, or statistics computed on them, leak into training and inflate validation scores.
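The second and third pitfalls can be illustrated together. This is a rough sketch on synthetic data from `make_classification`. It uses plain random oversampling via `sklearn.utils.resample` as a stand-in for `SMOTE`, purely to avoid an extra dependency; the ordering rule (split first, resample only the training part) is the same for either technique. The model and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic problem with roughly 1% positives.
X, y = make_classification(
    n_samples=20000, n_features=10, n_informative=4,
    weights=[0.99], flip_y=0.0, random_state=0,
)

# Split first, so the validation set keeps the true positive rate.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Oversample positives inside the training split only.
pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
X_bal = np.vstack([neg, pos_up])
y_bal = np.concatenate([
    np.zeros(len(neg), dtype=int),
    np.ones(len(pos_up), dtype=int),
])

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
p_val = clf.predict_proba(X_val)[:, 1]

# A no-skill model scores about 0.5 on ROC AUC but only about the
# positive rate on PR AUC, so the two metrics have different baselines.
print("validation positive rate:  ", y_val.mean())
print("ROC AUC:                   ", roc_auc_score(y_val, p_val))
print("PR AUC (average precision):", average_precision_score(y_val, p_val))

# Training on balanced data shifts predicted probabilities upward,
# which the Brier score and the mean prediction both expose.
print("Brier score:               ", brier_score_loss(y_val, p_val))
print("mean predicted probability:", p_val.mean())
```

Comparing the mean predicted probability with the validation positive rate is a quick check on whether the resampled model needs recalibration before its scores are read as probabilities.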
Practice these
The practice cards below cover the canonical variants — solve all of them and time yourself.
Practice questions
- Build a leak-free sklearn churn pipeline (CVS Health · Data Scientist · Take-home Project · Medium)
- Diagnose a failing campaign (CVS Health · Data Scientist · Technical Screen · Hard)
- Design classification under missingness and imbalance (CVS Health · Data Scientist · Technical Screen · Hard)
- Aggregate radiology spend and derive fiscal month (CVS Health · Data Scientist · Technical Screen · Medium)
- Compute age-band spend and YoY in Georgia (CVS Health · Data Scientist · Technical Screen · Medium)
- Lead structured response to accuracy incident (CVS Health · Data Scientist · Technical Screen · Hard)
- Write SQL for dedup and purchase shares (CVS Health · Data Scientist · Technical Screen · Medium)
Related concepts
- Evaluation, Statistical Inference, And Class Imbalance (Machine Learning)
- Supervised ML, Imbalance, Overfitting, And Optimization (Machine Learning)
- Distribution Interpretation And Data Diagnostics (Machine Learning)
- Platform Integrity: Fake Accounts, Bots, Fraud, And Harmful Content (Analytics & Experimentation)
- Fraud, Bot, And Fake Account Detection
- Experiment Diagnostics, Power And Robust Inference (Statistics & Math)