Missing Data, Imbalance, And Data Quality

What's being tested

Candidates must show practical competence diagnosing and handling missing data, class imbalance, and data-quality issues inside a predictive workflow. Interviewers look for safe, leak-free pipeline construction, appropriate imputation/resampling choices, and correct metric selection and interpretation for skewed classes — all from a data-scientist (analysis/modeling) lens.

Patterns & templates

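Temporal / group-safe splitting: use time-based holdouts (or `TimeSeriesSplit`) to prevent future leakage and `GroupKFold` to keep all rows of one entity in a single fold; materializing a fold is an O(n) data copy, and for each fold max(event_time) in train should precede min(event_time) in validation.
Scikit-learn pipeline pattern: `Pipeline` + `ColumnTransformer` + `SimpleImputer`/`StandardScaler`, so every preprocessing step is fit on training data and only applied (`transform`) to validation and test data.
Missingness strategy matrix: `SimpleImputer` is adequate for MCAR; `IterativeImputer` (an experimental scikit-learn estimator, enabled with `from sklearn.experimental import enable_iterative_imputer`) or other model-based imputation suits MAR; for MNAR, add missingness indicator columns (`add_indicator=True` or `MissingIndicator`), because the fact that a value is missing is itself informative.
Imputation pragmatics: prefer the median for skewed numeric features and the mode for categorical ones; always `fit` imputers on the training split only, then `transform` validation and test.
Imbalance handling: evaluate first, then intervene. Start with `class_weight='balanced'` or decision-threshold tuning; apply `SMOTE`/undersampling only to the training portion of each split, never to validation or test data; resampling shifts predicted probabilities, so recalibrate (for example with `CalibratedClassifierCV`) when probabilities matter.
Metrics for skew: report `roc_auc_score` plus `average_precision_score` (a PR AUC summary); calibrate probabilities and report the Brier score (`brier_score_loss`).
Data audit checklist: `pd.to_datetime`, `astype` casts, `isnull().sum()`, `describe()`; in SQL, use `COALESCE` to make NULL handling explicit and explicit `CAST`s to avoid silent truncation or implicit type conversion.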
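
The splitting, pipeline, imputation, and class-weight patterns above can be combined roughly as follows. This is a minimal sketch, not a reference implementation: the toy data, the column names (`customer_id`, `income`, `age`, `plan`), and the missingness rates are all invented for illustration, and logistic regression is just a convenient stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: names and distributions are assumptions for this sketch.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "customer_id": rng.integers(0, 60, size=n),         # grouping key, not a feature
    "income": rng.lognormal(mean=10, sigma=1, size=n),  # skewed numeric
    "age": rng.normal(40, 12, size=n),
    "plan": rng.choice(["basic", "plus", "pro"], size=n),
})
y = (rng.random(n) < 0.1).astype(int)  # roughly 10% positives

# Inject missing values so the imputers have something to do.
df.loc[rng.random(n) < 0.2, "income"] = np.nan
df.loc[rng.random(n) < 0.1, "plan"] = np.nan

groups = df["customer_id"].to_numpy()
X = df.drop(columns="customer_id")

# Median imputation plus missingness indicators for numeric columns,
# mode imputation plus one-hot encoding for categorical columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", categorical, ["plan"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# GroupKFold keeps every customer's rows in a single fold; because the
# preprocessing lives inside the pipeline, it is refit on each training fold.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, X, y, groups=groups, cv=cv, scoring="average_precision"
)
print(scores)
```

Because the imputers and scaler sit inside the `Pipeline`, `cross_val_score` never lets validation rows influence the imputation statistics. For time-ordered data, the same pipeline could be evaluated with a time-based splitter instead of `GroupKFold`.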

Common pitfalls

Pitfall: Imputing with statistics pooled over train and test. Test-set information leaks into the fitted features, so measured performance is optimistic.

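Pitfall: Relying solely on ROC AUC for problems with a 1% positive rate. ROC AUC can stay high while precision is poor, because the large negative class keeps the false-positive rate low; PR AUC and calibration are more informative.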

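Pitfall: Running `SMOTE` or fitted feature engineering (scaling, target encoding, feature selection) before splitting. Synthetic samples derived from validation rows, or statistics computed on them, inflate validation scores.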
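
The metric and resampling pitfalls can be explored with a small experiment. The sketch below makes several assumptions: the data is synthetic (`make_classification` with about 1% positives), plain random oversampling via `sklearn.utils.resample` stands in for `SMOTE` so that no extra package is needed, and the helper `evaluate` is a name introduced here. The point it illustrates is the order of operations: split first, resample the training split only, and score on an untouched validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data with about 1% positives (parameters chosen only for illustration).
X, y = make_classification(
    n_samples=20000, n_features=10, n_informative=5,
    weights=[0.99], flip_y=0.0, random_state=0,
)

# Split first, so the validation set keeps the real class ratio.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

def evaluate(model):
    """Score a fitted model on the untouched validation split."""
    p = model.predict_proba(X_va)[:, 1]
    return {
        "roc_auc": roc_auc_score(y_va, p),
        "pr_auc": average_precision_score(y_va, p),
        "brier": brier_score_loss(y_va, p),
        "mean_pred": p.mean(),
    }

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Randomly oversample the minority class in the training split only.
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
idx = np.concatenate([neg, pos_up])
oversampled = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

results = {"baseline": evaluate(baseline), "oversampled": evaluate(oversampled)}
print(f"validation base rate: {y_va.mean():.4f}")
for name, metrics in results.items():
    print(name, {k: round(float(v), 4) for k, v in metrics.items()})
```

When reading the output, compare `roc_auc` with `pr_auc` for each model, and compare `mean_pred` with the validation base rate: a model trained on a rebalanced sample tends to predict probabilities well above the true prevalence, which is why recalibration is worth considering after resampling.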

Practice these

The practice cards below cover the canonical variants — solve all of them and time yourself.
