Scikit-Learn Classification Pipelines

What's being tested

Candidates must demonstrate building leak-free, reproducible classification pipelines using `scikit-learn`: correct temporal splitting to avoid future information leakage; preprocessing (imputation, encoding, scaling) inside a pipeline; handling missingness and class imbalance; choosing evaluation metrics (ROC AUC, PR AUC, Brier score) and applying calibration and hyperparameter tuning without overfitting. Interviewers at CVS Health care about robust, auditable models for sensitive business and clinical decisions where leakage, miscalibration, or biased evaluation can cause wrong actions and poor patient/member outcomes.

Core knowledge

Temporal leakage — any feature that encodes information unavailable at prediction time (e.g., fields recorded after the label's anchor date), or any aggregation computed over the full history, will inflate validation performance; always split by time and compute aggregations from past-only windows.
Pipeline and ColumnTransformer — use `Pipeline` to chain transformers with a final estimator so that, in each CV split, preprocessing is fitted on the training portion only and merely applied to the validation portion; use `ColumnTransformer` to route numeric and categorical columns through different transforms without leaking preprocessing statistics.
Imputation patterns — distinguish MCAR/MAR/MNAR; prefer `SimpleImputer` or model-based imputation inside the pipeline; include missing indicator features (`add_indicator=True`) when missingness is informative.
Encoding categorical vars — use `OneHotEncoder(handle_unknown='ignore')` for nominals; `OrdinalEncoder` only if order is meaningful; avoid leaking rare-category statistics computed across folds.
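Scaling and regularization — apply `StandardScaler` inside the pipeline before regularized models (`LogisticRegression` with `penalty='l2'`), because the penalty is sensitive to feature scale; the scaler must be fitted on the training fold only.
Class imbalance tactics — prefer `class_weight='balanced'` or another penalized loss as a first step; use resampling (`SMOTE`, random oversampling) carefully and only on the training portion of each fold (for example via the `Pipeline` from `imbalanced-learn`, since scikit-learn's own `Pipeline` does not accept samplers) so that synthetic samples never reach validation data. Reweighting and resampling both shift predicted probabilities toward the minority class, so check calibration afterwards.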
Calibration & probabilities — evaluate calibration with Brier score: $\text{Brier} = \frac{1}{N}\sum_i(\hat p_i - y_i)^2$ and reliability diagrams; use `CalibratedClassifierCV` (sigmoid/isotonic) nested inside CV to avoid overfitting calibration.
Evaluation metrics — use PR AUC when positives are rare (gives precision/recall tradeoff); ROC AUC can be misleading with extreme imbalance; report both and calibration metrics, plus business thresholds (precision@k).
Cross-validation designs — for churn/time-series problems, use `TimeSeriesSplit` or a holdout by customer cohort/time window; tune hyperparameters with nested CV to avoid optimistic estimates.
Feature engineering safely — compute aggregations (e.g., recency, frequency) with explicit lookback windows; store metadata (window length, anchor time) so features are reproducible at scoring time.
Threshold selection & cost-sensitivity — map threshold to business costs: choose threshold by maximizing expected utility or by precision/recall at the operating point rather than raw accuracy.
Operational considerations that matter for DS — keep a backtest (chronological holdout) that mirrors deployment horizon; monitor population and calibration drift over time post-deployment.
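
The preprocessing points above can be combined into one estimator. What follows is a minimal sketch, not a reference implementation: the data is synthetic, the column names (`tenure_months`, `monthly_spend`, `plan`) are hypothetical, and rows are assumed to be sorted by anchor date so that a positional split is also a chronological one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, size=n).astype(float),
    "monthly_spend": rng.normal(50, 15, size=n),
    "plan": rng.choice(["basic", "plus", "premium"], size=n),
})
df.loc[rng.random(n) < 0.1, "monthly_spend"] = np.nan  # inject missing values
y = (rng.random(n) < 0.2).astype(int)

numeric = ["tenure_months", "monthly_spend"]
categorical = ["plan"]

preprocess = ColumnTransformer([
    # Median imputation plus a missing-indicator column, then scaling.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categories unseen at fit time are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    # LogisticRegression applies L2 regularization by default.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Assumes rows are time-ordered: fit on the earlier 300, score the later 100.
model.fit(df.iloc[:300], y[:300])
proba = model.predict_proba(df.iloc[300:])[:, 1]

# A category never seen during fit does not break scoring.
new_row = pd.DataFrame(
    {"tenure_months": [12.0], "monthly_spend": [np.nan], "plan": ["enterprise"]}
)
unseen_proba = model.predict_proba(new_row)
```

Because every transform lives inside `model`, passing it to a CV utility refits imputation, encoding, and scaling on each training split automatically.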

Worked example — Build a leak-free sklearn churn pipeline

First 30 seconds: ask clarifying questions about the label definition and its lookback/horizon (e.g., churn within 30 days), which event timestamps are available, and whether customers can appear in multiple folds. Frame the solution around three pillars: (1) data splitting: create a chronological train/validation/test split by anchor date or customer cohort, avoiding overlap; (2) pipeline composition: construct a `ColumnTransformer` that imputes (`SimpleImputer` with indicator), encodes (`OneHotEncoder(handle_unknown='ignore')`), and scales (`StandardScaler`), then chain it with a classifier in a `Pipeline`; (3) evaluation and calibration: use nested `TimeSeriesSplit` CV for hyperparameter search, evaluate with both PR AUC and ROC AUC, and apply `CalibratedClassifierCV` if probabilities are used for risk scoring. A concrete tradeoff to call out: oversampling (e.g., `SMOTE`) can improve recall but often harms probability calibration; `class_weight` is simpler and creates no synthetic rows, but it also shifts predicted probabilities, so recalibrate whenever the scores are read as risks. Close by saying that, with more time, you would add feature lookback validation, business-cost-based threshold selection, and monitoring of post-deployment calibration drift.
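
The splitting and evaluation pillars might look like the sketch below. It assumes synthetic data whose rows are already sorted by anchor date, uses a single chronological holdout as the outer evaluation layer with `TimeSeriesSplit` as the inner tuning layer, and picks an arbitrary grid for `C`; none of these choices come from a real churn dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; rows are assumed to be sorted by anchor date.
rng = np.random.default_rng(42)
n = 1200
X = rng.normal(size=(n, 5))
logit = -2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Outer layer: the most recent 20% is a chronological holdout, untouched by tuning.
cut = int(n * 0.8)
X_dev, y_dev = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner layer: every validation fold lies strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=4)
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=tscv,
    scoring="average_precision",  # PR AUC estimate
)
search.fit(X_dev, y_dev)

# The refitted best pipeline is scored once on the holdout.
p_test = search.predict_proba(X_test)[:, 1]
metrics = {
    "pr_auc": average_precision_score(y_test, p_test),
    "roc_auc": roc_auc_score(y_test, p_test),
    "brier": brier_score_loss(y_test, p_test),
}
print(search.best_params_, metrics)
```

The holdout numbers are the ones to report; `search.best_score_` comes from the folds used to choose `C` and is therefore optimistic.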

A second angle — Design classification under missingness and imbalance

Start by diagnosing missingness patterns (per-feature missing rates and correlation with label). For MAR patterns, use model-based or KNN imputation inside the pipeline and keep missing indicators; for MNAR, consult domain owners — sometimes missingness itself is a predictive signal. For severe class imbalance, prioritize metrics that reflect operational goals: PR AUC and precision at fixed recall or top-k precision. Prefer calibrated `LogisticRegression` with regularization and `class_weight` to produce reliable probabilities; if resampling is used, do it only inside the training fold and follow with calibration on a validation fold. This reframes the same pipeline principles but emphasizes diagnostic steps and calibration-first choices given noisy/missing data.
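
The diagnostic step can be sketched as a small report of per-feature missing rates and the label rate with and without the value present. The helper `missingness_report` and the columns `income` and `last_visit_days` are hypothetical, and the data is simulated so that one feature is missing more often for positives.

```python
import numpy as np
import pandas as pd

# Simulated data: "income" goes missing more often for positives (informative),
# while "last_visit_days" goes missing independently of the label.
rng = np.random.default_rng(1)
n = 2000
y = (rng.random(n) < 0.1).astype(int)
df = pd.DataFrame({
    "income": rng.normal(size=n),
    "last_visit_days": rng.normal(size=n),
})
df.loc[rng.random(n) < np.where(y == 1, 0.5, 0.1), "income"] = np.nan
df.loc[rng.random(n) < 0.2, "last_visit_days"] = np.nan


def missingness_report(features: pd.DataFrame, labels) -> pd.DataFrame:
    """Per-feature missing rate and label rate split by missing vs. present."""
    is_na = features.isna()
    labels = pd.Series(labels, index=features.index)
    return pd.DataFrame({
        "missing_rate": is_na.mean(),
        "label_rate_when_missing": is_na.apply(lambda col: labels[col].mean()),
        "label_rate_when_present": is_na.apply(lambda col: labels[~col].mean()),
    })


report = missingness_report(df, y)
print(report.round(3))
```

A large gap between the two label-rate columns suggests the missingness itself carries signal, which is the case where a missing indicator is worth keeping. In a real project this report should be computed on training data only.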

Common pitfalls

Pitfall: fitting preprocessing outside cross-validation
A common error is imputing or scaling on the full dataset before CV, leaking summary statistics. Always encapsulate transforms inside a `Pipeline` so each fold learns transforms from training data only.
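
One way to convince yourself that a pipeline keeps preprocessing inside each fold is to inspect the fitted fold estimators. This sketch, on synthetic data, checks that each fold's `StandardScaler` learned its mean from that fold's training rows only; the use of `TimeSeriesSplit` here is an assumption chosen to match the temporal setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(loc=5.0, scale=2.0, size=(300, 3))
y = (X[:, 0] + rng.normal(size=300) > 5.0).astype(int)

# Scaling lives inside the pipeline, so it is refitted on every training split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = TimeSeriesSplit(n_splits=4)
res = cross_validate(pipe, X, y, cv=cv, scoring="roc_auc", return_estimator=True)

# TimeSeriesSplit is deterministic, so split() yields the same folds again.
fold_means_match = []
for est, (train_idx, _) in zip(res["estimator"], cv.split(X)):
    scaler_mean = est.named_steps["standardscaler"].mean_
    fold_means_match.append(np.allclose(scaler_mean, X[train_idx].mean(axis=0)))

print(fold_means_match, res["test_score"].round(3))
```

Calling `StandardScaler().fit_transform(X)` before `cross_validate` would instead give every fold the full-data mean, which is the leak this pitfall describes.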

Pitfall: optimizing for ROC AUC on highly imbalanced data
ROC AUC can mask poor precision for rare positives; report PR AUC and precision@k or thresholded metrics that reflect the action you’ll take (e.g., outreach capacity).
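
Precision@k is not a built-in scikit-learn scorer, so a small helper is usually written by hand. The function below is one possible definition (precision among the `k` highest-scored rows), with `k` standing in for a hypothetical outreach capacity.

```python
import numpy as np


def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored rows."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of rows")
    # Stable sort on negated scores: highest first, ties keep input order.
    top_k = np.argsort(-scores, kind="stable")[:k]
    return float(y_true[top_k].mean())


y_true = np.array([0, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.4, 0.7])

print(precision_at_k(y_true, scores, k=2))  # top two rows are both positives: 1.0
print(precision_at_k(y_true, scores, k=3))  # third-ranked row is a negative: 2/3
```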

Pitfall: ignoring calibration after using resampling
Synthetic oversampling can distort predicted probabilities. If resampling is necessary, explicitly recalibrate (e.g., `CalibratedClassifierCV`) using a validation split consistent with production scoring.
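
The effect can be seen on synthetic data. The sketch below uses `class_weight="balanced"` as a stand-in for resampling (an assumption made to avoid a dependency on `imbalanced-learn`; both approaches inflate minority-class probabilities) and compares the Brier score before and after wrapping the same model in `CalibratedClassifierCV`. The default stratified `cv=3` is used only because the simulated rows have no time structure; a temporal split would be the safer choice on real data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% positives.
rng = np.random.default_rng(3)
n = 4000
X = rng.normal(size=(n, 4))
p_true = 1 / (1 + np.exp(-(-3.0 + 1.5 * X[:, 0])))
y = (rng.random(n) < p_true).astype(int)
X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]


def make_weighted_model():
    # Reweighting pushes predicted probabilities toward the minority class.
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


# Uncalibrated: probabilities straight from the reweighted model.
raw = make_weighted_model().fit(X_train, y_train)
p_raw = raw.predict_proba(X_test)[:, 1]

# Calibrated: a sigmoid map is fitted on held-out folds of the training data.
calibrated = CalibratedClassifierCV(make_weighted_model(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)
p_cal = calibrated.predict_proba(X_test)[:, 1]

brier_raw = brier_score_loss(y_test, p_raw)
brier_cal = brier_score_loss(y_test, p_cal)
print(f"base rate={y_test.mean():.3f}  mean raw={p_raw.mean():.3f}  mean cal={p_cal.mean():.3f}")
print(f"Brier raw={brier_raw:.4f}  Brier calibrated={brier_cal:.4f}")
```

Ranking metrics such as ROC AUC are largely unaffected by this kind of shift, which is why the problem goes unnoticed unless a calibration metric is reported.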

Connections

Interviewers may pivot to model monitoring and drift detection (population shift, calibration drift) or to causal inference/experiment design for measuring the impact of interventions on churn. They might also ask about moving the pipeline to production and what metrics you'd monitor post-deployment.
