PracHub

Build a leak-free sklearn churn pipeline

Last updated: Mar 29, 2026

Quick Overview

This question tests practical skill in building a reproducible scikit-learn classification pipeline. It covers temporal splitting to avoid leakage, preprocessing, hyperparameter tuning, probability calibration, evaluation with ROC AUC, PR AUC, and an F1-maximizing decision threshold, and permutation feature importance.

Company: CVS Health

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Take-home Project

Oct 13, 2025, 9:49 PM

Take‑Home ML Task: Reproducible Subscription Classification Pipeline

You are given a daily user-level dataset and must build a reproducible Python (scikit‑learn) pipeline to predict whether a user will subscribe in the next 30 days.

Assume the dataset contains one row per user per event_date with these columns:

  • user_id (string)
  • event_date (date)
  • country (categorical)
  • device_type (categorical)
  • sessions_last_7d (int)
  • purchases_last_30d (int)
  • avg_session_secs (float)
  • days_since_signup (int)
  • is_subscribed (0/1 target)

Constraints and requirements:

  1. Temporal split (no leakage):
    • Training data: rows with event_date ≤ 2025-08-25
    • Validation data: rows with event_date from 2025-08-26 through 2025-09-01 (inclusive)
    • Today is 2025-09-01
  2. Preprocessing via ColumnTransformer:
    • Numeric pipeline: SimpleImputer(strategy='median') → StandardScaler()
    • Categorical pipeline: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
  3. Classifier and hyperparameters:
    • LogisticRegression with class_weight='balanced'
    • Tune max_iter and perform a small hyperparameter search over C using StratifiedKFold CV on the training set only
  4. Evaluation on the validation window:
    • Report ROC AUC and PR AUC
    • Choose a decision threshold that maximizes F1 on validation; report precision, recall, and F1 at that threshold
  5. Probability calibration:
    • Use CalibratedClassifierCV (cv=3) fitted on the training set only, so that no validation rows leak into calibration
  6. Feature importance:
    • Compute permutation feature importance on the validation set and list the top 5 features by importance
  7. Briefly explain one potential target leakage risk in this schema and how your pipeline avoids it.
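
Requirements 1 to 3 can be sketched as follows. This is a minimal illustration, not a reference solution: the data comes from a hypothetical make_synthetic helper that only mimics the schema above, and the seed, the grid of C values, the 5-fold setting, the roc_auc scoring, and max_iter=1000 are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42  # assumed seed, fixed for reproducibility
NUM_COLS = ["sessions_last_7d", "purchases_last_30d",
            "avg_session_secs", "days_since_signup"]
CAT_COLS = ["country", "device_type"]
TARGET = "is_subscribed"


def make_synthetic(n=2000, seed=SEED):
    """Hypothetical stand-in for the real dataset (same column names)."""
    rng = np.random.default_rng(seed)
    offsets = pd.to_timedelta(rng.integers(0, 32, size=n), unit="D")
    df = pd.DataFrame({
        "user_id": [f"u{i}" for i in range(n)],
        "event_date": pd.Timestamp("2025-08-01") + offsets,  # Aug 1 .. Sep 1
        "country": rng.choice(["US", "CA", "MX"], size=n),
        "device_type": rng.choice(["ios", "android", "web"], size=n),
        "sessions_last_7d": rng.poisson(3, size=n),
        "purchases_last_30d": rng.poisson(1, size=n),
        "avg_session_secs": rng.gamma(2.0, 60.0, size=n),
        "days_since_signup": rng.integers(0, 365, size=n),
    })
    # Inject a few missing values so the imputer has something to do.
    df.loc[rng.random(n) < 0.05, "avg_session_secs"] = np.nan
    logit = -2.0 + 0.4 * df["sessions_last_7d"] + 0.5 * df["purchases_last_30d"]
    df[TARGET] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return df


def temporal_split(df):
    """Train on rows up to 2025-08-25, validate on 2025-08-26..2025-09-01."""
    train = df[df["event_date"] <= pd.Timestamp("2025-08-25")]
    valid = df[(df["event_date"] >= pd.Timestamp("2025-08-26"))
               & (df["event_date"] <= pd.Timestamp("2025-09-01"))]
    return train, valid


def build_search():
    """Preprocessing + logistic regression wrapped in a small search over C."""
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    pre = ColumnTransformer([("num", numeric, NUM_COLS),
                             ("cat", categorical, CAT_COLS)])
    clf = LogisticRegression(class_weight="balanced", max_iter=1000,
                             random_state=SEED)
    model = Pipeline([("pre", pre), ("clf", clf)])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # assumed grid
    return GridSearchCV(model, grid, scoring="roc_auc", cv=cv)


df = make_synthetic()
train, valid = temporal_split(df)
# user_id and event_date are deliberately left out of the feature matrix.
X_train, y_train = train[NUM_COLS + CAT_COLS], train[TARGET]
search = build_search().fit(X_train, y_train)
print("best C:", search.best_params_["clf__C"])
```

Because the imputers, scaler, and encoder sit inside the Pipeline, GridSearchCV refits them on each training fold, so their statistics are never computed from held-out fold rows or from the validation window.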

Notes

  • Exclude user_id and event_date from model features.
  • Ensure reproducibility (fixed random seeds, deterministic splits).
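
For the evaluation step (requirement 4), one possible way to pick the F1-maximizing threshold is to scan the points returned by precision_recall_curve. The sketch below uses a tiny hand-made set of labels and scores in place of real validation predictions; the function name evaluate_with_best_f1 is hypothetical.

```python
import numpy as np
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)


def evaluate_with_best_f1(y_true, y_score):
    """Return ROC AUC, PR AUC, and metrics at the F1-maximizing threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The last (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return {
        "roc_auc": float(roc_auc_score(y_true, y_score)),
        "pr_auc": float(average_precision_score(y_true, y_score)),
        "threshold": float(thresholds[best]),
        "precision": float(precision[best]),
        "recall": float(recall[best]),
        "f1": float(f1[best]),
    }


# Toy labels and scores standing in for validation-window predictions.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
report = evaluate_with_best_f1(y_true, y_score)
print(report)
```

A row is predicted positive when its score is greater than or equal to the returned threshold. Since the threshold is chosen on the same window it is scored on, the reported F1 is an optimistic estimate.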

Solution

