Build a leak-free sklearn churn pipeline

This Machine Learning question tests the ability to build a reproducible scikit-learn classification pipeline. The title frames it as churn prediction, although the task below predicts subscription. It covers temporal splitting to avoid leakage, preprocessing, hyperparameter tuning, probability calibration, evaluation with ROC AUC, PR AUC, and an F1-maximizing threshold, and permutation feature importance.

Question

Take‑Home ML Task: Reproducible Subscription Classification Pipeline

You are given a daily user-level dataset and must build a reproducible Python (scikit‑learn) pipeline to predict whether a user will subscribe in the next 30 days.

Assume the dataset contains one row per user per event_date with these columns:

user_id (string)
event_date (date)
country (categorical)
device_type (categorical)
sessions_last_7d (int)
purchases_last_30d (int)
avg_session_secs (float)
days_since_signup (int)
is_subscribed (0/1 target)

Constraints and requirements:

Temporal split (no leakage):
- Training data: rows with event_date ≤ 2025‑08‑25
- Validation data: rows with event_date in 2025‑08‑26..2025‑09‑01 (inclusive)
- Today is 2025‑09‑01
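The date boundaries above can be sketched as a small pandas helper. This is a minimal illustration, assuming `event_date` parses with `pd.to_datetime`; the function name `temporal_split` and the four-row demo frame are hypothetical and not part of the task.

```python
import pandas as pd

TRAIN_END = pd.Timestamp("2025-08-25")
VAL_START = pd.Timestamp("2025-08-26")
VAL_END = pd.Timestamp("2025-09-01")


def temporal_split(df: pd.DataFrame):
    """Split rows by event_date into the training and validation windows."""
    dates = pd.to_datetime(df["event_date"])
    train = df[dates <= TRAIN_END]
    val = df[(dates >= VAL_START) & (dates <= VAL_END)]
    return train, val


# Tiny hypothetical frame, only to show which rows land where.
demo = pd.DataFrame(
    {
        "user_id": ["a", "a", "b", "b"],
        "event_date": ["2025-08-20", "2025-08-26", "2025-08-25", "2025-09-02"],
        "is_subscribed": [0, 1, 0, 1],
    }
)
train_df, val_df = temporal_split(demo)
print(len(train_df), len(val_df))
```

Rows dated after the validation window (here 2025-09-02) fall into neither set. Because the split depends only on dates, it is deterministic and needs no random seed.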
Preprocessing via ColumnTransformer:
- Numeric pipeline: SimpleImputer(strategy='median') → StandardScaler()
- Categorical pipeline: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
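A sketch of the preprocessing described above, built from the estimators the task names. The helper name `make_preprocessor` and the demo rows are assumptions made for illustration. `remainder="drop"` is what keeps `user_id` and `event_date` out of the feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["sessions_last_7d", "purchases_last_30d", "avg_session_secs", "days_since_signup"]
CATEGORICAL = ["country", "device_type"]


def make_preprocessor() -> ColumnTransformer:
    """Numeric: median impute then scale. Categorical: mode impute then one-hot."""
    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    categorical = Pipeline(
        [
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    # Columns not listed here (user_id, event_date, target) are dropped.
    return ColumnTransformer(
        [("num", numeric, NUMERIC), ("cat", categorical, CATEGORICAL)],
        remainder="drop",
    )


# Hypothetical rows: missing values in training, an unseen country in validation.
train = pd.DataFrame(
    {
        "country": ["US", "DE", "US", np.nan],
        "device_type": ["ios", "android", "ios", "android"],
        "sessions_last_7d": [3, 0, 5, 1],
        "purchases_last_30d": [1, 0, 2, 0],
        "avg_session_secs": [120.0, np.nan, 300.0, 45.0],
        "days_since_signup": [10, 200, 35, 7],
    }
)
val = pd.DataFrame(
    {
        "country": ["BR"],
        "device_type": ["ios"],
        "sessions_last_7d": [2],
        "purchases_last_30d": [0],
        "avg_session_secs": [90.0],
        "days_since_signup": [14],
    }
)

pre = make_preprocessor()
X_train = pre.fit_transform(train)  # statistics learned from training rows only
X_val = pre.transform(val)          # unseen "BR" becomes an all-zero country block
```

Fitting the transformer on training rows and only calling `transform` on validation rows keeps medians, scaling statistics, and category vocabularies free of validation information.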
Classifier and hyperparameters:
- LogisticRegression with class_weight='balanced'
- Tune max_iter and perform a small hyperparameter search over C using StratifiedKFold CV on the training set only
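One way to set up the search is sketched below. Several choices are assumptions rather than task requirements: the grid of `C` values, `scoring="average_precision"`, five shuffled folds, and treating `max_iter` as a convergence budget fixed at 1000 rather than a searched hyperparameter. The training frame is synthetic stand-in data, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42
NUMERIC = ["sessions_last_7d", "purchases_last_30d", "avg_session_secs", "days_since_signup"]
CATEGORICAL = ["country", "device_type"]

# Synthetic stand-in for the training window (hypothetical data).
rng = np.random.default_rng(SEED)
n = 400
X_train = pd.DataFrame(
    {
        "country": rng.choice(["US", "DE", "IN"], n),
        "device_type": rng.choice(["ios", "android", "web"], n),
        "sessions_last_7d": rng.poisson(3, n),
        "purchases_last_30d": rng.poisson(1, n),
        "avg_session_secs": rng.gamma(2.0, 60.0, n),
        "days_since_signup": rng.integers(0, 365, n),
    }
)
p = 1.0 / (1.0 + np.exp(-(0.6 * X_train["sessions_last_7d"].to_numpy() - 3.0)))
y_train = (rng.random(n) < p).astype(int)

preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), NUMERIC),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), CATEGORICAL),
    ]
)
model = Pipeline(
    [
        ("pre", preprocessor),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=SEED)),
    ]
)

C_GRID = [0.01, 0.1, 1.0, 10.0]  # assumed grid, not specified by the task
search = GridSearchCV(
    model,
    param_grid={"clf__C": C_GRID},
    scoring="average_precision",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED),
)
search.fit(X_train, y_train)  # validation rows are never passed in
print(search.best_params_, round(search.best_score_, 3))
```

Because the preprocessor sits inside the pipeline, each CV fold refits imputation and scaling on its own training portion, so no fold sees statistics from its held-out part.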
Evaluation on the validation window:
- Report ROC AUC and PR AUC
- Choose a decision threshold that maximizes F1 on validation; report precision, recall, and F1 at that threshold
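The evaluation step can be sketched with `sklearn.metrics` alone. Here PR AUC is taken to mean average precision, which is one common reading and an assumption on my part. The helper `best_f1_threshold` and the eight demo scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
)


def best_f1_threshold(y_true, y_score):
    """Return (threshold, precision, recall, f1) at the F1-maximizing threshold."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    i = int(np.argmax(f1))
    return thresholds[i], p[i], r[i], f1[i]


# Hypothetical validation labels and predicted probabilities.
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
val_scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.3, 0.9])

roc_auc = roc_auc_score(y_val, val_scores)
pr_auc = average_precision_score(y_val, val_scores)
thr, prec, rec, f1 = best_f1_threshold(y_val, val_scores)

# Predictions use ">=" to match how precision_recall_curve defines thresholds.
y_pred = (val_scores >= thr).astype(int)
print(f"ROC AUC={roc_auc:.4f} PR AUC={pr_auc:.4f} thr={thr:.2f} "
      f"P={prec:.3f} R={rec:.3f} F1={f1:.3f}")
```

Since the threshold is chosen on the same window it is scored on, the reported F1 is optimistic. That is acceptable for this task as stated, but a later untouched window would give a fairer estimate.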
Probability calibration:
- Use CalibratedClassifierCV on the training set only (cv=3), avoiding any validation leakage
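A minimal calibration sketch follows. For brevity it uses `make_classification` data and a scaler-plus-logistic-regression pipeline as stand-ins; the positional slice into "train" and "validation" only imitates the temporal split. `method="sigmoid"` and `C=1.0` are assumed choices, since the task fixes only `cv=3`.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42

# Synthetic, imbalanced stand-in data (hypothetical, not the task's dataset).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=SEED
)
X_train, y_train = X[:450], y[:450]  # stand-in for the training window
X_val, y_val = X[450:], y[450:]      # stand-in for the validation window

base = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", C=1.0, max_iter=1000),
)

# cv=3 splits the training data internally: each fold fits the base model on
# two thirds and fits the calibrator on the remaining third.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)  # validation rows are never seen here

val_proba = calibrated.predict_proba(X_val)[:, 1]
val_brier = brier_score_loss(y_val, val_proba)
print(f"validation Brier score: {val_brier:.4f}")
```

`class_weight='balanced'` tends to inflate predicted probabilities for the minority class, which is one reason a calibration step is useful here. The validation window is used only to measure the result, never to fit the calibrator.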
Feature importance:
- Compute permutation feature importance on the validation set and list the top 5 features by importance
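Permutation importance can be computed on the raw validation columns by passing the whole fitted pipeline, so each score refers to an original feature rather than a one-hot column. The sketch below uses synthetic data in which `sessions_last_7d` drives the target by construction. The helper `make_frame`, the scoring metric, and `n_repeats=10` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42
NUMERIC = ["sessions_last_7d", "purchases_last_30d", "avg_session_secs", "days_since_signup"]
CATEGORICAL = ["country", "device_type"]


def make_frame(n, rng):
    """Hypothetical data: only sessions_last_7d carries signal."""
    X = pd.DataFrame(
        {
            "country": rng.choice(["US", "DE", "IN"], n),
            "device_type": rng.choice(["ios", "android", "web"], n),
            "sessions_last_7d": rng.poisson(3, n),
            "purchases_last_30d": rng.poisson(1, n),
            "avg_session_secs": rng.gamma(2.0, 60.0, n),
            "days_since_signup": rng.integers(0, 365, n),
        }
    )
    p = 1.0 / (1.0 + np.exp(-1.5 * (X["sessions_last_7d"].to_numpy() - 3)))
    y = (rng.random(n) < p).astype(int)
    return X, y


rng = np.random.default_rng(SEED)
X_train, y_train = make_frame(600, rng)
X_val, y_val = make_frame(300, rng)

model = Pipeline(
    [
        ("pre", ColumnTransformer(
            [
                ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                                  ("scale", StandardScaler())]), NUMERIC),
                ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                                  ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
                 CATEGORICAL),
            ]
        )),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
).fit(X_train, y_train)

# Shuffle one raw column at a time on validation and measure the score drop.
result = permutation_importance(
    model, X_val, y_val, scoring="average_precision", n_repeats=10, random_state=SEED
)
top5 = (
    pd.Series(result.importances_mean, index=X_val.columns)
    .sort_values(ascending=False)
    .head(5)
)
print(top5)
```

With six input features the top-5 list omits only one. Correlated features can share or mask importance under permutation, so the ranking is best read as a rough guide.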
Briefly explain one potential target leakage risk in this schema and how your pipeline avoids it.
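One leakage risk worth illustrating: if `is_subscribed` records the user's status on `event_date` instead of a future outcome, same-day features such as `purchases_last_30d` may already reflect the subscription itself. The sketch below shows one possible remedy under that assumption: derive a forward-looking label and drop rows where the user has already subscribed. The function `add_forward_label` and the column `will_subscribe_30d` are hypothetical names, not part of the task.

```python
import pandas as pd


def add_forward_label(df: pd.DataFrame, horizon_days: int = 30) -> pd.DataFrame:
    """Label each row 1 if the user's first subscription falls within
    (event_date, event_date + horizon_days]; drop rows on or after it."""
    df = df.assign(event_date=pd.to_datetime(df["event_date"]))
    first_sub = (
        df.loc[df["is_subscribed"] == 1].groupby("user_id")["event_date"].min()
    )
    # NaN for users who never subscribe in the observed data.
    days_to_sub = (df["user_id"].map(first_sub) - df["event_date"]).dt.days
    already_subscribed = days_to_sub <= 0
    label = ((days_to_sub > 0) & (days_to_sub <= horizon_days)).astype(int)
    return df.loc[~already_subscribed].assign(
        will_subscribe_30d=label[~already_subscribed]
    )


# Hypothetical history: user "a" subscribes on 2025-08-25, user "b" never does.
history = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "a", "b", "b"],
        "event_date": ["2025-08-01", "2025-08-20", "2025-08-25", "2025-08-28",
                       "2025-08-01", "2025-08-20"],
        "is_subscribed": [0, 0, 1, 1, 0, 0],
    }
)
labeled = add_forward_label(history)
print(labeled[["user_id", "event_date", "will_subscribe_30d"]])
```

Rows whose 30-day horizon extends past the last observed date have only partially observed outcomes, so their zero labels should be treated with caution.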

Notes

Exclude user_id and event_date from model features.
Ensure reproducibility (fixed random seeds, deterministic splits).
