Take-Home ML Task: Reproducible Subscription Classification Pipeline
You are given a daily user-level dataset and must build a reproducible Python (scikit-learn) pipeline to predict whether a user will subscribe in the next 30 days.
Assume the dataset contains one row per user per event_date with these columns:
- user_id (string)
- event_date (date)
- country (categorical)
- device_type (categorical)
- sessions_last_7d (int)
- purchases_last_30d (int)
- avg_session_secs (float)
- days_since_signup (int)
- is_subscribed (0/1 target)
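As an illustration of the schema above, the following minimal sketch builds a few made-up rows and checks the stated assumptions: all nine columns present, one row per user per event_date, and a binary target. The sample values and the helper name `check_schema` are hypothetical and not part of the task data.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "user_id", "event_date", "country", "device_type",
    "sessions_last_7d", "purchases_last_30d", "avg_session_secs",
    "days_since_signup", "is_subscribed",
]

# Hypothetical sample rows; the values are illustrative only.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2"],
        "event_date": pd.to_datetime(["2025-08-24", "2025-08-25", "2025-08-25"]),
        "country": ["US", "US", None],           # missing values are allowed
        "device_type": ["ios", "ios", "web"],
        "sessions_last_7d": [3, 4, 0],
        "purchases_last_30d": [0, 1, 0],
        "avg_session_secs": [120.5, 98.0, None],
        "days_since_signup": [10, 11, 200],
        "is_subscribed": [0, 1, 0],
    }
)


def check_schema(frame: pd.DataFrame) -> bool:
    """Raise ValueError if the frame violates the assumed schema."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # The dataset is assumed to hold one row per user per event_date.
    if frame.duplicated(subset=["user_id", "event_date"]).any():
        raise ValueError("duplicate (user_id, event_date) rows")
    # The target is assumed to be strictly 0 or 1.
    if not frame["is_subscribed"].isin([0, 1]).all():
        raise ValueError("is_subscribed must be 0 or 1")
    return True


check_schema(df)
```

Running a check like this before any splitting makes schema problems surface early rather than inside the model pipeline.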
Constraints and requirements:
- Temporal split (no leakage):
  - Training data: rows with event_date ≤ 2025-08-25
  - Validation data: rows with event_date from 2025-08-26 through 2025-09-01 (inclusive)
  - Today is 2025-09-01
- Preprocessing via ColumnTransformer:
  - Numeric pipeline: SimpleImputer(strategy='median') → StandardScaler()
  - Categorical pipeline: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
- Classifier and hyperparameters:
  - LogisticRegression with class_weight='balanced'
  - Tune max_iter and perform a small hyperparameter search over C using StratifiedKFold CV on the training set only
- Evaluation on the validation window:
  - Report ROC AUC and PR AUC
  - Choose a decision threshold that maximizes F1 on validation; report precision, recall, and F1 at that threshold
- Probability calibration:
  - Use CalibratedClassifierCV on the training set only (cv=3), avoiding any validation leakage
- Feature importance:
  - Compute permutation feature importance on the validation set and list the top 5 features by importance
- Briefly explain one potential target leakage risk in this schema and how your pipeline avoids it.
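The split, preprocessing, search, and evaluation requirements above could be sketched roughly as follows. This is one possible reading, not a reference solution: the data is synthetic (made-up values generated only so the block runs), the grid of C values and max_iter=1000 are arbitrary choices, and PR AUC is estimated here with average precision, which is one common convention.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in for the real dataset (assumption: all values are made up).
n = 600
dates = pd.date_range("2025-08-01", "2025-09-01", freq="D")
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(n)],
    "event_date": dates[rng.integers(0, len(dates), size=n)],
    "country": rng.choice(["US", "DE", "IN"], size=n),
    "device_type": rng.choice(["ios", "android", "web"], size=n),
    "sessions_last_7d": rng.integers(0, 20, size=n),
    "purchases_last_30d": rng.integers(0, 5, size=n),
    "avg_session_secs": rng.gamma(2.0, 60.0, size=n),
    "days_since_signup": rng.integers(0, 400, size=n),
})
logit = 0.2 * df["sessions_last_7d"] + 0.5 * df["purchases_last_30d"] - 3.5
df["is_subscribed"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Temporal split: train up to the cutoff, validate on the following 7 days.
cutoff = pd.Timestamp("2025-08-25")
valid_end = pd.Timestamp("2025-09-01")
train = df[df["event_date"] <= cutoff]
valid = df[(df["event_date"] > cutoff) & (df["event_date"] <= valid_end)]

NUMERIC = ["sessions_last_7d", "purchases_last_30d",
           "avg_session_secs", "days_since_signup"]
CATEGORICAL = ["country", "device_type"]
FEATURES = NUMERIC + CATEGORICAL  # user_id and event_date are excluded

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUMERIC),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     CATEGORICAL),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000,
                               random_state=SEED)),
])

# Small search over C with stratified CV on the training rows only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=cv)
search.fit(train[FEATURES], train["is_subscribed"])

# Evaluate on the validation window.
y_valid = valid["is_subscribed"].to_numpy()
proba = search.predict_proba(valid[FEATURES])[:, 1]
roc_auc = roc_auc_score(y_valid, proba)
pr_auc = average_precision_score(y_valid, proba)

# F1-maximizing threshold; the final curve point has no threshold, so drop it.
precision, recall, thresholds = precision_recall_curve(y_valid, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))
best_threshold = float(thresholds[best])

print(f"best C={search.best_params_['clf__C']}  "
      f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}")
print(f"threshold={best_threshold:.3f}  precision={precision[best]:.3f}  "
      f"recall={recall[best]:.3f}  F1={f1[best]:.3f}")
```

Because the imputers, scaler, and encoder live inside the Pipeline, each CV fold fits them on its own training portion, so no statistics from held-out folds or from the validation window reach the model.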
Notes
- Exclude user_id and event_date from model features.
- Ensure reproducibility (fixed random seeds, deterministic splits).