You must build an end‑to‑end scikit‑learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage). Data columns: user_id (str), snapshot_date (date, the t0 date), country (200 levels), device (6 levels), is_premium (bool), sessions_7d (int), spend_28d (float), avg_session_seconds_28d (float), days_since_signup (int), churn_28d (bool target).
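For concreteness, a minimal synthetic frame matching this schema is sketched below; every value is fabricated for illustration (real rows would come from a feature store keyed by user_id and snapshot_date), and the roughly 15% positive rate is an assumption made to mimic class imbalance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Fabricated rows matching the schema above, sorted into time order for the temporal CV used later.
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(n)],
    "snapshot_date": pd.to_datetime("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 180, n), unit="D"),
    "country": rng.choice([f"C{i}" for i in range(200)], n),
    "device": rng.choice(["ios", "android", "web", "tv", "desktop", "tablet"], n),
    "is_premium": rng.choice([True, False], n),
    "sessions_7d": rng.poisson(3, n),
    "spend_28d": rng.gamma(2.0, 10.0, n),
    "avg_session_seconds_28d": rng.gamma(2.0, 120.0, n),
    "days_since_signup": rng.integers(1, 1000, n),
    "churn_28d": rng.random(n) < 0.15,  # assumed ~15% positive rate, not from the spec
})
df = df.sort_values("snapshot_date").reset_index(drop=True)
```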
Requirements:
- Use a ColumnTransformer with SimpleImputer and StandardScaler for numeric features, and OneHotEncoder(handle_unknown='ignore') for categoricals. Justify your handling of the high-cardinality country column (e.g., feature hashing or rare-category bucketing) and implement one approach; a preprocessing sketch follows this list.
- Prevent leakage: ensure all aggregations (like sessions_7d) are computed strictly up to snapshot_date, and that cross-validation respects time using TimeSeriesSplit with a temporal gap (see the CV sketch below).
- Address class imbalance (e.g., class_weight='balanced', or resampling applied only inside CV training folds) and output well-calibrated probabilities via CalibratedClassifierCV, as sketched below.
- Tune hyperparameters via cross-validated search; report ROC-AUC and Brier score on a holdout time slice (see the tuning and evaluation sketch below).
- Provide concise Python code that defines the pipeline, CV, calibration, and evaluation, sets random_state for reproducibility, and demonstrates model persistence (joblib). Briefly explain how you would monitor drift post-deployment; a persistence and drift-check sketch closes this section.
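A minimal preprocessing sketch for the first requirement, assuming scikit-learn >= 1.1 so that OneHotEncoder supports min_frequency; the rare-category threshold (1% of rows), the imputation strategies, and the exact column lists are illustrative choices rather than part of the spec. Bucketing keeps the country encoding compact (one column absorbs all rare countries) while handle_unknown='ignore' prevents unseen countries from erroring at predict time.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["sessions_7d", "spend_28d", "avg_session_seconds_28d", "days_since_signup"]
categorical_cols = ["country", "device", "is_premium"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # min_frequency buckets countries seen in <1% of training rows into a single
    # "infrequent" column; handle_unknown='ignore' keeps unseen levels from erroring.
    ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=0.01)),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
```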
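For the leakage requirement, a temporal CV sketch reusing df from the first snippet, with rows already sorted by snapshot_date. Note that TimeSeriesSplit's gap is counted in rows, not days, so gap=28 only approximates a buffer the length of the label window; a strict calendar gap would need explicit filtering on snapshot_date.

```python
from sklearn.model_selection import TimeSeriesSplit

X = df.drop(columns=["user_id", "snapshot_date", "churn_28d"])
y = df["churn_28d"].astype(int)

# gap skips rows between each training and validation block so that snapshots whose
# 28-day outcome window could overlap the validation period are left out of training.
tscv = TimeSeriesSplit(n_splits=5, gap=28)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    train_end = df.loc[train_idx, "snapshot_date"].max()
    val_start = df.loc[val_idx, "snapshot_date"].min()
    print(f"fold {fold}: train ends {train_end.date()}, validation starts {val_start.date()}")
```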
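For imbalance and calibration, one sketch under stated assumptions: a logistic regression with class_weight='balanced' as the base estimator (any classifier exposing predict_proba would do), sigmoid (Platt) calibration, and the estimator= keyword of CalibratedClassifierCV, which assumes scikit-learn >= 1.2 (older releases call it base_estimator). Fitting is deferred to the evaluation sketch below so the temporal holdout stays untouched.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline

# `preprocess` comes from the preprocessing sketch above.
base_model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])

# Calibration folds also respect time, so the calibration slice always postdates
# the data used to fit the underlying classifier.
calibrated_model = CalibratedClassifierCV(
    estimator=base_model,
    method="sigmoid",
    cv=TimeSeriesSplit(n_splits=3, gap=28),
)
```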
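A tuning and evaluation sketch, reusing X, y, df, and base_model from the snippets above: the most recent ~20% of time-ordered rows serve as the holdout slice, RandomizedSearchCV searches over the regularization strength (the search space and n_iter are illustrative), and ROC-AUC plus Brier score are reported on the holdout.

```python
from scipy.stats import loguniform
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Temporal holdout: the most recent ~20% of snapshots are never seen during tuning.
cutoff = int(len(df) * 0.8)
X_train, X_test = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

search = RandomizedSearchCV(
    base_model,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},  # illustrative search space
    n_iter=20,
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=5, gap=28),
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)

# Calibrate the best configuration, then score the untouched holdout slice.
final_model = CalibratedClassifierCV(
    estimator=search.best_estimator_,
    method="sigmoid",
    cv=TimeSeriesSplit(n_splits=3, gap=28),
)
final_model.fit(X_train, y_train)

proba = final_model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
print("Brier score:", brier_score_loss(y_test, proba))
```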
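Finally, persistence and a drift check. joblib handles serialization of the fitted pipeline; for post-deployment monitoring, one common approach is to track the Population Stability Index (PSI) of the score distribution (and of key input features) between training data and each production batch, and to re-check ROC-AUC and Brier score once the 28-day labels mature. The population_stability_index helper, the churn_model.joblib filename, and the ~0.2 alert threshold are illustrative, not prescribed by the spec.

```python
import joblib
import numpy as np

# Persist the fitted, calibrated model and reload it to verify the round-trip.
joblib.dump(final_model, "churn_model.joblib")
reloaded = joblib.load("churn_model.joblib")

def population_stability_index(expected, actual, bins=10):
    """Illustrative drift metric comparing two score distributions on quantile bins
    of `expected`; values above roughly 0.2 usually trigger a retraining review."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # predicted probabilities live in [0, 1]
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

train_scores = final_model.predict_proba(X_train)[:, 1]
live_scores = reloaded.predict_proba(X_test)[:, 1]  # stand-in for a production batch
print("PSI of predicted probabilities:", population_stability_index(train_scores, live_scores))
```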