This question evaluates a data scientist's practical competency in building leak-safe, end-to-end scikit-learn pipelines. It covers feature preprocessing (including strategies for high-cardinality categoricals), temporal leakage prevention, time-aware cross-validation, class imbalance handling, probability calibration, hyperparameter tuning, evaluation with ROC-AUC and Brier score, model persistence, and post-deployment drift monitoring. It is commonly asked in Machine Learning interviews to verify applied engineering and model validation skills on time-dependent tabular data, and it tests practical application rather than purely conceptual understanding.
You must build an end-to-end scikit-learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage). Data columns: user_id (str), snapshot_date (date, the t0 date), country (200 levels), device (6 levels), is_premium (bool), sessions_7d (int), spend_28d (float), avg_session_seconds_28d (float), days_since_signup (int), churn_28d (bool target). Requirements: