Build leak-safe sklearn model with calibration
Company: Apple
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You must build an end‑to‑end scikit‑learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage). Data columns: user_id (str), snapshot_date (date, the t0 date), country (200 levels), device (6 levels), is_premium (bool), sessions_7d (int), spend_28d (float), avg_session_seconds_28d (float), days_since_signup (int), churn_28d (bool target).
Requirements:
- Use ColumnTransformer with SimpleImputer, StandardScaler for numeric, and OneHotEncoder(handle_unknown='ignore') for categoricals. Justify your handling of high‑cardinality country (e.g., hashing or rare-category bucketing) and implement one approach.
- Prevent leakage: ensure all aggregations (like sessions_7d) are computed strictly up to snapshot_date and that cross‑validation respects time using TimeSeriesSplit with a temporal gap.
- Address class imbalance (e.g., class_weight='balanced', or resampling applied only within CV training folds) and output well‑calibrated probabilities via CalibratedClassifierCV.
- Tune hyperparameters via cross‑validated search; report ROC‑AUC and Brier score on a holdout time slice.
- Provide concise Python code that defines the pipeline, CV, calibration, and evaluation, sets random_state for reproducibility, and demonstrates model persistence (joblib). Briefly explain how you would monitor drift post‑deployment.
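A minimal sketch of the preprocessing step from the first requirement. Column names come from the problem statement; the imputer strategies and the min_frequency threshold are illustrative choices, not the only valid ones. OneHotEncoder's min_frequency argument (scikit-learn >= 1.1) implements the rare-category bucketing option for the 200-level country column by grouping infrequent levels into a single indicator.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["sessions_7d", "spend_28d", "avg_session_seconds_28d",
                "days_since_signup"]
categorical_cols = ["country", "device", "is_premium"]

preprocess = ColumnTransformer([
    # median imputation then scaling for numerics
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # mode imputation, then one-hot with unknown levels ignored at inference
    # and rare country levels bucketed together (min_frequency is illustrative)
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore",
                                            min_frequency=50))]),
     categorical_cols),
])

# tiny synthetic frame just to show the transformer fitting end to end
demo = pd.DataFrame({
    "sessions_7d": [1, 2, 3, 4],
    "spend_28d": [0.0, 5.0, None, 2.0],
    "avg_session_seconds_28d": [100.0, 200.0, 150.0, 120.0],
    "days_since_signup": [10, 20, 30, 40],
    "country": ["US", "DE", "US", "FR"],
    "device": ["ios", "android", "ios", "web"],
    "is_premium": [True, False, True, False],
})
Xt = preprocess.fit_transform(demo)
```

The alternative mentioned in the requirement, feature hashing via HashingVectorizer/FeatureHasher, trades interpretability for a fixed output width; bucketing keeps the frequent levels readable, which is usually preferable at 200 levels.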
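The temporal-gap requirement can be illustrated with TimeSeriesSplit's gap parameter. The sketch below uses synthetic rows assumed to be pre-sorted by snapshot_date; the 28-row gap mirrors the 28-day label window so training rows cannot overlap the label period of validation rows. Note the gap here is in rows, not days — with many users per snapshot_date, a days-based gap needs a custom splitter keyed on snapshot_date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))   # stand-in features, assumed time-ordered

# gap=28 leaves a 28-row buffer between each training and validation fold
tscv = TimeSeriesSplit(n_splits=5, gap=28)

n_folds = 0
for train_idx, val_idx in tscv.split(X):
    # every training index precedes every validation index by more than the gap
    assert train_idx.max() + 28 < val_idx.min()
    n_folds += 1
```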
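For imbalance plus calibration, one option is class_weight='balanced' inside the estimator, wrapped in CalibratedClassifierCV so the reported probabilities are calibrated on held-out folds. The sketch below uses synthetic, roughly 20%-positive data and sigmoid (Platt) calibration with time-aware folds; all sizes and seeds are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.2).astype(int)   # ~20% positives, imbalanced

# class_weight='balanced' reweights the loss by inverse class frequency
base = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000,
                               random_state=0)),
])

# sigmoid calibration fitted on time-aware validation folds
calibrated = CalibratedClassifierCV(base, method="sigmoid",
                                    cv=TimeSeriesSplit(n_splits=3))
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]
```

With more data, method="isotonic" is often better calibrated but can overfit small calibration folds; sigmoid is the safer default here.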
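The tuning-and-evaluation requirement can be sketched as a cross-validated search on the earlier time slices with the final slice held out, then ROC-AUC and Brier score on that holdout. The synthetic data, grid values, and 80/20 split below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 5))                     # assumed time-ordered rows
y = (X[:, 0] + rng.normal(size=n) > 1.0).astype(int)

split = int(n * 0.8)                            # final 20% = holdout time slice
X_tr, y_tr = X[:split], y[:split]
X_ho, y_ho = X[split:], y[split:]

# search over C with time-aware CV; scoring on ROC-AUC
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=4),
)
search.fit(X_tr, y_tr)

p = search.predict_proba(X_ho)[:, 1]
auc = roc_auc_score(y_ho, p)                    # ranking quality
brier = brier_score_loss(y_ho, p)               # calibration + sharpness
```

Reporting both metrics matters: ROC-AUC ignores calibration, while Brier score penalizes miscalibrated probabilities, which is what downstream churn-intervention thresholds depend on.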
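Finally, persistence and drift monitoring. The sketch below dumps and reloads a fitted model with joblib, then shows one common drift signal, the Population Stability Index (PSI) over score distributions; the model, file path, and PSI thresholds in the comment are all illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + rng.normal(size=120) > 0).astype(int)
model = LogisticRegression(random_state=1).fit(X, y)

# persist and restore; predictions must round-trip exactly
path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

def psi(expected, actual, bins=10):
    """Population Stability Index between reference and live score samples.
    Common rule of thumb (illustrative): <0.1 stable, 0.1-0.25 watch, >0.25 drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range live scores
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

ref_scores = model.predict_proba(X)[:, 1]
live_scores = restored.predict_proba(X)[:, 1]    # identical data -> PSI ~ 0
```

In production, the monitoring story would track PSI (or KS statistics) on both input features and output scores against the training reference, plus delayed-label metrics (ROC-AUC, Brier) once churn_28d outcomes mature 28 days after each snapshot, with alerts triggering retraining.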
Quick Answer: This question evaluates a data scientist's practical ability to build a leak-safe, end-to-end scikit-learn pipeline. It covers feature preprocessing (including strategies for high-cardinality categoricals), temporal leakage prevention, time-aware cross-validation, class-imbalance handling, probability calibration, hyperparameter tuning, evaluation with ROC-AUC and Brier score, model persistence, and post-deployment drift monitoring. It is commonly asked in Machine Learning interviews because it tests applied engineering and model-validation skills on time-dependent tabular data at a practical level rather than a purely conceptual one.