Build leak-safe sklearn model with calibration
Company: Apple
Role: Data Scientist
Category: Machine Learning
Difficulty: Medium
Interview Round: Technical Screen
You must build an end‑to‑end scikit‑learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage). Data columns: user_id (str), snapshot_date (date, the t0 date), country (200 levels), device (6 levels), is_premium (bool), sessions_7d (int), spend_28d (float), avg_session_seconds_28d (float), days_since_signup (int), churn_28d (bool target).
Requirements:
- Use ColumnTransformer with SimpleImputer, StandardScaler for numeric, and OneHotEncoder(handle_unknown='ignore') for categoricals. Justify your handling of high‑cardinality country (e.g., hashing or rare-category bucketing) and implement one approach.
- Prevent leakage: ensure all aggregations (like sessions_7d) are computed strictly up to snapshot_date and that cross‑validation respects time using TimeSeriesSplit with a temporal gap.
- Address class imbalance (e.g., class_weight='balanced', or resampling applied only within CV training folds) and output well‑calibrated probabilities via CalibratedClassifierCV.
- Tune hyperparameters via cross‑validated search; report ROC‑AUC and Brier score on a holdout time slice.
- Provide concise Python code that defines the pipeline, CV, calibration, and evaluation, sets random_state for reproducibility, and demonstrates model persistence (joblib). Briefly explain how you would monitor drift post‑deployment.
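A minimal sketch of the preprocessing step from the first requirement. Column names come from the problem statement; the imputer strategies and the min_frequency threshold are illustrative choices, not the only valid ones. OneHotEncoder's min_frequency argument (scikit-learn >= 1.1) implements the rare-category bucketing option for the 200-level country column by grouping infrequent levels into a single indicator.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["sessions_7d", "spend_28d", "avg_session_seconds_28d",
                "days_since_signup"]
categorical_cols = ["country", "device", "is_premium"]

preprocess = ColumnTransformer([
    # median imputation then scaling for numerics
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # mode imputation, then one-hot with unknown levels ignored at inference
    # and rare country levels bucketed together (min_frequency is illustrative)
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore",
                                            min_frequency=50))]),
     categorical_cols),
])

# tiny synthetic frame just to show the transformer fitting end to end
demo = pd.DataFrame({
    "sessions_7d": [1, 2, 3, 4],
    "spend_28d": [0.0, 5.0, None, 2.0],
    "avg_session_seconds_28d": [100.0, 200.0, 150.0, 120.0],
    "days_since_signup": [10, 20, 30, 40],
    "country": ["US", "DE", "US", "FR"],
    "device": ["ios", "android", "ios", "web"],
    "is_premium": [True, False, True, False],
})
Xt = preprocess.fit_transform(demo)
```

The alternative mentioned in the requirement, feature hashing via HashingVectorizer/FeatureHasher, trades interpretability for a fixed output width; bucketing keeps the frequent levels readable, which is usually preferable at 200 levels.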
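The temporal-gap requirement can be illustrated with TimeSeriesSplit's gap parameter. The sketch below uses synthetic rows assumed to be pre-sorted by snapshot_date; the 28-row gap mirrors the 28-day label window so training rows cannot overlap the label period of validation rows. Note the gap here is in rows, not days — with many users per snapshot_date, a days-based gap needs a custom splitter keyed on snapshot_date.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))   # stand-in features, assumed time-ordered

# gap=28 leaves a 28-row buffer between each training and validation fold
tscv = TimeSeriesSplit(n_splits=5, gap=28)

n_folds = 0
for train_idx, val_idx in tscv.split(X):
    # every training index precedes every validation index by more than the gap
    assert train_idx.max() + 28 < val_idx.min()
    n_folds += 1
```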
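For imbalance plus calibration, one option is class_weight='balanced' inside the estimator, wrapped in CalibratedClassifierCV so the reported probabilities are calibrated on held-out folds. The sketch below uses synthetic, roughly 20%-positive data and sigmoid (Platt) calibration with time-aware folds; all sizes and seeds are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 0.2).astype(int)   # ~20% positives, imbalanced

# class_weight='balanced' reweights the loss by inverse class frequency
base = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000,
                               random_state=0)),
])

# sigmoid calibration fitted on time-aware validation folds
calibrated = CalibratedClassifierCV(base, method="sigmoid",
                                    cv=TimeSeriesSplit(n_splits=3))
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]
```

With more data, method="isotonic" is often better calibrated but can overfit small calibration folds; sigmoid is the safer default here.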
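The tuning-and-evaluation requirement can be sketched as a cross-validated search on the earlier time slices with the final slice held out, then ROC-AUC and Brier score on that holdout. The synthetic data, grid values, and 80/20 split below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 5))                     # assumed time-ordered rows
y = (X[:, 0] + rng.normal(size=n) > 1.0).astype(int)

split = int(n * 0.8)                            # final 20% = holdout time slice
X_tr, y_tr = X[:split], y[:split]
X_ho, y_ho = X[split:], y[split:]

# search over C with time-aware CV; scoring on ROC-AUC
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=4),
)
search.fit(X_tr, y_tr)

p = search.predict_proba(X_ho)[:, 1]
auc = roc_auc_score(y_ho, p)                    # ranking quality
brier = brier_score_loss(y_ho, p)               # calibration + sharpness
```

Reporting both metrics matters: ROC-AUC ignores calibration, while Brier score penalizes miscalibrated probabilities, which is what downstream churn-intervention thresholds depend on.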
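Finally, persistence and drift monitoring. The sketch below dumps and reloads a fitted model with joblib, then shows one common drift signal, the Population Stability Index (PSI) over score distributions; the model, file path, and PSI thresholds in the comment are all illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + rng.normal(size=120) > 0).astype(int)
model = LogisticRegression(random_state=1).fit(X, y)

# persist and restore; predictions must round-trip exactly
path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

def psi(expected, actual, bins=10):
    """Population Stability Index between reference and live score samples.
    Common rule of thumb (illustrative): <0.1 stable, 0.1-0.25 watch, >0.25 drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range live scores
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

ref_scores = model.predict_proba(X)[:, 1]
live_scores = restored.predict_proba(X)[:, 1]    # identical data -> PSI ~ 0
```

In production, the monitoring story would track PSI (or KS statistics) on both input features and output scores against the training reference, plus delayed-label metrics (ROC-AUC, Brier) once churn_28d outcomes mature 28 days after each snapshot, with alerts triggering retraining.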
Quick Answer: This question evaluates a data scientist's practical ability to build a leak-safe, end-to-end scikit-learn pipeline. It covers feature preprocessing (including strategies for high-cardinality categoricals), temporal leakage prevention, time-aware cross-validation, class-imbalance handling, probability calibration, hyperparameter tuning, evaluation with ROC-AUC and Brier score, model persistence, and post-deployment drift monitoring. It is commonly asked in Machine Learning interviews because it tests applied engineering and model-validation skills on time-dependent tabular data at a practical level rather than a purely conceptual one.