You must build an end‑to‑end scikit‑learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage). Data columns: user_id (str), snapshot_date (date, the t0 date), country (200 levels), device (6 levels), is_premium (bool), sessions_7d (int), spend_28d (float), avg_session_seconds_28d (float), days_since_signup (int), churn_28d (bool target).
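For concreteness, a minimal synthetic frame matching this schema is sketched below; every value is fabricated for illustration (real rows would come from a feature store keyed by user_id and snapshot_date), and the roughly 15% positive rate is an assumption made to mimic class imbalance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Fabricated rows matching the schema above, sorted into time order for the temporal CV used later.
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(n)],
    "snapshot_date": pd.to_datetime("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 180, n), unit="D"),
    "country": rng.choice([f"C{i}" for i in range(200)], n),
    "device": rng.choice(["ios", "android", "web", "tv", "desktop", "tablet"], n),
    "is_premium": rng.choice([True, False], n),
    "sessions_7d": rng.poisson(3, n),
    "spend_28d": rng.gamma(2.0, 10.0, n),
    "avg_session_seconds_28d": rng.gamma(2.0, 120.0, n),
    "days_since_signup": rng.integers(1, 1000, n),
    "churn_28d": rng.random(n) < 0.15,  # assumed ~15% positive rate, not from the spec
})
df = df.sort_values("snapshot_date").reset_index(drop=True)
```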
Requirements:
- Use a ColumnTransformer with SimpleImputer and StandardScaler for numeric features, and OneHotEncoder(handle_unknown='ignore') for categoricals. Justify your handling of the high-cardinality country column (e.g., feature hashing or rare-category bucketing) and implement one approach; a preprocessing sketch follows this list.
- Prevent leakage: ensure all aggregations (like sessions_7d) are computed strictly up to snapshot_date, and that cross-validation respects time using TimeSeriesSplit with a temporal gap (see the CV sketch below).
- Address class imbalance (e.g., class_weight='balanced', or resampling applied only inside CV training folds) and output well-calibrated probabilities via CalibratedClassifierCV, as sketched below.
- Tune hyperparameters via cross-validated search; report ROC-AUC and Brier score on a holdout time slice (see the tuning and evaluation sketch below).
- Provide concise Python code that defines the pipeline, CV, calibration, and evaluation, sets random_state for reproducibility, and demonstrates model persistence (joblib). Briefly explain how you would monitor drift post-deployment; a persistence and drift-check sketch closes this section.
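A minimal preprocessing sketch for the first requirement, assuming scikit-learn >= 1.1 so that OneHotEncoder supports min_frequency; the rare-category threshold (1% of rows), the imputation strategies, and the exact column lists are illustrative choices rather than part of the spec. Bucketing keeps the country encoding compact (one column absorbs all rare countries) while handle_unknown='ignore' prevents unseen countries from erroring at predict time.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["sessions_7d", "spend_28d", "avg_session_seconds_28d", "days_since_signup"]
categorical_cols = ["country", "device", "is_premium"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # min_frequency buckets countries seen in <1% of training rows into a single
    # "infrequent" column; handle_unknown='ignore' keeps unseen levels from erroring.
    ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=0.01)),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
```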
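For the leakage requirement, a temporal CV sketch reusing df from the first snippet, with rows already sorted by snapshot_date. Note that TimeSeriesSplit's gap is counted in rows, not days, so gap=28 only approximates a buffer the length of the label window; a strict calendar gap would need explicit filtering on snapshot_date.

```python
from sklearn.model_selection import TimeSeriesSplit

X = df.drop(columns=["user_id", "snapshot_date", "churn_28d"])
y = df["churn_28d"].astype(int)

# gap skips rows between each training and validation block so that snapshots whose
# 28-day outcome window could overlap the validation period are left out of training.
tscv = TimeSeriesSplit(n_splits=5, gap=28)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    train_end = df.loc[train_idx, "snapshot_date"].max()
    val_start = df.loc[val_idx, "snapshot_date"].min()
    print(f"fold {fold}: train ends {train_end.date()}, validation starts {val_start.date()}")
```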
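For imbalance and calibration, one sketch under stated assumptions: a logistic regression with class_weight='balanced' as the base estimator (any classifier exposing predict_proba would do), sigmoid (Platt) calibration, and the estimator= keyword of CalibratedClassifierCV, which assumes scikit-learn >= 1.2 (older releases call it base_estimator). Fitting is deferred to the evaluation sketch below so the temporal holdout stays untouched.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline

# `preprocess` comes from the preprocessing sketch above.
base_model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
])

# Calibration folds also respect time, so the calibration slice always postdates
# the data used to fit the underlying classifier.
calibrated_model = CalibratedClassifierCV(
    estimator=base_model,
    method="sigmoid",
    cv=TimeSeriesSplit(n_splits=3, gap=28),
)
```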
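A tuning and evaluation sketch, reusing X, y, df, and base_model from the snippets above: the most recent ~20% of time-ordered rows serve as the holdout slice, RandomizedSearchCV searches over the regularization strength (the search space and n_iter are illustrative), and ROC-AUC plus Brier score are reported on the holdout.

```python
from scipy.stats import loguniform
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Temporal holdout: the most recent ~20% of snapshots are never seen during tuning.
cutoff = int(len(df) * 0.8)
X_train, X_test = X.iloc[:cutoff], X.iloc[cutoff:]
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]

search = RandomizedSearchCV(
    base_model,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},  # illustrative search space
    n_iter=20,
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=5, gap=28),
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)

# Calibrate the best configuration, then score the untouched holdout slice.
final_model = CalibratedClassifierCV(
    estimator=search.best_estimator_,
    method="sigmoid",
    cv=TimeSeriesSplit(n_splits=3, gap=28),
)
final_model.fit(X_train, y_train)

proba = final_model.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))
print("Brier score:", brier_score_loss(y_test, proba))
```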
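Finally, persistence and a drift check. joblib handles serialization of the fitted pipeline; for post-deployment monitoring, one common approach is to track the Population Stability Index (PSI) of the score distribution (and of key input features) between training data and each production batch, and to re-check ROC-AUC and Brier score once the 28-day labels mature. The population_stability_index helper, the churn_model.joblib filename, and the ~0.2 alert threshold are illustrative, not prescribed by the spec.

```python
import joblib
import numpy as np

# Persist the fitted, calibrated model and reload it to verify the round-trip.
joblib.dump(final_model, "churn_model.joblib")
reloaded = joblib.load("churn_model.joblib")

def population_stability_index(expected, actual, bins=10):
    """Illustrative drift metric comparing two score distributions on quantile bins
    of `expected`; values above roughly 0.2 usually trigger a retraining review."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = 0.0, 1.0  # predicted probabilities live in [0, 1]
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

train_scores = final_model.predict_proba(X_train)[:, 1]
live_scores = reloaded.predict_proba(X_test)[:, 1]  # stand-in for a production batch
print("PSI of predicted probabilities:", population_stability_index(train_scores, live_scores))
```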