
Build leak-safe sklearn model with calibration

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a data scientist's practical competency in building leak-safe, end-to-end scikit-learn pipelines, covering feature preprocessing (including high-cardinality categorical strategies), temporal leakage prevention, time-aware cross-validation, class imbalance handling, probability calibration, hyperparameter tuning, evaluation with ROC-AUC and Brier score, model persistence, and post-deployment drift monitoring. It is commonly asked in Machine Learning interviews to verify applied engineering and model validation skills for time-dependent tabular data, testing the Machine Learning domain at a practical application level rather than purely conceptual understanding.



Company: Apple

Role: Data Scientist

Category: Machine Learning

Difficulty: Medium

Interview Round: Technical Screen



Oct 13, 2025, 9:49 PM

You must build an end‑to‑end scikit‑learn pipeline to predict churn_28d at decision time t0 using only features available at or before t0 (no leakage).

Data columns:

  • user_id (str)
  • snapshot_date (date, the t0 date)
  • country (200 levels)
  • device (6 levels)
  • is_premium (bool)
  • sessions_7d (int)
  • spend_28d (float)
  • avg_session_seconds_28d (float)
  • days_since_signup (int)
  • churn_28d (bool target)

Requirements:

  • Use ColumnTransformer with SimpleImputer, StandardScaler for numeric, and OneHotEncoder(handle_unknown='ignore') for categoricals. Justify your handling of high‑cardinality country (e.g., hashing or rare-category bucketing) and implement one approach.
  • Prevent leakage: ensure all aggregations (like sessions_7d) are computed strictly up to snapshot_date and that cross‑validation respects time using TimeSeriesSplit with a temporal gap.
  • Address class imbalance (e.g., class_weight='balanced' or resampling applied only within CV training folds) and output well‑calibrated probabilities via CalibratedClassifierCV.
  • Tune hyperparameters via cross‑validated search; report ROC‑AUC and Brier score on a holdout time slice.
  • Provide concise Python code that defines the pipeline, CV, calibration, and evaluation, sets random_state for reproducibility, and demonstrates model persistence (joblib). Briefly explain how you would monitor drift post‑deployment.