Construct a Churn-Prediction Pipeline Using Scikit-Learn
Quick Overview
This question tests practical machine learning engineering skills, specifically the ability to construct an end-to-end classification pipeline for imbalanced tabular data. It evaluates knowledge of preprocessing, model validation, probability calibration, and production packaging — core competencies assessed in data scientist interviews.
Company: Apple
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
##### Scenario
Building a churn-prediction pipeline for a subscription business using scikit-learn.
##### Question
Describe, step-by-step, how you would construct, train, validate, and evaluate a churn-prediction model in scikit-learn, including preprocessing, model choice, hyper-parameter tuning, and packaging the final pipeline for production.
##### Hints
Mention Pipeline, ColumnTransformer, GridSearchCV, cross-validation, joblib.
Construct a Churn-Prediction Pipeline in scikit-learn
Scenario
You are a data scientist at a subscription business. You need to build a model that predicts customer churn, defined as: will a currently-active customer cancel or go inactive within the next 30 days?
The training data is tabular with a mix of numeric and categorical features (e.g., tenure, monthly spend, plan type, region, support-ticket counts). The positive class (churners) is imbalanced — far fewer churners than non-churners. Some features change over time, and the company wants to act on the model's output by sending retention offers, so the cost of a false positive (a wasted offer) differs from the cost of a false negative (a lost customer).
Walk through, end to end, how you would construct, train, validate, evaluate, and ship this churn model in scikit-learn. Your answer should make concrete use of scikit-learn's Pipeline, ColumnTransformer, GridSearchCV, cross-validation utilities, and joblib. The problem is broken into parts below; treat them as one coherent pipeline rather than seven disconnected answers.
Constraints & Assumptions
Library: scikit-learn (you may reference pandas / numpy), with the model packaged for a Python serving environment.
Label horizon: churn within a fixed 30-day prediction window following a snapshot (the "as-of" date).
Imbalance: churners are roughly 5–20% of rows; treat the exact rate as data-dependent, not a fixed number.
Features: mixed numeric + categorical; some categoricals are high-cardinality, and unseen categories can appear at inference time.
Acting on predictions: downstream, a churn score triggers a retention action, so calibrated probabilities and a business-driven decision threshold matter, not just a ranking.
Reproducibility: fix random seeds; the same preprocessing must run identically in training and production.
Clarifying Questions to Ask
How is churn operationally defined: hard cancellation only, or also "inactive" (no logins/usage), and over what exact window relative to the as-of date?
Is the data time-dependent (do we have multiple monthly snapshots per customer), or is it a single cross-sectional snapshot? This decides whether splits and CV must be chronological.
What is the downstream action and its economics: what does a retention offer cost, and what is the value of a saved customer? This sets the operating threshold and the metric we optimize.
What is the prediction cadence and latency budget (nightly batch scoring vs. real-time), and what infrastructure will load the model?
Are there fairness, regulatory, or feature-availability constraints (e.g., a feature that exists in training but is not reliably available at serving time)?
How will we measure success after deployment: is there a holdout / A-B test for the retention campaign, not just offline AUC?
Part 1 — Problem framing, label construction, and leakage prevention
Define the target precisely and lay out how you build the labeled dataset so that no information from the future (after the as-of date) leaks into the features. Describe what split strategy you use and why.
What This Part Should Cover
A precise, time-anchored definition of y and the feature cutoff.
Identification of concrete leakage risks (target, identity, window overlap) and how each is closed.
A justified choice between stratified vs. out-of-time splitting tied to whether the data is time-dependent.
Holding out a final test set that is never touched during model selection.
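The labeling and split logic above can be sketched as follows. This is a minimal illustration, assuming a hypothetical snapshot table with columns `customer_id`, `as_of_date`, and `churn_date` (missing when the customer has not churned); none of these names come from the source. The split leaves a gap of one label horizon before the test period so that training labels never look into the test window.

```python
import pandas as pd

def add_churn_label(snapshots: pd.DataFrame, horizon_days: int = 30) -> pd.DataFrame:
    """Label each (customer, as_of_date) row: 1 if churn falls in (as_of, as_of + horizon]."""
    out = snapshots.copy()
    # Keep only customers still active at the as-of date.
    active = out["churn_date"].isna() | (out["churn_date"] > out["as_of_date"])
    out = out[active].copy()
    days_to_churn = (out["churn_date"] - out["as_of_date"]).dt.days
    # Missing churn_date gives NaN; NaN comparisons are False, so those rows get 0.
    out["churned_30d"] = ((days_to_churn > 0) & (days_to_churn <= horizon_days)).astype(int)
    return out

def out_of_time_split(df: pd.DataFrame, test_start, horizon_days: int = 30):
    """Train on snapshots whose label window closes before test_start; test on later snapshots."""
    test_start = pd.Timestamp(test_start)
    label_end = df["as_of_date"] + pd.Timedelta(days=horizon_days)
    train = df[label_end <= test_start]
    test = df[df["as_of_date"] >= test_start]
    return train, test

# Toy data (illustrative only).
snapshots = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "as_of_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01", "2024-03-01"]),
    "churn_date": pd.to_datetime(["2024-01-15", None, "2024-04-15", "2024-03-20"]),
})
labeled = add_churn_label(snapshots)
train, test = out_of_time_split(labeled, "2024-03-01")
```

In this toy data, the February snapshot is dropped from training because its 30-day label window overlaps the test period. If the same customer appears in several snapshots, a group-aware splitter keyed on `customer_id` is one way to address identity leakage within the training folds.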
Part 2 — Preprocessing numeric and categorical features
Build the feature-preparation layer. Explain how you handle missing values, scaling, and encoding for the two column types, and how this is wired so it cannot leak.
What This Part Should Cover
Numeric path: imputation strategy and scaling (and when scaling matters: linear/distance models vs. trees).
Categorical path: imputation + an encoding that tolerates unseen categories at serve time.
Containing all transforms inside the pipeline to prevent train/validation leakage.
A plan for high-cardinality categoricals (rare-level grouping, min_frequency, hashing, or target/ordinal encoding with care).
Part 3 — Baselines and model choice
Choose what to model with. Start from a defensible baseline before reaching for something heavier, and say how each option deals with class imbalance.
What This Part Should Cover
A true baseline (majority/prior) to contextualize any "good" AUC.
An interpretable linear model and a non-linear ensemble, with the trade-offs (interpretability/calibration vs. raw performance).
Per-model handling of imbalance (class_weight, sample_weight, resampling, or thresholding) and awareness of which estimators support which knob.
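The comparison of a prior baseline, a linear model, and an ensemble might look like the sketch below. It uses synthetic data from `make_classification` purely for illustration, and the specific estimators are assumptions. `LogisticRegression` accepts `class_weight`, while `GradientBoostingClassifier` has no such parameter, so the example passes balanced `sample_weight` values to `fit` instead.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (about 10% positives), illustrative only.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)

models = {
    "prior": DummyClassifier(strategy="prior"),  # always predicts the class prior
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "gbm": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    if name == "gbm":
        # No class_weight knob on this estimator: reweight rows at fit time instead.
        model.fit(X_tr, y_tr, sample_weight=compute_sample_weight("balanced", y_tr))
    else:
        model.fit(X_tr, y_tr)
    scores[name] = average_precision_score(y_va, model.predict_proba(X_va)[:, 1])
```

The prior baseline's average precision equals the positive rate of the validation set, which gives a floor against which the other two scores can be read.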
Part 4 — Hyperparameter tuning with cross-validation
Tune the chosen pipeline with cross-validation. Specify the CV scheme, the search, and the scoring metric, and explain why the search must wrap the whole pipeline.
What This Part Should Cover
CV scheme consistent with Part 1's split decision (stratified vs. temporal).
A search over both preprocessing and model parameters using __-addressed names.
A scoring metric appropriate to imbalance (PR-AUC / ROC-AUC, not accuracy), and how refit selects the final estimator.
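A sketch of a search that wraps the whole pipeline is shown below. The data is synthetic, the column names are hypothetical, and the grid values are arbitrary placeholders. It uses `StratifiedKFold`, which fits the single-snapshot case; for time-dependent data a chronological splitter such as `TimeSeriesSplit` would be the consistent choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, n).astype(float),
    "monthly_spend": rng.normal(30, 10, n),
    "plan_type": rng.choice(["basic", "pro", "family"], n),
})
# Synthetic label: short-tenure customers churn more often (illustrative only).
p = 1 / (1 + np.exp(0.08 * X["tenure_months"].to_numpy() - 0.5))
y = (rng.random(n) < p).astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     ["tenure_months", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan_type"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Double underscores walk the nesting: step name, then sub-step, then parameter.
param_grid = {
    "prep__num__impute__strategy": ["median", "mean"],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="average_precision",  # PR-AUC style metric, suited to imbalance
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    refit=True,  # refit the best configuration on all of X, y
)
search.fit(X, y)
best_pipeline = search.best_estimator_
```

Since the estimator handed to `GridSearchCV` is the full pipeline, every fold re-fits the imputer, scaler, and encoder on that fold's training rows only.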
Part 5 — Evaluation and threshold selection
Evaluate the tuned model on the untouched test set and turn probabilities into decisions. Explain which metrics you report and how you pick the operating threshold.
What This Part Should Cover
Final evaluation on the held-out test set (not CV scores) with imbalance-aware metrics.
A principled, cost-aware threshold choice rather than a hard-coded 0.5.
Connecting the threshold back to the retention economics from the clarifying questions.
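One way to make the threshold cost-aware is to sweep candidate cutoffs over held-out scores and keep the one with the highest expected net value. The sketch below assumes hypothetical economics: an `offer_cost` per contacted customer, a `saved_value` per retained churner, and a `save_rate` (the fraction of contacted churners who stay). These parameters and the toy numbers are illustrative assumptions, not figures from the source.

```python
import numpy as np

def pick_threshold(y_true, proba, offer_cost, saved_value, save_rate):
    """Return (threshold, net_value) maximizing expected net value on a labeled set.

    Contacting a true churner is worth save_rate * saved_value - offer_cost;
    contacting a non-churner costs offer_cost.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    best_t, best_value = float("inf"), 0.0  # baseline: contact nobody
    for t in np.unique(proba):
        flagged = proba >= t
        tp = np.sum(flagged & (y_true == 1))
        fp = np.sum(flagged & (y_true == 0))
        value = tp * (save_rate * saved_value - offer_cost) - fp * offer_cost
        if value > best_value:
            best_t, best_value = float(t), float(value)
    return best_t, best_value

# Toy validation scores (illustrative only).
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.2, 0.3, 0.4, 0.7, 0.9]
threshold, net_value = pick_threshold(y_val, p_val, offer_cost=10, saved_value=100, save_rate=0.3)
```

The sweep should run on validation data so the test set stays untouched for the final report. If the probabilities are well calibrated, the same economics imply a break-even rule: contact a customer when p * save_rate * saved_value exceeds offer_cost, which for these toy numbers is p above roughly 0.33.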
Part 6 — Probability calibration
The downstream retention logic uses the probability, so explain when and how you calibrate, and how you verify calibration.
What This Part Should Cover
Recognizing which models need calibration and why it matters when probabilities drive decisions.
A way to measure calibration quality (reliability diagram, Brier score), not just assert it.
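Calibration and its measurement can be sketched as below, on synthetic data. The choice of isotonic regression and of a gradient-boosting base model are assumptions for illustration; sigmoid (Platt) scaling is the usual alternative when calibration data is scarce. The Brier score of a constant prior prediction is computed as a reference point.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, illustrative only.
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

# Internal 5-fold CV: each fold's model is calibrated on rows it did not train on.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    method="isotonic",
    cv=5,
)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)[:, 1]

# Brier score: mean squared error between predicted probability and outcome (lower is better).
brier = brier_score_loss(y_te, proba)
baseline_brier = brier_score_loss(y_te, np.full(len(y_te), y_tr.mean()))

# Points of a reliability diagram: observed positive rate vs. mean prediction per bin.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10, strategy="quantile")
```

Plotting `mean_pred` against `frac_pos` gives the reliability diagram; points near the diagonal indicate that a predicted 20% really corresponds to about a 20% churn rate.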
Part 7 — Packaging the final pipeline for production
Ship the artifact. Describe what you persist, how scoring works at serve time, and what schema/robustness concerns you handle.
What This Part Should Cover
Serializing the complete fitted pipeline + threshold + schema/metadata as one bundle.
A clean scoring path that aligns input columns to the training schema and emits both probability and decision.
Operational concerns: version pinning, input validation, and reproducibility.
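A minimal packaging sketch follows. The bundle layout (a dict with `pipeline`, `threshold`, `features`, and `sklearn_version` keys), the `score` helper, the feature names, and the 0.35 threshold are all hypothetical choices made for illustration. The pipeline is deliberately tiny so the example stays self-contained.

```python
import os
import tempfile

import joblib
import pandas as pd
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

FEATURES = ["tenure_months", "plan_type"]  # hypothetical training schema

train = pd.DataFrame({
    "tenure_months": [1, 3, 24, 36, 2, 48],
    "plan_type": ["basic", "basic", "pro", "pro", "basic", "pro"],
})
y = [1, 1, 0, 0, 1, 0]

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["tenure_months"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan_type"]),
    ])),
    ("clf", LogisticRegression()),
]).fit(train[FEATURES], y)

# Persist everything needed to reproduce a decision as one artifact.
bundle = {
    "pipeline": pipe,
    "threshold": 0.35,  # placeholder; would come from the cost-based analysis
    "features": FEATURES,
    "sklearn_version": sklearn.__version__,  # check at load time; pin in deployment
}
bundle_path = os.path.join(tempfile.mkdtemp(), "churn_bundle.joblib")
joblib.dump(bundle, bundle_path)

def score(raw: pd.DataFrame, path: str) -> pd.DataFrame:
    """Load the bundle, align columns to the training schema, return probability and flag."""
    b = joblib.load(path)
    missing = set(b["features"]) - set(raw.columns)
    if missing:
        raise ValueError(f"missing input columns: {sorted(missing)}")
    X = raw[b["features"]]  # enforce training column order and drop extras
    proba = b["pipeline"].predict_proba(X)[:, 1]
    return pd.DataFrame(
        {"churn_proba": proba, "flag": proba >= b["threshold"]}, index=raw.index
    )

# Serve-time input: shuffled columns, an extra column, and an unseen category.
new = pd.DataFrame({
    "plan_type": ["enterprise", "basic"],
    "tenure_months": [2, 40],
    "extra": [0, 0],
})
scored = score(new, bundle_path)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and with the same library versions that produced them.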
What a Strong Answer Covers
Across all parts, a strong answer treats this as one leakage-safe, reproducible pipeline rather than seven isolated snippets, and keeps coming back to the cross-cutting threads:
Leakage discipline end to end: the as-of/label boundary in Part 1 is honored by fitting all preprocessing inside the CV folds in Parts 2 and 4.
Imbalance and business cost as a through-line: the same retention economics inform metric choice (Part 4), threshold (Part 5), and the case for calibration (Part 6).
Reproducibility & production-readiness: seeds, version pinning, schema validation, and a single serializable artifact.
Post-deployment thinking: monitoring for data/concept drift, score-distribution and calibration stability, and measuring real retention lift, not just offline AUC.
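Score-distribution stability can be tracked with a simple statistic computed between a reference window and a recent window of model scores. The sketch below uses the population stability index (PSI), which is one common heuristic and not something the source prescribes; the function name, bin count, and the simulated score distributions are assumptions for illustration.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two score samples, using quantile bins of the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Inner bin edges from reference quantiles; the outer bins are open-ended.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side="right"),
                             minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side="right"),
                             minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 8, 5000)  # simulated scores at training time
stable_scores = rng.beta(2, 8, 5000)     # same distribution, new sample
shifted_scores = rng.beta(4, 6, 5000)    # simulated drift toward higher scores

psi_stable = population_stability_index(reference_scores, stable_scores)
psi_shifted = population_stability_index(reference_scores, shifted_scores)
```

A rising PSI only says the score distribution moved. Whether the model has degraded still needs labeled outcomes, for example by recomputing the Brier score once the 30-day labels mature.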
Follow-up Questions
Suppose churn rate drifts over time (seasonality, a pricing change). How would you detect this in production and decide when to retrain, and would you change your validation scheme?
A stakeholder asks why a specific customer was flagged. How would you produce per-prediction explanations (e.g., coefficients for the linear model, or SHAP for the tree model) without breaking the pipeline abstraction?
The retention team can only contact the top 2% of customers per day (a capacity constraint). How does that change which metric you optimize and how you set the threshold (think ranking / precision@k rather than a fixed cutoff)?
If you replaced GridSearchCV with a more expensive search and a larger model, how would you guard against overfitting the model-selection process itself (e.g., nested cross-validation)?
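For the last follow-up, nested cross-validation can be sketched by passing the search object itself to `cross_val_score`: the inner loop tunes hyperparameters, and the outer loop scores the entire tuning procedure on folds it never saw. The data is synthetic and the grid is an arbitrary placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, illustrative only.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization

search = GridSearchCV(
    pipe,
    {"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="average_precision",
    cv=inner_cv,
)
# Each outer fold runs a fresh grid search on its training portion only.
outer_scores = cross_val_score(search, X, y, scoring="average_precision", cv=outer_cv)
nested_estimate = outer_scores.mean()
```

The gap between `search.best_score_` from a plain search and `nested_estimate` gives a rough sense of how optimistic the model-selection score is.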