Build a Predictive Model for a Product Metric
You are interviewing for a data scientist role and are asked to design a predictive model for a key product metric in a consumer app, such as predicting whether a user will send a message, complete sign-up, or perform another target action.
Constraints & Assumptions
- Treat this as a statistics and machine-learning interview question.
- Assume you have historical product logs, user/session attributes, timestamps, and the target action.
- The model is for binary classification unless the interviewer specifies a different target.
- Explain both practical modeling steps and key mathematical concepts.
Clarifying Questions to Ask
- What is the exact target action and prediction horizon?
- What decision will the model support: ranking, targeting, forecasting, or diagnosis?
- What features are available at prediction time?
- What are the costs of false positives and false negatives?
Part 1 - Define Target and Features
How would you define the prediction problem, target variable, and feature space?
What This Part Should Cover
- Unit of prediction, prediction time, label window, and how the positive and negative classes are defined.
- Feature groups such as historical behavior, recency/frequency, device, context, network, campaign, and product interactions.
- Leakage prevention and cold-start considerations.
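The points above can be made concrete with a small sketch. It assumes a hypothetical log schema of `(user_id, timestamp, event_name)` tuples, a 7-day label window, and `send_message` as the target action; none of these come from the prompt, and the function name `build_example` is illustrative only. Features use only events strictly before the prediction time, and the label looks only at the window that starts at the prediction time.

```python
from datetime import datetime, timedelta

def build_example(events, user_id, prediction_time, label_window=timedelta(days=7)):
    """Return (features, label) for one user at one prediction time.

    events: list of (user_id, timestamp, event_name) tuples (hypothetical schema).
    """
    window_end = prediction_time + label_window
    # Features may only use events strictly before the prediction time (no leakage).
    history = [e for e in events if e[0] == user_id and e[1] < prediction_time]
    # The label only looks at the window [prediction_time, window_end).
    future = [e for e in events
              if e[0] == user_id and prediction_time <= e[1] < window_end]

    last_seen = max((e[1] for e in history), default=None)
    features = {
        "n_events_before": len(history),
        # None marks a cold-start user with no history before the prediction time.
        "days_since_last_event": (
            (prediction_time - last_seen).days if last_seen is not None else None
        ),
    }
    label = int(any(e[2] == "send_message" for e in future))
    return features, label

events = [
    ("u1", datetime(2024, 1, 1), "open_app"),
    ("u1", datetime(2024, 1, 3), "open_app"),
    ("u1", datetime(2024, 1, 9), "send_message"),
    ("u2", datetime(2024, 1, 20), "send_message"),
]
features_u1, label_u1 = build_example(events, "u1", datetime(2024, 1, 5))
features_u2, label_u2 = build_example(events, "u2", datetime(2024, 1, 5))
```

Here `u1` is a positive example with two prior events, while `u2` is a cold-start user whose only event falls outside the label window, so the label is negative.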
Part 2 - Prepare Data and Splits
Describe preprocessing and how you would set up train, validation, and test splits.
What This Part Should Cover
- Missing values, outliers, categorical encoding, scaling when needed, class imbalance, and duplicate handling.
- Time-based splits to mimic future prediction and avoid leakage.
- Cross-validation or holdout strategy appropriate to the data volume and temporal drift.
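A time-based split can be sketched as follows. The row schema (dicts with a `ts` key) and the cutoff dates are assumptions made for illustration. The idea is that every training row precedes every validation row, which in turn precedes every test row, so evaluation mimics predicting the future.

```python
from datetime import datetime

def time_based_split(rows, train_end, valid_end):
    """Split rows by timestamp into train / validation / test.

    rows: list of dicts with a "ts" datetime key (hypothetical schema).
    """
    train = [r for r in rows if r["ts"] < train_end]
    valid = [r for r in rows if train_end <= r["ts"] < valid_end]
    test = [r for r in rows if r["ts"] >= valid_end]
    return train, valid, test

# Twelve monthly rows in 2024 with a toy label.
rows = [{"ts": datetime(2024, m, 15), "y": m % 2} for m in range(1, 13)]
train, valid, test = time_based_split(
    rows, train_end=datetime(2024, 9, 1), valid_end=datetime(2024, 11, 1)
)
```

Any preprocessing statistics (imputation values, scaling parameters, category vocabularies) would then be fit on `train` only and applied unchanged to `valid` and `test`.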
Part 3 - Explain Logistic Regression
Write down the mathematical form of the logistic function and explain why it is appropriate for binary classification.
What This Part Should Cover
- The form P(y=1|x) = 1 / (1 + exp(-(beta_0 + beta^T x))).
- Mapping linear scores to probabilities between 0 and 1.
- Interpretability, calibration, log-loss training, and use as a baseline.
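The formula and the log-loss objective above can be written out in a few lines. This is a minimal sketch using only the standard library; the function names and the example coefficients are illustrative, not fitted values.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)

def predict_proba(x, beta_0, beta):
    """P(y=1|x) for feature vector x, intercept beta_0, and weights beta."""
    score = beta_0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(score)

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Linear score is -1.0 + 0.5*1.0 + 0.25*2.0 = 0, so the probability is 0.5.
p = predict_proba([1.0, 2.0], beta_0=-1.0, beta=[0.5, 0.25])
```

Because the linear score equals the log-odds, each coefficient can be read as the change in log-odds per unit change in its feature, which is the source of the model's interpretability.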
Part 4 - Explain Random Forests
What is "random" in Random Forests, and why does it help?
What This Part Should Cover
- Bootstrap sampling of rows and random feature subsets at splits.
- Ensemble averaging across trees to reduce variance.
- Strengths and limitations compared with logistic regression.
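The two sources of randomness and the averaging step can be illustrated without fitting any trees. The sketch below is not a Random Forest implementation; the function names are hypothetical, and the square-root rule for the feature subset size is a common default for classification rather than something the prompt specifies.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng, k=None):
    """Pick k candidate features for one split (default: floor of sqrt(n_features))."""
    if k is None:
        k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

def average_predictions(per_tree_probs):
    """Average per-tree probabilities row by row (the ensemble step)."""
    n_trees = len(per_tree_probs)
    return [sum(col) / n_trees for col in zip(*per_tree_probs)]

rng = random.Random(0)
rows = bootstrap_indices(1000, rng)
# Rows never drawn are "out of bag" for this tree.
out_of_bag_fraction = 1 - len(set(rows)) / 1000
features = random_feature_subset(16, rng)
avg = average_predictions([[0.2, 0.9], [0.4, 0.7], [0.6, 0.8]])
```

Each bootstrap sample leaves out roughly 1/e (about 37%) of the rows, and restricting each split to a random feature subset makes the trees less correlated with one another. Averaging helps more when trees are less correlated, which is why both kinds of randomness reduce the variance of the ensemble.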
Part 5 - Evaluate and Iterate
How would you evaluate the model and improve it after the first version?
What This Part Should Cover
- Metrics such as AUC, PR-AUC, log loss, calibration, lift, precision/recall, and business impact at a threshold.
- Segment-level error analysis, feature importance, robustness, drift, and monitoring.
- Online experiment or decision-impact measurement before relying on the model in production.
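Two of the metrics above can be computed by hand on a toy example. The labels and scores below are made up for illustration, and the pairwise AUC is quadratic in the number of rows, so it is a teaching sketch rather than something to run on production-sized data.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: three positives, three negatives.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = roc_auc(y_true, scores)  # 8 of 9 positive/negative pairs are ordered correctly
precision, recall = precision_recall(y_true, scores, threshold=0.5)
```

AUC summarizes ranking quality across all thresholds, while precision and recall describe one operating point, so the two answer different questions about the same model.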
What a Strong Answer Covers
A strong answer clearly defines the prediction setup, prevents leakage, explains logistic regression and Random Forests accurately, evaluates with both statistical and business metrics, and describes iteration through experiments and monitoring.
Follow-up Questions
- How would you choose a classification threshold?
- What if the model is well-calibrated overall but poor for new users?
- When would you prefer logistic regression over Random Forests?
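One possible way to approach the threshold question is through misclassification costs. The sketch below assumes correct decisions cost nothing, the scores are calibrated probabilities, and the costs (1 per false positive, 4 per false negative) are invented for illustration. Under those assumptions, predicting positive when p >= cost_fp / (cost_fp + cost_fn) minimizes expected cost; the empirical search is a check against validation data.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold that minimizes expected cost for calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of predicting positive whenever score >= threshold."""
    cost = 0.0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 0:
            cost += cost_fp
        elif pred == 0 and y == 1:
            cost += cost_fn
    return cost

def best_empirical_threshold(y_true, scores, cost_fp, cost_fn):
    """Search observed scores (plus 'predict nothing') for the cheapest threshold."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))

# Toy validation data (made up for illustration).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
theoretical = cost_optimal_threshold(cost_fp=1, cost_fn=4)
empirical = best_empirical_threshold(y_true, scores, cost_fp=1, cost_fn=4)
```

A gap between the theoretical and empirical thresholds, as on this tiny sample, can reflect small-sample noise or miscalibrated scores, which is one reason to check calibration before relying on the cost formula.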