Problem Framing: Regression Vs. Classification
Asked of: Machine Learning Engineer
What's being tested
The interviewer is checking a candidate's ability to frame a prediction task correctly as regression versus classification, pick appropriate baselines and evaluation metrics, and interpret differences in feature / label distributions. For a Machine Learning Engineer at Reddit, this means demonstrating judgment about model objectives tied to product metrics, choosing training/evaluation splits that respect temporal or cohort structure, and specifying production considerations like calibration, uncertainty, and cold-start handling. Expect the interviewer to probe tradeoffs (e.g., accuracy vs. calibration, point vs. distributional forecasts) and to ask how offline evaluation maps to online behavior.
Core knowledge
-
Regression vs. classification decision rule: prefer classification when the target is a categorical label and regression when it is continuous. For thresholded business decisions, weigh two options: predict a probability (classification) and tune the decision threshold, or predict a continuous score and discretize it afterwards.
-
Loss functions and interpretation: common regression losses are MSE (sensitive to outliers), MAE (robust to outliers), and quantile loss for conditional medians/percentiles. Classification losses include log-loss (probabilistic), hinge loss (SVM), and ranking losses for ordering tasks.
-
Evaluation metrics mapped to objective: use RMSE/MAE/R^2 for regression; AUC, log-loss, precision/recall, and F_beta for classification. For imbalanced classes prefer PR-AUC and class-weighted metrics over accuracy.
-
Probabilistic vs. point predictions: when downstream needs uncertainty or thresholds, produce calibrated probabilities and evaluate with Brier score or Expected Calibration Error (ECE). For counts, use Poisson or Negative Binomial and compare via deviance.
-
Distributional issues & transformations: for skewed continuous targets, consider log or Box–Cox transforms, or model conditional quantiles. For zero-inflated targets, use zero-inflated or mixture models.
-
Baseline models: always report a simple baseline: global mean/median predictor for regression, class prior or stratified predictor for classification, and a time-series naive forecast when temporal drift exists.
-
Handling label / feature shift: detect with distributional tests (KS, chi-squared) and monitor covariate shift; prefer time-based splits for evaluation if concept drift or non-iid temporal ordering exists.
-
Cold-start and sparse data: use global-shared models, hierarchical pooling (Bayesian or embedding regularization), or content-based features; warm-start user/item embeddings via side information.
-
Model selection & validation: use k-fold where appropriate, but prefer time-based holdouts for temporal targets; cross-validate hyperparameters with nested CV when possible.
-
Calibration and thresholding: apply Platt scaling or isotonic regression for probability calibration; choose thresholds by optimizing expected downstream utility (not raw accuracy).
-
Production considerations: define latency/memory budgets; prefer XGBoost/LightGBM for tabular speed, and ensure offline–online parity, label-latency handling, and retraining cadence aligned with drift.
-
Interpretability and debugging: use SHAP/partial dependence to explain feature effects; examine residuals by cohort and plot predicted vs actual conditional distributions to diagnose heteroscedasticity.
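The loss-function and baseline items above can be made concrete with a small sketch. It uses only the Python standard library and a made-up target with one large outlier (the values and the names `baseline_report` and `pinball` are illustrative assumptions, not from any particular codebase). It shows that a global-mean baseline wins on MSE while a global-median baseline wins on MAE, which is one reason the choice of loss and the choice of baseline should be made together.

```python
import statistics


def mse(y_true, y_pred):
    """Mean squared error: squares residuals, so outliers dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: grows linearly with the residual."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pinball(y_true, y_pred, q):
    """Quantile (pinball) loss for a quantile q in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Hypothetical skewed target with one large outlier (illustrative values only).
y = [1.0, 2.0, 2.0, 3.0, 4.0, 100.0]

# Constant-prediction baselines: global mean and global median.
mean_pred = [statistics.fmean(y)] * len(y)
median_pred = [statistics.median(y)] * len(y)

baseline_report = {
    "mean": {"mse": mse(y, mean_pred), "mae": mae(y, mean_pred)},
    "median": {"mse": mse(y, median_pred), "mae": mae(y, median_pred)},
}

for name, scores in baseline_report.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

With `q=0.5` the pinball loss is exactly half the MAE, so quantile loss generalizes MAE to other percentiles rather than replacing it.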
Worked example — Model y from x and interpret distributions
Framing questions for the first 30 seconds: clarify what y represents (continuous or bucketed), what product metric it maps to, how long labels take to arrive, and whether downstream decisions require calibrated probabilities or point estimates. Organize the answer into: (1) baseline and metric selection, (2) modeling approach, (3) evaluation protocol, and (4) production concerns. For baselines, propose a global mean/median predictor (scored with MAE/RMSE) and a time-naive forecast if the data is temporal. For modeling, contrast a tree-based model (XGBoost) with a probabilistic model (quantile regression or Bayesian linear/Poisson) if uncertainty matters. For evaluation, pick a time-based holdout, report MAE/RMSE plus residual distribution plots, and compute calibration if producing probabilities. A key tradeoff to flag: minimizing RMSE lets a few outliers dominate the fit and can worsen calibration; choosing MAE or quantile loss may better match product risk tolerance. Close with: "If I had more time I'd run cohort-wise residual analysis, simulate downstream decision thresholds against estimated calibration, and prototype a serving sketch to validate offline–online parity."
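A minimal sketch of the evaluation protocol described above, assuming synthetic time-ordered data and a one-feature least-squares fit as a stand-in for whichever model family is actually chosen. Everything here (the data-generating process, `fit_ols`, the 80/20 cut) is a hypothetical illustration using only the standard library.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical time-ordered data: y is linear in x plus noise and a slow upward drift.
n = 200
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [2.0 * xi + 0.02 * t + random.gauss(0.0, 1.0) for t, xi in enumerate(x)]

# Time-based holdout: fit on the earliest 80%, evaluate on the most recent 20%.
cut = int(0.8 * n)
x_train, y_train = x[:cut], y[:cut]
x_test, y_test = x[cut:], y[cut:]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_ols(xs, ys):
    """One-feature ordinary least squares; returns (slope, intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Baseline 1: global mean of the training labels.
mean_pred = [sum(y_train) / len(y_train)] * len(y_test)
# Baseline 2: time-naive forecast that repeats the previous label
# (assumes labels arrive with no lag, which may not hold in production).
naive_pred = [y[i - 1] for i in range(cut, n)]
# Model: OLS on x only, so the drift over time is deliberately left unmodeled.
slope, intercept = fit_ols(x_train, y_train)
model_pred = [slope * xi + intercept for xi in x_test]

results = {
    name: {"mae": mae(y_test, pred), "rmse": rmse(y_test, pred)}
    for name, pred in [("mean", mean_pred), ("naive", naive_pred), ("ols", model_pred)]
}

# Holdout residuals; a mean far from zero signals bias from the unmodeled drift.
residuals = [t - p for t, p in zip(y_test, model_pred)]
mean_residual = sum(residuals) / len(residuals)

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
print("mean holdout residual:", round(mean_residual, 2))
```

Reporting both baselines next to the model makes the improvement legible, and the non-zero mean residual on the most recent data is the kind of signal that cohort-wise residual analysis is meant to surface.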
A second angle
If the same y can be framed either as a continuous score or as a binary event (e.g., expected session length versus a long-session flag), the exercise shifts toward utility-driven framing. The ML Engineer should articulate the downstream decision: if the action is binary (show vs. don't show), prefer a calibrated classifier and optimize thresholds by expected utility; if the action benefits from rank or magnitude, prefer regression or a direct ranking loss. Constraints such as label sparsity, latency, or interpretability will push you toward different model families: simple calibrated logistic models for a low-latency classifier; gradient-boosted regression for richer continuous predictions with feature interactions. Always justify evaluation splits and metrics by the downstream experiment or serving logic.
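To illustrate threshold selection by expected utility, here is a minimal sketch on made-up held-out scores. The payoff values `gain_tp` and `cost_fp`, the labels, and the scores are all hypothetical. For a perfectly calibrated classifier, acting is worthwhile when p * gain_tp > (1 - p) * cost_fp, that is when p > cost_fp / (gain_tp + cost_fp); the empirical search below is a check on held-out data rather than a replacement for that reasoning.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def total_utility(y_true, probs, threshold, gain_tp, cost_fp):
    """Utility of acting (e.g. showing an item) whenever prob >= threshold."""
    utility = 0.0
    for t, p in zip(y_true, probs):
        if p >= threshold:
            utility += gain_tp if t == 1 else -cost_fp
    return utility


def best_threshold(y_true, probs, gain_tp, cost_fp):
    """Empirical threshold with the highest total utility on held-out data."""
    # Candidates: each observed score, plus one above 1.0 meaning "never act".
    candidates = sorted(set(probs)) + [1.1]
    return max(
        candidates,
        key=lambda th: total_utility(y_true, probs, th, gain_tp, cost_fp),
    )


# Hypothetical held-out labels and classifier scores (illustrative values only).
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

# Quick calibration-aware check before trusting any probability threshold.
print("Brier score:", round(brier_score(labels, scores), 4))

symmetric = best_threshold(labels, scores, gain_tp=1.0, cost_fp=1.0)
costly_fp = best_threshold(labels, scores, gain_tp=1.0, cost_fp=4.0)
print("threshold with equal payoffs:", symmetric)
print("threshold when a false positive costs 4x:", costly_fp)
```

Raising the cost of a false positive pushes the chosen threshold up, which is the behavior a raw-accuracy threshold cannot express.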
Common pitfalls
Pitfall: Choosing accuracy or R^2 blindly. Accuracy/R^2 can hide poor performance on minority classes or in tail regions; report calibrated probability and cohort-level errors aligned to product risk.
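A small sketch of how accuracy can mislead, assuming a hypothetical dataset with 5% positives and a degenerate classifier that always predicts the majority class:

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced labels: 5 positives out of 100 examples.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # never predicts the minority class

acc = accuracy(y_true, always_negative)
rec = recall(y_true, always_negative)
print("accuracy:", acc, "recall on positives:", rec)
```

The classifier looks strong on accuracy while finding none of the positives, which is why precision/recall or PR-AUC should accompany accuracy on imbalanced data.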
Often candidates pick MSE by habit; instead, tie loss to business cost (outlier penalties, false-positive costs) and show alternative choices (MAE, quantile loss, or asymmetric loss) with reasons.
Pitfall: Using random k-fold when data is temporal. This produces optimistic estimates when labels drift. Always prefer time-based holdouts or blocked CV for time-correlated data.
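The optimism of random splits under drift can be sketched with synthetic data. The example assumes a label whose mean rises steadily over time and a deliberately trivial model (the training mean); both are illustrative assumptions, and a single random holdout stands in for one fold of random k-fold.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical label whose mean drifts upward over time (illustrative only).
n = 300
y = [0.05 * t + random.gauss(0.0, 1.0) for t in range(n)]


def mean_model_mae(train, test):
    """Fit a constant (the training mean) and return its MAE on the test labels."""
    pred = sum(train) / len(train)
    return sum(abs(v - pred) for v in test) / len(test)


# Random split: test points are interleaved in time with training points.
indices = list(range(n))
random.shuffle(indices)
test_idx = set(indices[: n // 5])
random_mae = mean_model_mae(
    [y[i] for i in range(n) if i not in test_idx],
    [y[i] for i in sorted(test_idx)],
)

# Time-based split: train on the first 80%, test on the most recent 20%.
cut = n - n // 5
time_mae = mean_model_mae(y[:cut], y[cut:])

print("random-split MAE:", round(random_mae, 2))
print("time-split MAE:  ", round(time_mae, 2))
```

The random split reports a lower error because its test points come from the same time range as the training data; the time-based split is closer to what the model would face when serving future traffic.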
Communication error: not stating assumptions. If you assume iid data, state it; if you assume delayed labels or covariate shift, state how you'll detect and mitigate. Interviewers want those assumptions explicit.
Connections
This topic commonly leads to adjacent pivots: model calibration / uncertainty estimation (Monte Carlo dropout, ensembles), ranking and recommender evaluation (NDCG, position bias), and monitoring for data/model drift in production (concept shift detection, alerting rules).
Further reading
-
Elements of Statistical Learning — classic reference on regression, classification, and model selection.
-
scikit-learn model evaluation docs — practical descriptions of metrics and cross-validation patterns.
-
"On Calibration of Modern Neural Networks" (Guo et al., 2017) — explains calibration issues and remedies for probabilistic classifiers.
Related concepts
- Machine Learning Model Design And Evaluation (Machine Learning)
- Classification Thresholds, Imbalanced Learning And Risk (Machine Learning)
- Logistic Regression, Regularization, And Imbalanced Classification (Machine Learning)
- Classifier Evaluation, Calibration, And Thresholding
- Experiment Design For Product ML (Behavioral & Leadership)
- Supervised ML Fundamentals, Evaluation And Feature Engineering (Machine Learning)