Problem Framing: Regression Vs. Classification

What's being tested

The interviewer is checking a candidate's ability to frame a prediction task correctly as regression versus classification, pick appropriate baselines and evaluation metrics, and interpret differences in feature / label distributions. For a Machine Learning Engineer at Reddit, this means demonstrating judgment about model objectives tied to product metrics, choosing training/evaluation splits that respect temporal or cohort structure, and specifying production considerations like calibration, uncertainty, and cold-start handling. Expect the interviewer to probe tradeoffs (e.g., accuracy vs. calibration, point vs. distributional forecasts) and to ask how offline evaluation maps to online behavior.

Core knowledge

Regression vs. classification decision rule: prefer classification when the output is a categorical label and regression when it is a continuous outcome. For thresholded business decisions, compare two routes: predict a probability (classification) and tune the decision threshold, or predict a continuous score and discretize it afterward.
Loss functions and interpretation: common regression losses are MSE (targets the conditional mean; sensitive to outliers), MAE (targets the conditional median; more robust to outliers), and quantile loss for other conditional percentiles. For classification: log-loss (probabilistic), hinge loss (SVM), and ranking losses for ordering tasks.
Evaluation metrics mapped to objective: use RMSE/MAE/R^2 for regression; AUC, log-loss, precision/recall, and F_beta for classification. For imbalanced classes prefer PR-AUC and class-weighted metrics over accuracy.
Probabilistic vs. point predictions: when downstream needs uncertainty or thresholds, produce calibrated probabilities and evaluate with Brier score or Expected Calibration Error (ECE). For counts, use Poisson or Negative Binomial and compare via deviance.
Distributional issues & transformations: for skewed continuous targets, consider log or Box–Cox transforms, or model conditional quantiles. For zero-inflated targets, use zero-inflated or mixture models.
Baseline models: always report a simple baseline: global mean/median predictor for regression, class prior or stratified predictor for classification, and a time-series naive forecast when temporal drift exists.
Handling label / feature shift: detect with distributional tests (KS, chi-squared) and monitor covariate shift; prefer time-based splits for evaluation if concept drift or non-iid temporal ordering exists.
Cold-start and sparse data: use global-shared models, hierarchical pooling (Bayesian or embedding regularization), or content-based features; warm-start user/item embeddings via side information.
Model selection & validation: use k-fold where appropriate, but prefer time-based holdouts for temporal targets; cross-validate hyperparameters with nested CV when possible.
Calibration and thresholding: apply Platt scaling or isotonic regression for probability calibration; choose thresholds by optimizing expected downstream utility (not raw accuracy).
Production considerations: define latency/memory budgets; prefer XGBoost/LightGBM for tabular speed, and ensure offline–online parity, label-latency handling, and a retraining cadence aligned with drift.
Interpretability and debugging: use SHAP/partial dependence to explain feature effects; examine residuals by cohort and plot predicted vs actual conditional distributions to diagnose heteroscedasticity.
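The calibration metrics named in the list above (Brier score and ECE) are simple enough to sketch by hand. The following is a minimal illustration for the binary case, assuming equal-width probability bins and comparing mean predicted probability with the observed positive rate in each bin. The function names and toy data are hypothetical, not taken from any particular library.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins (an assumed binning scheme).

    For each non-empty bin, take |mean predicted probability - observed
    positive rate| and weight it by the share of samples in that bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_pred - pos_rate)
    return ece


# Toy data, made up for illustration.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

On this toy data the model ranks perfectly, yet both metrics are non-zero because the probabilities are slightly under-confident. That gap between ranking quality and calibration is the distinction the list draws.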

Worked example — Model y from x and interpret distributions

First 30 seconds, framing questions: clarify what y represents (continuous or bucketed), what product metric it maps to, how long labels lag, and whether downstream decisions require calibrated probabilities or point estimates. Organize the answer into: (1) baseline and metric selection, (2) modeling approach, (3) evaluation protocol, and (4) production concerns. For baselines, propose a global mean/median predictor (scored with MAE/RMSE) and a naive time-series forecast if the data is temporal. For modeling, contrast a tree-based model (XGBoost) with a probabilistic model (quantile regression or Bayesian linear/Poisson) if uncertainty matters. For evaluation, pick a time-based holdout, report MAE/RMSE plus residual distribution plots, and compute calibration if producing probabilities. A key tradeoff to flag: minimizing RMSE lets outliers dominate the fit and can worsen calibration; MAE or quantile loss may better match product risk tolerance. Close with: "If I had more time I'd run cohort-wise residual analysis, simulate downstream decision thresholds against estimated calibration, and prototype a serving sketch to validate offline–online parity."
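The baseline step of this answer can be sketched in a few lines. The example below is a rough illustration under several assumptions: y is a list ordered oldest first, the most recent 25% forms the holdout, and the naive forecast is one-step-ahead (each label is assumed to arrive before the next prediction, which ignores label lag). The name `baseline_report` and the data are hypothetical.

```python
import math
import statistics


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def baseline_report(y_by_time, holdout_fraction=0.25):
    """Score constant and naive baselines on the most recent slice of y."""
    cut = int(len(y_by_time) * (1 - holdout_fraction))
    train, test = y_by_time[:cut], y_by_time[cut:]
    preds = {
        "mean": [statistics.mean(train)] * len(test),
        "median": [statistics.median(train)] * len(test),
        # One-step-ahead naive forecast: predict the previous observed value.
        "naive_last": [train[-1]] + test[:-1],
    }
    return {
        name: {"mae": mae(test, p), "rmse": rmse(test, p)}
        for name, p in preds.items()
    }


y = [10, 12, 11, 13, 14, 16, 18, 20]  # hypothetical series, oldest first
report = baseline_report(y)
```

With a trending series like this one, the naive forecast beats both constant predictors by a wide margin, so a candidate model that only beats the global mean has not yet shown it learned anything beyond the trend.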

A second angle

If the same y can be framed either as a continuous score or as a binary event (e.g., expected session length versus a long-session flag), the exercise shifts toward utility-driven framing. The ML Engineer should articulate the downstream decision: if the action is binary (show vs. don't show), prefer a calibrated classifier and choose the threshold by expected utility; if the action depends on rank or magnitude, prefer regression or a direct ranking loss. Constraints such as label sparsity, latency, or interpretability will push you toward different model families: a simple calibrated logistic model for a low-latency classifier, or gradient-boosted regression for richer continuous predictions with feature interactions. Always justify evaluation splits and metrics by the downstream experiment or serving logic.
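Choosing a threshold by expected utility can be illustrated with a small grid search. This is a sketch only: the utility values, candidate thresholds, and data below are invented for the example, and in practice the utilities would come from the product decision being modeled.

```python
def expected_utility(probs, labels, threshold, u_tp, u_fp, u_fn, u_tn):
    """Average utility per example when acting on every p >= threshold."""
    total = 0.0
    for p, y in zip(probs, labels):
        act = p >= threshold
        if act and y == 1:
            total += u_tp
        elif act and y == 0:
            total += u_fp
        elif not act and y == 1:
            total += u_fn
        else:
            total += u_tn
    return total / len(probs)


def best_threshold(probs, labels, candidates, **utilities):
    """Return the candidate threshold with the highest average utility."""
    return max(
        candidates,
        key=lambda t: expected_utility(probs, labels, t, **utilities),
    )


# Hypothetical validation scores and outcomes.
probs = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
# Assumed costs: a false positive hurts twice as much as a true positive helps.
utilities = dict(u_tp=1.0, u_fp=-2.0, u_fn=0.0, u_tn=0.0)
chosen = best_threshold(probs, labels, [0.2, 0.35, 0.5, 0.7], **utilities)
```

For perfectly calibrated probabilities and these utilities, acting pays off only when p - 2(1 - p) >= 0, that is p >= 2/3, which is consistent with the grid search settling on 0.7 rather than the default 0.5.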

Common pitfalls

Pitfall: Choosing accuracy or R^2 blindly. Accuracy/R^2 can hide poor performance on minority classes or in tail regions; report calibration metrics and cohort-level errors aligned to product risk.
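A quick way to make this pitfall concrete is to score a predictor that always outputs the majority class. The sketch below uses a made-up 5% positive rate; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted positive."""
    actual_pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not actual_pos:
        return 0.0  # no positives to recover; treated as 0 in this sketch
    return sum(p == positive for p in actual_pos) / len(actual_pos)


# Hypothetical imbalanced labels: 5 positives out of 100.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # always predict the majority class

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

Here accuracy is 0.95 while recall on the positive class is 0.0, so the headline number says nothing about the cases the product likely cares about.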

Pitfall: Picking MSE by habit. Tie the loss to business cost (outlier penalties, false-positive costs) and present the alternatives (MAE, quantile loss, or an asymmetric loss) with reasons for each.
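The difference between these losses shows up even with constant predictions. The sketch below, on invented data with one large outlier, compares the mean and median as predictors and includes a pinball (quantile) loss as one example of an asymmetric loss; function names are hypothetical.

```python
import statistics


def mse(y_true, pred):
    return sum((y - pred) ** 2 for y in y_true) / len(y_true)


def mae(y_true, pred):
    return sum(abs(y - pred) for y in y_true) / len(y_true)


def pinball_loss(y_true, pred, q):
    """Quantile loss of a constant prediction.

    Under-predictions (y > pred) are weighted by q, over-predictions by 1 - q.
    """
    total = 0.0
    for y in y_true:
        diff = y - pred
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


y = [1, 2, 3, 4, 100]  # made-up target with a single outlier
mean_pred = statistics.mean(y)      # pulled toward the outlier
median_pred = statistics.median(y)  # stays with the bulk of the data

mse_gap = mse(y, median_pred) - mse(y, mean_pred)  # positive: mean wins on MSE
mae_gap = mae(y, mean_pred) - mae(y, median_pred)  # positive: median wins on MAE
```

The mean wins on MSE and the median wins on MAE, so the choice of loss already decides which "typical" value the model chases. With q = 0.9 the pinball loss penalizes under-prediction nine times as heavily as over-prediction, which suits cases where under-estimating is the costlier mistake.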

Pitfall: Using random k-fold when data is temporal. This produces optimistic estimates when labels drift. Always prefer time-based holdouts or blocked CV for time-correlated data.
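One way to build time-respecting folds is an expanding window, where each fold trains on everything before its test block. The generator below is a minimal sketch that assumes rows are already sorted by timestamp; the name `expanding_window_splits` and its parameters are illustrative.

```python
def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) where every test index follows all train indices.

    Assumes index order equals time order. Samples left over after the last
    full fold are not used.
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

Unlike random k-fold, no fold ever trains on rows that come after its test rows, so the error estimate reflects the forward-looking setting the model will face in production.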

Communication error: not stating assumptions. If you assume iid data, state it; if you assume delayed labels or covariate shift, state how you'll detect and mitigate. Interviewers want those assumptions explicit.

Connections

This topic commonly leads to adjacent pivots: model calibration / uncertainty estimation (Monte Carlo dropout, ensembles), ranking and recommender evaluation (NDCG, position bias), and monitoring for data/model drift in production (concept shift detection, alerting rules).
