Applied Machine Learning Modeling And Evaluation

What's being tested

Interviewers are probing whether you can turn an ambiguous product or integrity problem into a defensible applied machine learning plan: define the prediction target, construct labels, choose features, evaluate offline and online, set decision thresholds, and monitor outcomes after launch. At Meta scale, a Data Scientist is expected to reason about model quality through business and user metrics, not just AUC, because targeting, ranking, fraud, and location inference all create asymmetric costs and feedback loops. Strong answers show statistical judgment: how you handle biased labels, calibration, uncertainty, subgroup performance, and tradeoffs between engagement, revenue, safety, privacy, and fairness. The interviewer is not looking for production pipeline architecture; they are looking for whether your modeling choices would lead to better decisions.

Core knowledge

Problem formulation comes before model choice. State whether the task is classification, regression, ranking, uplift modeling, or multi-objective optimization. For rollout targeting, predicting “will use feature” is different from predicting “incremental lift if exposed,” which requires treatment-control data and estimates like $E[Y(1)-Y(0)\mid X]$.
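The difference between “will use feature” and “incremental lift if exposed” can be made concrete with a small sketch. It assumes data from a randomized rollout and estimates $E[Y(1)-Y(0)\mid X]$ as a per-segment difference in mean outcomes; the segment names, the `segment_uplift` helper, and the toy rows are all hypothetical.

```python
from collections import defaultdict

def segment_uplift(rows):
    """Estimate E[Y(1) - Y(0) | X = segment] as a difference in mean outcomes.

    rows: iterable of (segment, treated, outcome) from a randomized rollout.
    Segments missing either arm are skipped because no contrast exists.
    """
    # Per segment: [sum of treated outcomes, n treated, sum of control outcomes, n control]
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for segment, treated, outcome in rows:
        s = stats[segment]
        if treated:
            s[0] += outcome
            s[1] += 1
        else:
            s[2] += outcome
            s[3] += 1
    return {
        seg: s[0] / s[1] - s[2] / s[3]
        for seg, s in stats.items()
        if s[1] > 0 and s[3] > 0
    }

# Toy data (made up): power users adopt the feature with or without exposure.
rows = [
    ("new_user", 1, 1), ("new_user", 1, 0), ("new_user", 0, 0), ("new_user", 0, 0),
    ("power_user", 1, 1), ("power_user", 1, 1), ("power_user", 0, 1), ("power_user", 0, 1),
]
uplift = segment_uplift(rows)
```

In this toy data a usage-propensity model would rank power users first, while the uplift estimate favors new users, which is the distinction the interviewer is listening for.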
Label design is often the hardest part. A fraud label from chargebacks is delayed and biased toward detected fraud; a “home vs office vs public” label may rely on weak supervision from repeated nighttime presence, user-declared signals, or aggregate patterns. Always discuss label noise, time windows, leakage, and whether negatives are true negatives or merely unlabeled positives.
Feature engineering should map to causal or predictive mechanisms. For ads or shopping ranking, useful DS-level features include user affinity, item quality, price competitiveness, historical conversion rate, seller reliability, freshness, social proof, and query/context match. For privacy-sensitive inference, prefer aggregated, coarse, consented, and non-identifying features over raw location traces.
Train/validation/test splits must reflect deployment. Random splits can overstate performance when users, sellers, devices, or locations repeat across rows. Use time-based splits, user-level holdouts, seller-level holdouts, or geo-level validation when generalization across future behavior or unseen entities matters.
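A minimal sketch of the two split styles mentioned above, under the assumption that each row carries an entity key and a timestamp. The `assign_split` and `time_split` helpers, the salt string, and the field names are illustrative, not a standard API.

```python
import hashlib

def assign_split(entity_id, holdout_fraction=0.2, salt="split-v1"):
    """Deterministically map an entity (user, seller, device) to train or test.

    Hashing the entity key keeps all of its rows on one side of the split.
    """
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # roughly uniform in [0, 1)
    return "test" if bucket < holdout_fraction else "train"

def time_split(rows, cutoff, time_key="ts"):
    """Train on rows strictly before the cutoff and evaluate on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

rows = [
    {"user_id": "u1", "ts": 1, "y": 0},
    {"user_id": "u1", "ts": 5, "y": 1},
    {"user_id": "u2", "ts": 2, "y": 0},
    {"user_id": "u2", "ts": 7, "y": 0},
]

# Every row for a given user lands in the same split.
splits_by_user = {}
for r in rows:
    splits_by_user.setdefault(r["user_id"], set()).add(assign_split(r["user_id"]))

train_rows, test_rows = time_split(rows, cutoff=5)
```

The hash-based assignment is stable across reruns, so the same users stay held out when the dataset is refreshed.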
Baseline models are essential. Start with interpretable baselines such as logistic regression, regularized linear models, or simple scorecards, then compare to XGBoost, random forests, or neural ranking models if nonlinear interactions matter. A complex model is only justified if it improves decision quality, calibration, subgroup robustness, or ranking metrics.
Offline metrics should match the decision. For binary classification, report ROC-AUC, PR-AUC, precision, recall, false positive rate, false negative rate, and calibration. For ranking, use NDCG@K, MAP@K, MRR, expected revenue, and conversion-weighted utility, and track guardrails such as hide/report rates or buyer dissatisfaction alongside them.
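As one concrete ranking metric, NDCG@K can be sketched as below. This version uses linear gain and computes the ideal ordering from the same list of graded relevances, which assumes every relevant item appears in the list being scored; other gain definitions exist.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# relevances are listed in the order the ranker showed the items
perfect = ndcg_at_k([3, 2, 1], k=3)   # already in ideal order
buried = ndcg_at_k([0, 0, 1], k=3)    # the only relevant item sits in position 3
```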
Cost-sensitive evaluation is critical when errors are asymmetric. Define expected cost:
$\text{Expected Cost}(t)=C_{FP}\cdot FP(t)+C_{FN}\cdot FN(t)+C_{review}\cdot N_{review}(t)$
Choose thresholds based on business harm, user harm, review capacity, or risk tier rather than maximizing accuracy.
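The expected-cost formula above can be turned into a threshold search. This is a sketch under stated assumptions: every case scoring at or above the threshold is both actioned and sent to review, the unit costs are made-up numbers, and `max_reviews` is a hypothetical capacity constraint.

```python
def expected_cost(scores, labels, threshold, c_fp, c_fn, c_review=0.0):
    """C_FP * FP(t) + C_FN * FN(t) + C_review * N_review(t) at threshold t."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    n_review = sum(1 for s in scores if s >= threshold)
    return c_fp * fp + c_fn * fn + c_review * n_review

def best_threshold(scores, labels, c_fp, c_fn, c_review=0.0, max_reviews=None):
    """Return (threshold, cost) minimizing cost, optionally under a review cap."""
    # Candidate thresholds: each observed score, plus "flag nothing".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        flagged = sum(1 for s in scores if s >= t)
        if max_reviews is not None and flagged > max_reviews:
            continue
        cost = expected_cost(scores, labels, t, c_fp, c_fn, c_review)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

scores = [0.05, 0.10, 0.30, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 1]  # 1 = confirmed fraud

unconstrained = best_threshold(scores, labels, c_fp=1, c_fn=10, c_review=0.5)
capped = best_threshold(scores, labels, c_fp=1, c_fn=10, c_review=0.5, max_reviews=2)
```

With a missed fraud costing ten times a false positive, the unconstrained search accepts one false positive to catch both fraud cases; capping reviews at two forces a higher threshold and a higher total cost.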
Calibration matters whenever scores drive thresholds, prioritization, or expected value. A calibrated model satisfies $P(Y=1\mid \hat p=s)\approx s$. Use reliability curves, ECE, Brier score, Platt scaling, isotonic regression, and segment-level calibration checks by country, device type, traffic source, seller size, or user tenure.
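Two of the calibration checks named above, Brier score and ECE, are simple enough to sketch directly. The ECE here uses equal-width bins over $[0, 1]$, which is one common choice rather than the only one; the toy inputs are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted score and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(mean_p - observed)
    return ece

# Overconfident toy model: predicts 0.9 but only half the cases are positive.
probs = [0.9] * 10
labels = [1] * 5 + [0] * 5
ece = expected_calibration_error(probs, labels)
brier = brier_score(probs, labels)
```

Running the same functions on a filtered subset (one country, one seller-size band) gives the segment-level check.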
Selection bias appears in ranking and rollout systems. Historical clicks and conversions are observed only for items users saw, so naïve training learns exposure policy artifacts. Discuss randomized exploration buckets, inverse propensity weighting, counterfactual evaluation, or interleaving tests when evaluating new ranking logic.
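Inverse propensity weighting can be sketched as follows, assuming the logging system recorded the probability with which each shown item was chosen. The log tuple layout, the `ipw_value` helper, and the clipping constant are assumptions for illustration; clipping trades some bias for lower variance.

```python
def ipw_value(logs, new_policy_prob, clip=10.0):
    """Estimate the average reward a new policy would earn from logged data.

    logs: iterable of (context, action, reward, logging_prob), where
          logging_prob is the probability the logging policy chose `action`.
    new_policy_prob: function (context, action) -> probability under the new policy.
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logs:
        weight = min(new_policy_prob(context, action) / logging_prob, clip)
        total += weight * reward
        n += 1
    return total / n

# Logging policy showed item A or B uniformly at random (propensity 0.5 each).
logs = [
    ("ctx", "A", 1, 0.5),
    ("ctx", "A", 1, 0.5),
    ("ctx", "B", 0, 0.5),
    ("ctx", "B", 0, 0.5),
]

def always_a(context, action):
    return 1.0 if action == "A" else 0.0

def always_b(context, action):
    return 1.0 if action == "B" else 0.0

value_a = ipw_value(logs, always_a)
value_b = ipw_value(logs, always_b)
```

The estimate is only defined where the logging policy had nonzero probability of showing the action, which is why randomized exploration buckets matter.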
Multi-objective ranking requires explicit utility design. For shopping, a score might combine predicted purchase value, user satisfaction, seller quality, integrity risk, and diversity:
$S = w_1P(\text{purchase})\cdot \text{margin} + w_2P(\text{save}) - w_3P(\text{fraud}) - w_4P(\text{negative feedback})$
A strong answer explains how weights are set, constrained, and tested.
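To show how the weights change the ordering, the score $S$ above can be sketched as a plain function. The item fields, weight keys, and numbers are hypothetical; in practice each probability would come from its own model and the weights would be tuned and tested online.

```python
def utility_score(item, weights):
    """S = w1 * P(purchase) * margin + w2 * P(save) - w3 * P(fraud) - w4 * P(negative feedback)."""
    return (
        weights["purchase"] * item["p_purchase"] * item["margin"]
        + weights["save"] * item["p_save"]
        - weights["fraud"] * item["p_fraud"]
        - weights["negative"] * item["p_negative"]
    )

def rank_items(items, weights):
    """Order items by descending utility."""
    return sorted(items, key=lambda it: utility_score(it, weights), reverse=True)

items = [
    {"id": "a", "p_purchase": 0.10, "margin": 5.0, "p_save": 0.2, "p_fraud": 0.01, "p_negative": 0.02},
    {"id": "b", "p_purchase": 0.12, "margin": 5.0, "p_save": 0.1, "p_fraud": 0.20, "p_negative": 0.05},
]

with_integrity = {"purchase": 1.0, "save": 0.5, "fraud": 2.0, "negative": 1.0}
revenue_only = {"purchase": 1.0, "save": 0.5, "fraud": 0.0, "negative": 0.0}

order_with_integrity = [it["id"] for it in rank_items(items, with_integrity)]
order_revenue_only = [it["id"] for it in rank_items(items, revenue_only)]
```

Item b converts slightly better but carries more fraud risk, so it ranks first only when the integrity penalties are zeroed out.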
Fairness and subgroup analysis are model evaluation responsibilities. Check performance by protected or sensitive-adjacent groups where appropriate, plus operational segments like new users, small sellers, low-connectivity regions, and sparse-history users. Look for disparities in false positive rates, ranking exposure, calibration, and downstream outcomes.
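A subgroup check on false positive rates might look like the sketch below. The record layout `(group, score, label)` and the segment names are assumptions; the same pattern applies to calibration or exposure by swapping the statistic.

```python
from collections import defaultdict

def fpr_by_group(records, threshold):
    """False positive rate per group: flagged negatives / all negatives.

    records: iterable of (group, score, label) with label 1 = true positive class.
    Groups with no negatives are omitted because the rate is undefined.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, score, label in records:
        if label == 0:
            negatives[group] += 1
            if score >= threshold:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}

records = [
    ("new_user", 0.9, 0),
    ("new_user", 0.2, 0),
    ("tenured", 0.1, 0),
    ("tenured", 0.3, 0),
    ("tenured", 0.8, 1),
]
fpr = fpr_by_group(records, threshold=0.5)
```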
Online evaluation closes the loop. Offline wins do not guarantee product wins because models change user behavior. Use A/B tests with primary metrics, guardrails, and ramp plans; account for novelty effects and keep long-term holdouts where needed. Monitor drift in feature distributions, score distributions, calibration, precision at actioned thresholds, and product metrics after launch.
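One way to monitor score-distribution drift is the population stability index (PSI). The text above does not name a specific drift statistic, so PSI is a swapped-in choice here. The sketch assumes scores lie in $[0, 1]$, uses equal-width bins, and adds a small `eps` inside the logarithm to avoid dividing by zero for empty bins.

```python
import math

def bin_shares(scores, n_bins=10):
    """Share of scores falling in each equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population stability index between a baseline and a current score sample."""
    expected = bin_shares(baseline, n_bins)
    actual = bin_shares(current, n_bins)
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline_scores = [0.1, 0.2, 0.3, 0.4] * 25   # scores at launch (toy data)
shifted_scores = [0.6, 0.7, 0.8, 0.9] * 25    # scores after a traffic shift (toy data)

no_drift = psi(baseline_scores, baseline_scores)
drift = psi(baseline_scores, shifted_scores)
```

Each term is non-negative, so PSI is zero only when the binned distributions match and grows as mass moves between bins.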

Worked example

For “Evaluate fraud classifier with cost-sensitive metrics,” a strong candidate would start by clarifying the action: are high-risk cases blocked automatically, sent to manual review, stepped up for verification, or merely downranked? They would ask what counts as fraud, how labels arrive, the delay in confirmation, the cost of a false positive to legitimate users, and the cost of a false negative to the platform. The answer can be organized into four pillars: label and data quality, offline model evaluation, threshold and decision policy, and online monitoring. For offline evaluation, they would not stop at ROC-AUC; they would emphasize PR-AUC if fraud is rare, precision/recall at operational thresholds, calibration curves, and segment-level false positive rates.

The candidate should define cost terms such as $C_{FP}$ for blocking a legitimate user, $C_{FN}$ for missed fraud, and $C_{review}$ for human review, then choose thresholds that minimize expected cost subject to capacity or safety constraints. A key tradeoff is whether to use a single global threshold or risk-tiered thresholds by transaction amount, account age, country, or seller history; tiering can improve utility but may create fairness and calibration concerns. They should also discuss delayed labels: recent “non-fraud” examples may simply not have matured, so evaluation should use a label window long enough to avoid optimistic estimates. They would close by saying that, with more time, they would run an online shadow test or limited ramp to compare modeled risk against actual downstream losses, user appeals, and support contacts.
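The delayed-label point can be illustrated with a small sketch that restricts evaluation to events old enough for fraud confirmations to have arrived. The 60-day maturation window, the field names, and the toy events are assumptions chosen for illustration, not a recommended setting.

```python
from datetime import date, timedelta

def matured(events, as_of, maturation_days):
    """Keep only events old enough that their labels have had time to arrive."""
    cutoff = as_of - timedelta(days=maturation_days)
    return [e for e in events if e["date"] <= cutoff]

def fraud_rate(events):
    return sum(e["fraud"] for e in events) / len(events)

as_of = date(2024, 6, 30)
events = [
    {"date": date(2024, 3, 1), "fraud": 1},
    {"date": date(2024, 3, 15), "fraud": 0},
    {"date": date(2024, 4, 1), "fraud": 1},
    {"date": date(2024, 4, 10), "fraud": 0},
    # Recent events: "fraud = 0" may only mean no chargeback has arrived yet.
    {"date": date(2024, 6, 20), "fraud": 0},
    {"date": date(2024, 6, 25), "fraud": 0},
]

naive_rate = fraud_rate(events)
matured_rate = fraud_rate(matured(events, as_of, maturation_days=60))
```

Including the immature recent events dilutes the observed fraud rate, which makes any model evaluated on them look better at avoiding false negatives than it is.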

A second angle

For “Optimize IG Shopping ranking with multiple objectives,” the same modeling-evaluation discipline applies, but the unit of decision is an ordered set of items rather than a binary action. Instead of choosing one fraud threshold, you are combining predicted purchase probability, long-term user satisfaction, seller quality, diversity, and integrity risk into a ranking objective. Offline metrics like NDCG@K or conversion lift are useful, but biased because historical exposure determines what outcomes were observed. A strong answer would introduce counterfactual evaluation, randomized exploration, and online A/B testing with guardrails such as user hides, seller concentration, refund rates, and low-quality purchase signals. The framing shifts from “minimize classification cost” to “maximize constrained expected utility under feedback loops.”

Common pitfalls

Pitfall: Optimizing for accuracy or ROC-AUC without connecting the metric to the decision.

This is especially weak for rare events like fraud or high-stakes targeting, where a model can achieve high accuracy by predicting the majority class. A better answer defines the action, the cost of each error, and the threshold or ranking policy that will be evaluated.

Pitfall: Treating observed labels as ground truth without discussing bias.

Clicks, conversions, fraud reports, and inferred place types are all partially observed and shaped by previous systems. Strong candidates explicitly call out delayed labels, missing positives, exposure bias, sample selection, and label noise, then propose practical mitigations like matured evaluation windows, audits, exploration data, or weak-supervision confidence scores.

Pitfall: Jumping to model architecture before framing the product and statistical problem.

Saying “I’d train XGBoost with many features” is not enough. Interviewers want to hear how you define success, prevent leakage, evaluate subgroups, calibrate probabilities, choose decision thresholds, and validate online impact.

Connections

This topic often pivots into experimentation, especially how to test a new targeting or ranking model with A/B metrics and guardrails. It also connects to causal inference for uplift modeling, metric design for multi-objective tradeoffs, and responsible AI for privacy, fairness, and subgroup reliability.
