Build and evaluate an order prediction model
Company: Airbnb
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
Train a model to predict whether a guest will complete an order within 7 days of their first session in the window.
1) Data/validation: propose a leakage-safe temporal split (train on 2025-06–2025-08 data, validate on 2025-08-24–2025-09-01), feature groups (recency/frequency, product interest, session quality, source/device), and handling of class imbalance (calibrated probabilities, focal loss or class weights).
2) Thresholding for business goals: require precision ≥0.80 while maximizing recall; describe how you’d pick a threshold using PR curves and calibration.
3) Confusion-matrix math (compute exactly): Assume 10,000 guests with 600 positives. At threshold t=0.70, TP=360, FP=140, FN=240, TN=9,260. Compute precision, recall, specificity, and F1. If t increases to 0.85 and counts become TP=300, FP=60, FN=300, TN=9,340, recompute the metrics and explain the monotonic changes.
4) Prevalence shift: Using TPR=360/600 and FPR=140/9,400 from t=0.70, if prevalence rises from 6% to 10% but class-conditional score distributions are unchanged and you keep the same threshold, quantify precision via Precision = (TPR·π)/(TPR·π + FPR·(1−π)) and state whether recall changes; interpret the result.
5) Label-window change: Re-evaluate the same predictions against a 30-day label where 200 additional guests become positive between days 8–30; 50 of those were predicted positive and 150 predicted negative. Update counts and recompute precision and recall; explain why one can increase while the other decreases.
6) Online deployment: outline a drift/guardrail plan (calibration monitoring, SRM-like traffic checks, canary with CUPED-adjusted lift) and how you’d adapt thresholds by segment without leaking information.
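The arithmetic in parts 3 to 5 can be checked with a short script. This is a minimal sketch in plain Python, not a reference solution: the `Counts` class and the `precision_at_prevalence` and `relabel` helpers are illustrative names that do not appear in the question. Part 5 does not say which threshold the 30-day re-evaluation uses, so the sketch assumes the t=0.70 counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts at one threshold (illustrative helper)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:  # same quantity as the true-positive rate
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:  # false-positive rate
        return self.fp / (self.fp + self.tn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


def precision_at_prevalence(tpr: float, fpr: float, pi: float) -> float:
    """Precision implied by fixed TPR/FPR at prevalence pi (part 4 formula)."""
    return tpr * pi / (tpr * pi + fpr * (1 - pi))


def relabel(c: Counts, new_pos_pred_pos: int, new_pos_pred_neg: int) -> Counts:
    """Move newly positive guests: FP -> TP and TN -> FN (part 5)."""
    return Counts(
        tp=c.tp + new_pos_pred_pos,
        fp=c.fp - new_pos_pred_pos,
        fn=c.fn + new_pos_pred_neg,
        tn=c.tn - new_pos_pred_neg,
    )


t070 = Counts(tp=360, fp=140, fn=240, tn=9260)   # part 3, t = 0.70
t085 = Counts(tp=300, fp=60, fn=300, tn=9340)    # part 3, t = 0.85
shifted_precision = precision_at_prevalence(t070.recall, t070.fpr, 0.10)
day30 = relabel(t070, 50, 150)                   # assumption: t = 0.70 counts

for name, c in [("t=0.70", t070), ("t=0.85", t085), ("30-day", day30)]:
    print(f"{name}: precision={c.precision:.4f} recall={c.recall:.4f} "
          f"specificity={c.specificity:.4f} f1={c.f1:.4f}")
print(f"precision at 10% prevalence: {shifted_precision:.4f}")
```

At t=0.70 this gives precision 0.72, recall 0.60, specificity ≈0.985, and F1 ≈0.655; at t=0.85, precision ≈0.833, recall 0.50, specificity ≈0.994, and F1 0.625. Raising the threshold can only shrink the set of predicted positives, so recall cannot rise and specificity cannot fall; precision usually rises but is not guaranteed to. At 10% prevalence precision becomes 564/690 ≈0.817 while recall stays at 0.60, because recall depends only on the positive class's score distribution. Under the 30-day label the counts become TP=410, FP=90, FN=390, TN=9,110: precision rises to 0.82 because 50 former false positives are now correct, while recall falls to ≈0.513 because the positive pool grew by 200 but only 50 of them were caught.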
Quick Answer: This question evaluates a data scientist's ability to build and evaluate a binary classification model under temporal constraints and operational requirements. It covers leakage-safe temporal validation, feature groups, class-imbalance handling, threshold selection for business precision/recall targets, metric reasoning under prevalence and label-window shifts, and drift monitoring in deployment. It is commonly asked to assess whether a candidate can design robust, leakage-free evaluation pipelines and translate business requirements into measurable model thresholds and monitoring plans, testing both conceptual understanding of evaluation and data shift and practical application in model validation and deployment.
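The threshold-selection step (precision ≥0.80 while maximizing recall) can be sketched as a sweep over candidate thresholds on held-out validation scores. The following is one possible approach in plain Python, not a prescribed method: `pick_threshold` is a hypothetical helper, and the scores and labels are made-up toy data. It assumes a guest is predicted positive when their score is at or above the threshold, and that the scores come from a validation window the model never trained on.

```python
from typing import Optional, Sequence, Tuple


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    min_precision: float = 0.80,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision meets min_precision, or None if none does."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    # Walk guests from highest to lowest score, accumulating TP and FP.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate once per distinct score so tied scores share one side.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (score, precision, recall)
    return best


# Toy validation data (fabricated for illustration only).
val_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
val_labels = [1, 1, 1, 0, 1, 0, 0, 1]
chosen = pick_threshold(val_scores, val_labels, min_precision=0.80)
print(chosen)
```

On the toy data the sweep selects a threshold of 0.60, where precision and recall are both 0.80. In practice the chosen threshold is only as trustworthy as the validation sample, so it is common to leave a margin above the precision floor, and to recheck it whenever calibration or prevalence drifts.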