PracHub

Build and evaluate an order prediction model

Last updated: May 3, 2026

Quick Overview

This question tests a data scientist's ability to build and evaluate a binary classifier under temporal constraints and operational requirements. It covers leakage-safe temporal validation, feature groups, class-imbalance handling, threshold selection against business precision/recall targets, metric reasoning under prevalence and label-window shifts, and drift monitoring after deployment. Interviewers use it to assess whether a candidate can design a leakage-free evaluation pipeline and translate business requirements into measurable thresholds and monitoring plans.


Company: Airbnb

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen



Oct 13, 2025, 9:49 PM

Predict 7-Day Order Completion from First Session

You are building a binary classifier to predict whether a guest will complete an order within 7 days of their first session in an evaluation window. The index time is the guest's first session, and the label is whether an order occurs within 7 days of that session. Assume you have complete event logs and can construct features up to (but not after) the index time.
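One way to make the label definition and the "fully matured" requirement concrete is sketched below. It is a minimal illustration rather than a prescribed solution: the helper names (`make_label`, `visible_events`, `is_mature`, `in_train`), the inclusive boundary convention, and the choice to purge training rows whose 7-day label window crosses the validation start are all assumptions.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(days=7)  # label horizon stated in the problem


def make_label(index_time, order_times):
    """Return 1 if any order falls within HORIZON of the index time.

    Assumption: both window boundaries are inclusive.
    """
    return int(any(index_time <= t <= index_time + HORIZON for t in order_times))


def visible_events(event_times, index_time):
    """Keep only events at or before the index time (no look-ahead features)."""
    return [t for t in event_times if t <= index_time]


def is_mature(index_time, data_cutoff):
    """A label is mature once the full horizon has been observed."""
    return index_time + HORIZON <= data_cutoff


def in_train(index_time, train_start, val_start):
    """Assumed purge rule: a training row's label window must close
    no later than the first validation timestamp."""
    return train_start <= index_time and index_time + HORIZON <= val_start


train_start = datetime(2025, 6, 1)
val_start = datetime(2025, 8, 24)

# Under this rule the last usable training index date is 2025-08-17.
last_ok = in_train(datetime(2025, 8, 17), train_start, val_start)
first_purged = in_train(datetime(2025, 8, 18), train_start, val_start)
print(last_ok, first_purged)
```

Under this assumed rule, guests whose first session falls between 2025-08-18 and 2025-08-23 are dropped from training, and validation labels for a first session on 2025-09-01 are only usable once data through 2025-09-08 is available.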

Tasks

  1. Data and validation design
    • Propose a leakage-safe temporal split using: train on 2025-06–2025-08 data, validate on 2025-08-24–2025-09-01. Ensure labels are fully matured for the 7-day horizon and that there is no time leakage.
    • List feature groups you would build (recency/frequency, product interest, session quality, source/device) with examples.
    • Describe how you would handle class imbalance (e.g., calibrated probabilities, focal loss or class weights) and which metrics you would optimize.
  2. Thresholding for business goals
    • The business requires precision ≥ 0.80 while maximizing recall. Describe how you would pick a probability threshold using PR curves and calibration.
  3. Confusion-matrix math
    • Assume 10,000 guests with 600 positives. At threshold t = 0.70, TP = 360, FP = 140, FN = 240, TN = 9,260. Compute precision, recall, specificity, and F1.
    • If t increases to 0.85 and counts become TP = 300, FP = 60, FN = 300, TN = 9,340, recompute the same metrics and explain the monotonic changes.
  4. Prevalence shift
    • Using TPR = 360/600 and FPR = 140/9,400 from t = 0.70, suppose prevalence rises from 6% to π = 10% while the class-conditional score distributions are unchanged and you keep the same threshold. Quantify precision via Precision = (TPR·π) / (TPR·π + FPR·(1 − π)) and state whether recall changes. Interpret the result.
  5. Label-window change
    • Re-evaluate the same predictions against a 30-day label where 200 additional guests become positive between days 8–30; 50 of those were predicted positive and 150 predicted negative. Update counts and recompute precision and recall; explain why one can increase while the other decreases.
  6. Online deployment
    • Outline a drift/guardrail plan (calibration monitoring, SRM-like traffic checks, canary with CUPED-adjusted lift) and how you’d adapt thresholds by segment without leaking information.
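For task 2, a threshold sweep over the validation scores is one plausible way to read the operating point off the PR curve. The sketch below is an assumption-laden illustration: `pick_threshold` is a hypothetical helper, the scores and labels are toy values invented for the example, and a guest is flagged when the score is at or above the threshold. It does not check calibration, which would need a separate reliability analysis on held-out data.

```python
def pick_threshold(scores, labels, min_precision=0.80):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold qualifies. Predictions are positive when score >= threshold."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    tp = fp = 0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Tied scores are flagged together, so consume the whole tie group.
        while i < len(pairs) and pairs[i][0] == t:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best


# Toy validation data (hypothetical, for illustration only).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
toy_labels = [1, 1, 0, 1, 1, 0, 1, 0]
chosen = pick_threshold(toy_scores, toy_labels, min_precision=0.80)
print(chosen)
```

On the toy data the sweep settles on a threshold of 0.60, where precision and recall are both 0.80. Lower thresholds raise recall but push precision below the floor. In practice the threshold would be chosen on a held-out validation window and the precision estimate given a confidence interval, since a point estimate sitting exactly on 0.80 leaves no margin.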
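The arithmetic in tasks 3 to 5 can be checked with a short script. This is a sketch under stated assumptions: the `Counts` container and the function names are illustrative, and the 30-day relabeling is applied to the t = 0.70 predictions, which the task does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int
    fp: int
    fn: int
    tn: int


def metrics(c):
    """Precision, recall, specificity, and F1 from confusion-matrix counts."""
    return {
        "precision": c.tp / (c.tp + c.fp),
        "recall": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


def precision_at_prevalence(tpr, fpr, pi):
    """Precision implied by fixed TPR and FPR at prevalence pi."""
    return (tpr * pi) / (tpr * pi + fpr * (1 - pi))


def relabel(c, fp_to_tp, tn_to_fn):
    """Move newly positive guests: predicted-positive ones go FP -> TP,
    predicted-negative ones go TN -> FN. Predictions are unchanged."""
    return Counts(c.tp + fp_to_tp, c.fp - fp_to_tp,
                  c.fn + tn_to_fn, c.tn - tn_to_fn)


at_070 = Counts(tp=360, fp=140, fn=240, tn=9260)
at_085 = Counts(tp=300, fp=60, fn=300, tn=9340)
m_070 = metrics(at_070)
m_085 = metrics(at_085)

# Task 4: same threshold, prevalence 10%, class-conditional rates fixed.
shifted_precision = precision_at_prevalence(tpr=360 / 600, fpr=140 / 9400, pi=0.10)

# Task 5: 200 late positives, 50 predicted positive and 150 predicted negative.
at_070_30d = relabel(at_070, fp_to_tp=50, tn_to_fn=150)
m_30d = metrics(at_070_30d)

for name, m in [("t=0.70", m_070), ("t=0.85", m_085), ("30-day label", m_30d)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
print("precision at pi=0.10:", round(shifted_precision, 4))
```

At t = 0.70 this gives precision 0.72, recall 0.60, specificity ≈ 0.9851, and F1 ≈ 0.6545. At t = 0.85 it gives precision ≈ 0.8333, recall 0.50, specificity ≈ 0.9936, and F1 = 0.625. Raising the threshold shrinks the flagged set, so recall can only fall or stay flat and specificity can only rise or stay flat. Precision rises here, although that is typical rather than guaranteed.

With prevalence at 10%, precision rises to 564/690 ≈ 0.8174 while recall stays at 0.60, because recall depends only on the positive-class score distribution. Under the 30-day label the counts become TP = 410, FP = 90, FN = 390, TN = 9,110, so precision rises to 0.82 while recall falls to 410/800 = 0.5125. Some former false positives turn out to be correct, but most late converters were never flagged.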

