Build and assess CTR prediction

This is a Machine Learning interview question from Uber for Data Scientist roles.

Question

CTR Prediction with Delayed Feedback and Extreme Class Imbalance

You are building a model to predict the probability that an ad impression results in a click within 24 hours. The base positive rate is approximately 0.7%.

Available features:

- user_age, device_type, locale, time_of_day
- ad_id (high-cardinality), campaign_id (high-cardinality)
- past_7d_impressions, past_7d_clicks
- referrer

Labels are delayed: some clicks arrive up to 24 hours after the impression.

Tasks

Modeling
- Propose two model families suitable for extreme class imbalance and sparse/high-cardinality features.
- Explain how you will encode ad_id/campaign_id without leakage.
- Describe a time-based cross-validation scheme that respects the 24-hour label delay.
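
One way to picture the time-based scheme asked for above is forward-chaining folds with a gap equal to the label delay between the end of training and the start of validation. The sketch below is only an illustration under stated assumptions: the function name `delayed_time_splits`, the fold count, and the two-day validation span are hypothetical choices, not part of the question. It also assumes `end` is at least 24 hours in the past, so that validation labels are themselves complete.

```python
from datetime import datetime, timedelta

# Assumption taken from the question: clicks can arrive up to 24h after the impression.
LABEL_DELAY = timedelta(hours=24)


def delayed_time_splits(start, end, n_folds, val_span):
    """Return forward-chaining folds as (train_end, val_start, val_end).

    Training uses impressions with timestamp < train_end.
    Validation uses impressions in [val_start, val_end).
    The gap val_start - train_end equals LABEL_DELAY, so every training
    label is final before the validation window opens.
    """
    folds = []
    val_end = end
    for _ in range(n_folds):
        val_start = val_end - val_span
        train_end = val_start - LABEL_DELAY
        if train_end <= start:
            break  # not enough history left for another fold
        folds.append((train_end, val_start, val_end))
        val_end = val_start
    return list(reversed(folds))  # oldest fold first


# Hypothetical two-week log window, three folds, two-day validation spans.
folds = delayed_time_splits(
    start=datetime(2024, 1, 1),
    end=datetime(2024, 1, 15),
    n_folds=3,
    val_span=timedelta(days=2),
)
for train_end, val_start, val_end in folds:
    print(f"train < {train_end:%m-%d} | validate {val_start:%m-%d} to {val_end:%m-%d}")
```

Any statistic used to encode ad_id or campaign_id (for example a historical click rate) would be computed only from impressions before `train_end` of the same fold, which keeps the encoding on the same side of the gap as the training data.
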
Imbalance Handling
- Compare class weighting, focal loss, undersampling, and calibrated thresholding.
- When would you avoid synthetic oversampling? Justify based on expected effects on ranking vs calibration.
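
The calibration side of the undersampling trade-off can be shown with a small worked example. If negatives are kept with probability `keep_rate` and positives are all kept, the odds seen by the model are inflated by a factor of `1 / keep_rate`, so predicted probabilities need to be mapped back before they are used as click probabilities. The sketch below assumes uniform random undersampling of negatives only; the helper name and the 10% keep rate are illustrative.

```python
def correct_for_negative_downsampling(p_sampled, keep_rate):
    """Map a probability learned on downsampled negatives to the original scale.

    Assumes positives were all kept and negatives were kept uniformly at
    random with probability keep_rate. Under that assumption
    odds_true = keep_rate * odds_sampled, which rearranges to the formula below.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be a probability")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# Base rate from the question (0.7%) and a hypothetical 10% negative keep rate.
BASE_RATE = 0.007
KEEP_RATE = 0.1

# Positive rate the model sees after undersampling.
sampled_rate = BASE_RATE / (BASE_RATE + (1.0 - BASE_RATE) * KEEP_RATE)

# Applying the correction recovers the original base rate.
recovered_rate = correct_for_negative_downsampling(sampled_rate, KEEP_RATE)
print(f"sampled rate:   {sampled_rate:.4f}")
print(f"recovered rate: {recovered_rate:.4f}")
```

The correction is monotonic, so it changes calibration but not the ranking of impressions. Synthetic oversampling has no equally simple closed-form correction, which is one reason it tends to be harder to justify when calibrated probabilities matter.
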
Evaluation
- Model A: ROC-AUC = 0.91, PR-AUC = 0.14. Model B: ROC-AUC = 0.88, PR-AUC = 0.22.
- Explain why these can disagree at 0.7% prevalence, which metric you trust for email/ad CTR, and how to choose operating thresholds using a cost matrix (missed-click vs wasted impression).
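
The gap between the two metrics is easier to see with arithmetic. A single ROC operating point (TPR, FPR) implies a precision that depends on prevalence, and at 0.7% prevalence even a small false-positive rate produces far more false positives than true positives. The sketch below uses an illustrative operating point (TPR 0.8, FPR 0.1) that is not taken from Model A or Model B, and hypothetical costs of 1 per wasted impression and 49 per missed click. The threshold rule assumes calibrated probabilities and zero cost for correct decisions.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by one ROC operating point at a given prevalence."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    if true_pos + false_pos == 0.0:
        raise ValueError("no predicted positives at this operating point")
    return true_pos / (true_pos + false_pos)


def cost_optimal_threshold(cost_wasted_impression, cost_missed_click):
    """Probability above which showing the ad has lower expected cost.

    Show when p * cost_missed_click > (1 - p) * cost_wasted_impression,
    which rearranges to p > c_fp / (c_fp + c_fn).
    """
    return cost_wasted_impression / (cost_wasted_impression + cost_missed_click)


# Same ROC point, two prevalences: the ROC curve is unchanged, precision is not.
precision_rare = precision_from_rates(tpr=0.8, fpr=0.1, prevalence=0.007)
precision_balanced = precision_from_rates(tpr=0.8, fpr=0.1, prevalence=0.5)
print(f"precision at 0.7% prevalence: {precision_rare:.3f}")
print(f"precision at 50% prevalence:  {precision_balanced:.3f}")

# Hypothetical cost matrix: wasted impression costs 1, missed click costs 49.
threshold = cost_optimal_threshold(cost_wasted_impression=1.0, cost_missed_click=49.0)
print(f"cost-optimal threshold: {threshold:.3f}")
```

Because ROC-AUC averages over the whole FPR range while PR-AUC is dominated by the low-FPR region where precision is non-trivial, a model can win on one and lose on the other.
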
Calibration and Thresholds
- Describe how to assess and improve calibration (e.g., isotonic vs Platt) and select thresholds for: a) maximizing F1, and b) maximizing expected profit.
- How would you compute precision@top1% and compare models on that metric?
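
A minimal sketch of precision@top1% is given below: rank impressions by score, keep the top 1%, and take the fraction that were clicked. The function name, the rounding rule for the cutoff, and the toy data are assumptions made for illustration. Ties at the cutoff are broken by input order here, which a production implementation may want to handle differently.

```python
def precision_at_top_fraction(scores, labels, fraction=0.01):
    """Fraction of clicked impressions among the top-scoring `fraction`."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(scores)))  # at least one impression
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy holdout of 200 impressions: the top 1% is the 2 highest-scoring rows.
scores = [i / 200 for i in range(200)]
labels = [0] * 200
labels[199] = 1  # highest-scoring impression was clicked
labels[150] = 1  # a click outside the top 1%

p_at_1pct = precision_at_top_fraction(scores, labels, fraction=0.01)
print(f"precision@top1%: {p_at_1pct:.2f}")
```

To compare two models on this metric, both would be scored on the same delay-complete holdout. Since 1% of a holdout contains few positives at 0.7% prevalence, an interval estimate (for example from a paired bootstrap over impressions) is likely more informative than a single point difference.
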
Online Validation
- Outline a bucket test (A/B) to validate lift using the model’s scores (e.g., top-k targeting).
- What logs do you need to detect covariate drift and label delay in production, and how do you guard against feedback loops?
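
One common way to turn logged feature values into a drift signal is the population stability index (PSI), which compares the distribution of a feature at training time with its distribution in production. PSI is not named in the question and is used here only as an example of a drift statistic. The sketch assumes a categorical feature such as device_type; the function name, the smoothing constant `eps`, and the toy samples are illustrative.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two samples of a categorical feature.

    PSI = sum over categories of (a - e) * ln(a / e), where e and a are the
    category shares in the expected (training) and actual (production) samples.
    Shares are floored at eps so unseen categories do not cause log(0).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for category in set(exp_counts) | set(act_counts):
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Toy device_type samples: training traffic vs. a shifted production day.
train_devices = ["ios"] * 50 + ["android"] * 50
prod_devices = ["ios"] * 80 + ["android"] * 20

psi_same = population_stability_index(train_devices, train_devices)
psi_shifted = population_stability_index(train_devices, prod_devices)
print(f"PSI, no shift: {psi_same:.4f}")
print(f"PSI, shifted:  {psi_shifted:.4f}")
```

Computing this requires logging the raw feature values served at prediction time alongside the impression timestamp. Label delay can be monitored separately by logging the click timestamp and tracking the distribution of click time minus impression time.
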
