CTR Prediction with Delayed Feedback and Extreme Class Imbalance
You are building a model to predict the probability that an ad impression results in a click within 24 hours. The base positive rate is approximately 0.7%.
Available features:
- user_age, device_type, locale, time_of_day
- ad_id (high-cardinality), campaign_id (high-cardinality)
- past_7d_impressions, past_7d_clicks
- referrer
Labels are delayed: some clicks arrive up to 24 hours after the impression.
Tasks
- Modeling
  - Propose two model families suitable for extreme class imbalance and sparse/high-cardinality features.
  - Explain how you will encode ad_id/campaign_id without leakage.
  - Describe a time-based cross-validation scheme that respects the 24-hour label delay.
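One way to approach the time-based split is forward chaining with an embargo equal to the label delay. The sketch below is a minimal illustration, not a prescribed answer: the function name `delayed_label_folds`, the `valid_span` parameter, and the fold layout are assumptions introduced here; only the 24-hour delay comes from the problem statement. It also assumes `end` lies at least 24 hours before the data snapshot, so that validation labels are themselves final.

```python
from datetime import datetime, timedelta

LABEL_DELAY = timedelta(hours=24)  # label delay stated in the problem


def delayed_label_folds(start, end, n_folds, valid_span):
    """Return forward-chaining folds as (train_end, valid_start, valid_end).

    Hypothetical helper. Training uses impressions with timestamp < train_end;
    validation uses impressions in [valid_start, valid_end). The gap between
    train_end and valid_start equals LABEL_DELAY, so every training label is
    final by the time the validation window opens.
    """
    folds = []
    valid_end = end
    for _ in range(n_folds):
        valid_start = valid_end - valid_span
        train_end = valid_start - LABEL_DELAY
        if train_end <= start:
            break  # not enough history left for another fold
        folds.append((train_end, valid_start, valid_end))
        valid_end = valid_start
    return list(reversed(folds))  # oldest fold first


folds = delayed_label_folds(
    start=datetime(2024, 1, 1),
    end=datetime(2024, 1, 15),
    n_folds=3,
    valid_span=timedelta(days=2),
)
for train_end, valid_start, valid_end in folds:
    print(train_end, valid_start, valid_end)
```

Under this layout, any statistics used to encode ad_id or campaign_id (for example click-rate encodings) would be computed from the training window of each fold only, which is one way to keep the encoding free of leakage.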
- Imbalance Handling
  - Compare class weighting, focal loss, undersampling, and calibrated thresholding.
  - When would you avoid synthetic oversampling? Justify based on expected effects on ranking vs calibration.
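The ranking vs calibration distinction can be made concrete for negative undersampling. If only a fraction of negatives is kept, the odds the model learns are inflated by a constant factor, which leaves the ordering of scores unchanged but biases the probabilities upward. The sketch below shows the standard odds-based correction; the function name and the `keep_rate` parameter are assumptions introduced for illustration, and it presumes negatives were dropped uniformly at random.

```python
def correct_for_negative_sampling(p_sampled, keep_rate):
    """Map a probability learned on negative-downsampled data to the original scale.

    Hypothetical helper. keep_rate is the fraction of negatives retained
    (0 < keep_rate <= 1). Downsampling multiplies the odds by 1 / keep_rate,
    so the correction multiplies them by keep_rate again.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# A score of 0.5 from a model trained with 1% of negatives kept
# corresponds to a much smaller probability on the original data.
corrected = correct_for_negative_sampling(0.5, keep_rate=0.01)
print(corrected)
```

The mapping is strictly increasing in `p_sampled`, so ranking metrics are unaffected by it; only calibration changes. Synthetic oversampling has no equally simple closed-form correction, which is one consideration when answering the second question.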
- Evaluation
  - Model A: ROC-AUC = 0.91, PR-AUC = 0.14. Model B: ROC-AUC = 0.88, PR-AUC = 0.22.
  - Explain why these can disagree at 0.7% prevalence, which metric you trust for email/ad CTR, and how to choose operating thresholds using a cost matrix (missed-click vs wasted impression).
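A short piece of arithmetic helps with both parts of the Evaluation question. The sketch below assumes illustrative operating points (a true-positive rate of 0.5 and false-positive rates of 0.01 and 0.02) and illustrative costs of 1 and 99; none of these numbers come from the problem, and both function names are hypothetical. The threshold rule additionally assumes calibrated probabilities and zero cost for correct decisions.

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by a ROC operating point (tpr, fpr) at a given prevalence."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)


def cost_optimal_threshold(cost_wasted_impression, cost_missed_click):
    """Probability above which acting has lower expected cost than not acting.

    Act when p * cost_missed_click > (1 - p) * cost_wasted_impression,
    which rearranges to p > c_wasted / (c_wasted + c_missed).
    """
    return cost_wasted_impression / (cost_wasted_impression + cost_missed_click)


PREVALENCE = 0.007  # base positive rate from the problem statement

# Same recall, false-positive rate moved by one percentage point:
p_low_fpr = precision_from_rates(PREVALENCE, tpr=0.5, fpr=0.01)   # about 0.26
p_high_fpr = precision_from_rates(PREVALENCE, tpr=0.5, fpr=0.02)  # about 0.15
print(p_low_fpr, p_high_fpr)

# Assumed costs: a wasted impression costs 1, a missed click costs 99.
threshold = cost_optimal_threshold(1.0, 99.0)
print(threshold)
```

A one-point shift in false-positive rate is a small move along the ROC curve, yet at 0.7% prevalence it changes precision substantially, which is one reason the two summary metrics can order models differently.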
- Calibration and Thresholds
  - Describe how to assess and improve calibration (e.g., isotonic vs Platt) and select thresholds for:
    a) maximizing F1, and
    b) maximizing expected profit.
  - How would you compute precision@top1% and compare models on that metric?
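For the precision@top1% question, a minimal sketch follows. It assumes scores and binary labels are available as parallel lists on a held-out window with final labels; the function name, the rounding rule for the cutoff, and the toy data are assumptions made for illustration.

```python
def precision_at_top_fraction(scores, labels, fraction=0.01):
    """Share of positives among the highest-scored `fraction` of rows.

    Hypothetical helper. The cutoff k is fraction * n rounded to the nearest
    integer, with a minimum of 1. Ties at the cutoff are broken by input
    order, because Python's sort is stable.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("empty input")
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data: 200 rows, so the top 1% is the 2 highest-scored rows.
toy_scores = [i / 200 for i in range(200)]
toy_labels = [0] * 200
toy_labels[199] = 1  # highest score is a click
toy_labels[0] = 1    # lowest score is also a click, outside the top 1%
p_at_1pct = precision_at_top_fraction(toy_scores, toy_labels)
print(p_at_1pct)
```

To compare two models, both would be scored on the same held-out rows and the resulting values compared, ideally with an uncertainty estimate (for example from resampling the rows), since the top 1% of a low-prevalence dataset contains few positives.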
- Online Validation
  - Outline a bucket test (A/B) to validate lift using the model’s scores (e.g., top-k targeting).
  - What logs do you need to detect covariate drift and label delay in production, and how do you guard against feedback loops?
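One common way to quantify covariate drift from logged feature values is the population stability index (PSI), shown below for a categorical feature such as device_type. This is a sketch of one possible drift check, not the only option; the function name, the `eps` floor for unseen categories, and the toy samples are assumptions introduced here.

```python
import math
from collections import Counter


def population_stability_index(reference, current, eps=1e-6):
    """PSI between two samples of a categorical feature.

    Hypothetical helper. PSI = sum over categories of
    (cur_share - ref_share) * ln(cur_share / ref_share).
    Shares are floored at eps so categories missing from one sample
    do not produce a division by zero or log of zero.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    psi = 0.0
    for category in set(ref_counts) | set(cur_counts):
        ref_share = max(ref_counts[category] / len(reference), eps)
        cur_share = max(cur_counts[category] / len(current), eps)
        psi += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return psi


# Toy samples: training-time mix vs a shifted production mix.
train_devices = ["mobile"] * 70 + ["desktop"] * 30
prod_devices = ["mobile"] * 40 + ["desktop"] * 60
psi_same = population_stability_index(train_devices, train_devices)
psi_shifted = population_stability_index(train_devices, prod_devices)
print(psi_same, psi_shifted)
```

Computing this requires that the feature values seen at serving time are logged alongside each impression. Detecting label delay similarly requires logging both the impression timestamp and the click arrival timestamp.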