Design a CVR model for RTB bidding
Company: The Trade Desk
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
##### Question
You are a data scientist at a Demand-Side Platform (DSP) such as The Trade Desk, participating in **Real-Time Bidding (RTB)**. For each ad opportunity (impression), your system must decide in tens of milliseconds **whether to bid, how much to bid, and which creative/ad to show**. You are asked to design an end-to-end ML approach to predict **conversion probability (CVR)** for a campaign (e.g., “Nike shoes”).
Historical data you may have:
- **Impressions table**: `impression_id`, timestamp, user/device/context, publisher/app/site, geo, auction metadata, bid price, win/loss, etc.
- **Clicks table**: `impression_id`, click timestamp (optional, sparse)
- **Conversions table**: `impression_id` (or user-level attribution key), conversion timestamp/value (very sparse)
Address the following:
1. **RTB system understanding.** Explain what RTB is and the roles of the **advertiser**, the **ad exchange**, and the **DSP**. When an ad opportunity arrives, walk through what happens in milliseconds. How does the DSP decide whether to bid, how much to bid, and which ad/creative to show?
2. **Learning target.** Clearly define what “predict conversion over impression” means here. Choose an attribution window and a precise label definition. What is the prediction unit and time window (e.g., conversion within 7 days of impression)? How do you handle attribution rules (last-click vs view-through) and label delay/censoring?
3. **Feature engineering.** Propose a realistic RTB feature set that is available **at bid time**, organized across user/context, publisher/placement, device/geo/time, ad/creative, advertiser/campaign, frequency/recency, and historical aggregates. Discuss leakage risks.
4. **Model choice.** Choose a baseline and a production candidate — compare **logistic regression** vs **gradient-boosted trees (e.g., LightGBM)** in this setting and explain the tradeoffs.
5. **Loss function.** What loss would you train on and why? Explain why you would use **log loss / binary cross-entropy**, why MSE is not appropriate, and why AUC is not used as a training loss. How does the per-impression conversion label serve as the supervision signal, and how do loss functions relate to bidding decisions?
6. **Class imbalance.** Conversions are rare. Describe at least two ways to handle imbalance (e.g., class weighting, negative downsampling), when to use each, what preprocessing to avoid, and how these choices affect probability **calibration** (including how to correct for downsampling).
7. **Evaluation.** Define offline metrics and a validation scheme for CVR in a non-stationary ad-tech environment (PR-AUC, ROC-AUC, log loss, calibration). Explain why PR-AUC can be more informative than ROC-AUC and why calibration matters. How would you evaluate the model online, and what business metrics matter (e.g., CPA, ROAS, spend efficiency)?
8. **Precision/recall tradeoff in RTB.** How do false positives vs false negatives differ in cost? What is the F1 score, and why might it be a poor objective for ad-tech bidding? How would you use a PR curve to select an operating point?
9. **Scalability & production.** Discuss training vs inference scalability for LightGBM, the latency bottlenecks in the feature/model pipeline, and how you would deploy safely (shadow mode, ramp-up, rollback).
10. **Overfitting & robustness.** Why is overfitting common in CVR prediction, and how do you prevent it (regularization, early stopping, time-based validation, feature aggregation)? What monitoring and guardrails would you add for a live bidding system?
Provide a structured, end-to-end answer with explicit assumptions and tradeoffs.
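As one illustration of the label definition requested in item 2, here is a minimal sketch that assumes a 7-day post-impression window (the example window from the question) and a single attributed conversion timestamp per impression. The names `label_impression`, `ATTRIBUTION_WINDOW`, and `as_of` are hypothetical, and real attribution rules (last-click vs view-through) would sit upstream of this step.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: 7-day window, taken from the example in the question.
ATTRIBUTION_WINDOW = timedelta(days=7)


def label_impression(
    impression_ts: datetime,
    conversion_ts: Optional[datetime],
    as_of: datetime,
) -> Optional[int]:
    """Label one impression as of the data snapshot time `as_of`.

    Returns 1 if an attributed conversion fell inside the window,
    0 if the window has fully elapsed with no conversion inside it,
    and None if the window is still open (label is censored).
    """
    window_end = impression_ts + ATTRIBUTION_WINDOW
    if conversion_ts is not None and impression_ts <= conversion_ts <= window_end:
        return 1
    if as_of >= window_end:
        return 0
    # Window not yet closed: treating this row as a negative would bias CVR down.
    return None
```

Rows that return `None` are candidates to exclude from training or to handle with a delayed-feedback correction, rather than to count as negatives.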
Quick Answer: This Trade Desk data-science screen asks you to design an end-to-end conversion-rate (CVR) model for real-time bidding (RTB) at a DSP. It covers RTB system roles, label and attribution-window definition, bid-time feature engineering and leakage, logistic-regression vs LightGBM tradeoffs, log-loss training, class-imbalance handling with calibration, PR-AUC vs ROC-AUC evaluation, and latency-aware production deployment with a safe rollout.
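The downsampling correction mentioned in item 6 can be sketched as follows. This assumes all positives are kept and negatives are kept uniformly at random with probability `keep_rate`, so the odds learned on the sampled data are inflated by a factor of `1 / keep_rate`. The function name and parameter names are hypothetical.

```python
def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Map a probability predicted on negative-downsampled data back to the
    original data distribution.

    Assumption: positives are all kept; each negative is kept with
    probability `keep_rate`. Then true_odds = sampled_odds * keep_rate,
    which rearranges to the expression returned below.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be in [0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# A prediction of 0.5 on data where 1% of negatives were kept
# corresponds to roughly 0.0099 on the full traffic.
corrected = correct_for_negative_downsampling(0.5, 0.01)
```

Applying this correction before bid computation matters because the bid typically scales with the predicted probability, so an uncorrected score would overbid by about the same factor the odds were inflated.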