Design a CVR model for RTB bidding
Company: The Trade Desk
Role: Data Scientist
Category: Machine Learning
Difficulty: easy
Interview Round: Technical Screen
You are interviewing for a DSP (e.g., The Trade Desk). Answer the following end-to-end product + ML case about **real-time bidding (RTB)**.
## Part A — RTB system understanding
1. Explain what **RTB** is and the roles of:
- **Advertiser**
- **Ad Exchange**
- **DSP**
2. When an ad opportunity (impression) arrives, walk through what happens in milliseconds.
3. How does the DSP decide:
- whether to bid?
- how much to bid?
- which ad/creative to show?
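A minimal sketch of the "how much to bid" step under expected-value bidding, assuming the per-impression conversion probability from Part B and an advertiser CPA goal are available. All names and numbers are illustrative assumptions, not The Trade Desk's actual logic, and targeting/brand-safety filters (the "whether to bid" step) are omitted:

```python
def compute_bid_cpm(p_cvr_per_impression: float, target_cpa: float,
                    pacing_multiplier: float = 1.0) -> float:
    """Value an impression at P(conversion | impression) * value-per-conversion, expressed as a CPM bid."""
    value_per_impression = p_cvr_per_impression * target_cpa
    return value_per_impression * 1000.0 * pacing_multiplier  # per-impression value -> CPM

# Example: a 0.01% conversion-per-impression rate and a $50 CPA goal -> $5 CPM before pacing.
print(compute_bid_cpm(p_cvr_per_impression=1e-4, target_cpa=50.0))
```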
## Part B — Build a conversion-rate model (CVR)
You need a model that predicts the **probability of conversion** for a “Nike shoes” campaign, given an impression.
1. What training data would you use (e.g., impressions, clicks, conversions)? Define:
- What is a “conversion” event?
- What is the prediction target and time window (e.g., conversion within 7 days of impression)? A label-construction sketch follows this list.
2. What features would you engineer from historical ad data? Include examples across:
- user/context, publisher/placement, device/geo/time, ad/creative, advertiser/campaign, frequency/recency, historical aggregates (a feature-engineering sketch follows this list)
3. What model would you choose and why?
- Compare **logistic regression** vs **tree-based models (e.g., LightGBM)** in this setting. A minimal LightGBM training sketch follows this list.
4. Loss function & optimization:
- What loss would you train on and why (e.g., log loss / binary cross-entropy)?
- Why not MSE?
- Why isn’t AUC typically used as a training loss?
- What does “predicting conversion over impression” mean for supervision and labeling?
- How do loss functions relate to bidding decisions? (The log-loss and bid formulas are written out after this list.)
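For question 1, one way to construct labels is to join impressions to conversions within a 7-day attribution window. The table and column names (`impression_id`, `user_id`, `ts`, `conv_ts`) are assumptions:

```python
import pandas as pd

# Hypothetical schemas: impressions(impression_id, user_id, ts), conversions(user_id, conv_ts).
ATTRIBUTION_WINDOW = pd.Timedelta(days=7)

def label_impressions(impressions: pd.DataFrame, conversions: pd.DataFrame) -> pd.DataFrame:
    """Label = 1 if the same user converts within 7 days after the impression, else 0."""
    joined = impressions.merge(conversions, on="user_id", how="left")
    in_window = (joined["conv_ts"] >= joined["ts"]) & \
                (joined["conv_ts"] <= joined["ts"] + ATTRIBUTION_WINDOW)
    joined["label"] = in_window.astype(int)
    # A user can have several conversions; keep one row (the max label) per impression.
    # The result can then be merged back onto the impression/feature table by impression_id.
    return joined.groupby("impression_id", as_index=False)["label"].max()
```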
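For question 2, a small sketch of time, frequency, and recency features on an impression log; the column names are assumptions:

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative time, frequency, and recency features (column names are assumptions)."""
    df = df.sort_values("ts").copy()
    df["hour_of_day"] = df["ts"].dt.hour          # time-of-day context
    df["day_of_week"] = df["ts"].dt.dayofweek
    # Frequency: impressions this user has already seen for this campaign.
    df["user_campaign_freq"] = df.groupby(["user_id", "campaign_id"]).cumcount()
    # Recency: hours since this user's previous impression for this campaign.
    prev_ts = df.groupby(["user_id", "campaign_id"])["ts"].shift(1)
    df["hours_since_last_imp"] = (df["ts"] - prev_ts).dt.total_seconds() / 3600.0
    return df
```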
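For question 3, a minimal LightGBM training sketch; hyperparameters are illustrative rather than tuned, and it assumes feature matrices `X_train`/`X_valid` and 0/1 labels are already prepared:

```python
import lightgbm as lgb

def train_cvr_model(X_train, y_train, X_valid, y_valid):
    """Train a gradient-boosted tree CVR model that outputs probabilities via log loss."""
    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    params = {
        "objective": "binary",                  # optimizes binary log loss
        "metric": ["binary_logloss", "auc"],
        "learning_rate": 0.05,
        "num_leaves": 63,
        "feature_fraction": 0.8,
    }
    return lgb.train(
        params,
        train_set,
        num_boost_round=2000,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=50)],  # guards against overfitting
    )
```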
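For question 4, the binary cross-entropy (log loss) over N labeled impressions, and the way a calibrated predicted probability typically enters the bid, assuming a known value per conversion V:

```latex
\mathcal{L} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\,\Big],
\qquad
\text{bid}_i \;\approx\; \hat{p}_i \cdot V
```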
## Part C — Practical ML concerns
1. **Class imbalance**: Conversions are rare.
- When would you use class weighting vs negative downsampling?
- What preprocessing should you avoid?
- How can imbalance handling affect probability **calibration**? (A downsampling recalibration sketch follows this list.)
2. **Evaluation**:
- Offline: choose metrics (PR-AUC, ROC-AUC, log loss) and justify your choices. An offline-evaluation sketch follows this list.
- Explain why PR-AUC can be more informative than ROC-AUC.
- Why does calibration matter?
- Online: how would you evaluate the model in production? What business metrics matter (e.g., CPA, ROAS, spend efficiency)?
3. **Precision/recall tradeoff in RTB**:
- How do false positives vs false negatives differ in cost?
- What is the F1 score, and why might it be a poor objective for ad-tech bidding?
- How would you use a PR curve to select an operating point? (A threshold-selection sketch follows this list.)
4. **Scalability & production**:
- Discuss training vs inference scalability for LightGBM.
- RTB latency constraints: what parts of the feature/model pipeline are bottlenecks?
- How would you deploy safely (shadow mode, ramp-up, rollback)? A shadow-scoring/ramp-up sketch follows this list.
5. **Overfitting & robustness**:
- Why is overfitting common in CVR prediction?
- How do you prevent it (regularization, early stopping, time-based validation, feature aggregation)? A time-based split sketch follows this list.
- What monitoring and guardrails would you add for a bidding system?
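For Part C question 1, a sketch of the standard correction that maps probabilities learned on negatively downsampled data back to the full-traffic scale, assuming negatives were kept with probability w:

```python
def recalibrate(p_sampled: float, neg_keep_rate: float) -> float:
    """Undo negative downsampling: p = p_s / (p_s + (1 - p_s) / w), where w is the negative keep rate."""
    w = neg_keep_rate
    return p_sampled / (p_sampled + (1.0 - p_sampled) / w)

# Example: a model trained on 1% of negatives predicts 0.2 -> roughly 0.0025 at full scale.
print(recalibrate(0.2, 0.01))
```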
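For question 2, an offline-evaluation sketch using scikit-learn; it assumes 0/1 labels and predicted probabilities are available:

```python
from sklearn.calibration import calibration_curve
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score

def offline_report(y_true, y_pred):
    """Return the offline metrics discussed above plus reliability-curve points for calibration."""
    frac_positives, mean_predicted = calibration_curve(y_true, y_pred, n_bins=10)
    return {
        "pr_auc": average_precision_score(y_true, y_pred),  # sensitive to the rare positive class
        "roc_auc": roc_auc_score(y_true, y_pred),
        "log_loss": log_loss(y_true, y_pred),
        "calibration_curve": (mean_predicted, frac_positives),  # plot mean predicted vs observed rate
    }
```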
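For question 3, a sketch of picking an operating point from the PR curve under asymmetric costs; the minimum-precision floor is an illustrative stand-in for a real cost model:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_pred, min_precision=0.01):
    """Among thresholds that meet a minimum precision, return the one with the highest recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
    # precision/recall have one more entry than thresholds; drop the final (1.0, 0.0) point.
    meets_floor = precision[:-1] >= min_precision
    if not meets_floor.any():
        return None
    candidate_recalls = np.where(meets_floor, recall[:-1], -1.0)
    return thresholds[int(np.argmax(candidate_recalls))]
```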
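For question 4, a sketch of shadow scoring plus a small live ramp; the model interface and logging are hypothetical placeholders:

```python
import logging
import random

logger = logging.getLogger("cvr_ramp")

def score_request(request, current_model, candidate_model, live_fraction=0.05):
    """Always score the candidate in shadow mode; use it live only for a small ramped slice of traffic."""
    shadow_score = candidate_model.predict(request)
    logger.info("shadow_score=%s", shadow_score)   # logged for offline comparison, never acted on
    if random.random() < live_fraction:            # ramp up gradually; set to 0.0 to roll back
        return shadow_score
    return current_model.predict(request)
```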
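For question 5, a sketch of a time-based split (train on older traffic, validate on the most recent days), which mirrors how the model is actually used in production; the timestamp column name is an assumption:

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, ts_col: str = "ts", valid_days: int = 3):
    """Train on everything up to a cutoff, validate on the most recent days (no random shuffling)."""
    cutoff = df[ts_col].max() - pd.Timedelta(days=valid_days)
    return df[df[ts_col] <= cutoff], df[df[ts_col] > cutoff]
```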
Provide a structured, end-to-end answer with assumptions and tradeoffs.