Click-Through Rate Prediction
Asked of: Machine Learning Engineer
What's being tested
You must demonstrate end-to-end Machine Learning Engineering judgment for click-through rate (CTR) prediction: choosing appropriate probabilistic models, evaluation metrics, cross-validation and calibration strategies, and production concerns (serving, monitoring, drift). Interviewers probe both modeling choices (loss, class imbalance, feature transforms) and engineering tradeoffs (online/offline parity, latency, feature freshness) that determine whether a CTR model is safe and useful in production at scale for a platform like Reddit.
Core knowledge
-
Log loss (cross-entropy): an explicit probabilistic objective. It is a proper scoring rule, so it is sensitive to calibration and heavily penalizes confident predictions that turn out wrong.
-
Ranking vs probability objectives: optimize AUC-PR or AUC-ROC for ranking quality (AUC-PR preferred under heavy class imbalance), but optimize log loss or calibration for downstream decisioning and auctions. Know when objectives diverge.
-
Calibration: measured by Expected Calibration Error (ECE) and Brier score; fixes include Platt scaling (sigmoid) and isotonic regression; calibration must be re-learned on production-distributed scores and monitored over time.
-
Class imbalance tactics: use sample weighting, downsampling negatives, or focal loss; beware that downsampling changes predicted probabilities and requires re-weighting or importance weighting at evaluation and serving.
-
Feature engineering: categorical high-cardinality handling via feature hashing, learned embeddings, or target encoding with leakage controls; always create time-aware encodings and guard against future-looking signals.
-
Cross-validation for temporal data: use time-based (forward) CV or user-heldout folds; avoid random CV when temporal drift or session correlation exists. For user-level leakage, partition by user id.
-
Delayed labels and censoring: clicks are observed with delays (e.g., impressions that get clicks later); account for this via cut-off windows, survival modeling, or label-delay correction so that impressions whose clicks have not yet arrived are not trained on as false negatives.
-
Model families & scale: linear/logistic or shallow XGBoost/LightGBM trees for interpretable, low-latency scoring; deep models (wide & deep, attention) for richer signals but require embedding stores and higher latency. Quantify: single-node XGBoost up to tens of millions of rows; distributed training for 100M+.
-
Feature store & parity: use a feature store like Feast or in-house store to ensure online/offline parity, consistent joins, TTLs and freshness guarantees; cache precomputed features for p99 latency control.
-
Serving and latency tradeoffs: batching, quantization (e.g., 8-bit), model distillation, and feature caching reduce p99; track inference SLA (e.g., 10–50 ms tail) and measure end-to-end parity between offline metrics and online results.
-
Monitoring & drift detection: monitor input feature distributions (PSI), label distribution, prediction distribution, calibration drift, and business-aligned metrics; alert on sustained PSI increases or calibration worsening.
-
Hyperparameter tuning & validation: use random/Bayesian search (Optuna), early stopping on validation log loss, and holdout that reflects serving distribution; guard against test-set peeking and multiple-hypothesis overfitting.
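The PSI check named in the monitoring bullet above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the function name `psi`, the equal-width binning taken from the baseline sample, and the `eps` floor for empty bins are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a recent sample.

    Assumption: equal-width bins derived from the baseline's min and max.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp outliers into edge bins
            counts[idx] += 1
        # Floor each fraction at eps so the log below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

Identical samples give a PSI of zero, and the value grows as the recent sample moves away from the baseline. Alert thresholds are a team decision; a commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that should be validated against the feature's normal week-to-week variation.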
Worked example — Build and evaluate click prediction models
First 30 seconds: clarify the label definition (impression-level binary click, within what time window?), the sampling strategy (are negatives downsampled?), and the production constraints (latency, model update cadence, target objective: calibrated probability vs ranking). A strong answer structures the solution around three pillars: data & labels (label windowing, handling delayed clicks, deduping impressions), modeling & evaluation (baseline logistic regression, tree ensembles like XGBoost, loss choice: log loss for probabilistic accuracy, AUC-PR for ranking), and production deployment (feature store, online/offline parity, calibration re-fit). Call out a concrete tradeoff: optimizing log loss yields well-calibrated probabilities but might slightly reduce AUC-based ranking quality; choose based on whether downstream systems use the probabilities directly (e.g., an auction) or only the relative ranking. For validation, use time-forward CV and a user-heldout fold to prevent leakage. Finish by saying what you'd do with more time: run an online A/B test with counterfactual logging, build continuous calibration pipelines, and add drift detectors that retrain or roll back models automatically.
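To decide whether a calibration re-fit is needed, the calibration has to be measured first. The sketch below computes the Brier score and an Expected Calibration Error; the function names and the equal-width binning over [0, 1] are assumptions chosen for brevity, not a reference implementation.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_calibration_error(y_true: Sequence[int], p_pred: Sequence[float],
                               n_bins: int = 10) -> float:
    """Size-weighted average of |observed click rate - mean prediction| per bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        observed = sum(y for y, _ in bucket) / len(bucket)
        predicted = sum(p for _, p in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(observed - predicted)
    return ece


labels = [0, 0, 0, 1]
print(expected_calibration_error(labels, [0.25] * 4))  # predictions match the 25% rate
print(expected_calibration_error(labels, [0.75] * 4))  # overconfident predictions
```

With CTR-scale base rates, most predictions land in the lowest equal-width bin, so equal-mass (quantile) bins are often a more informative choice; that variant is left out here to keep the example short.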
A second angle — serving-constrained CTR at low latency
If the constraint is a strict p99 latency (e.g., 20ms) or on-device scoring, you shift emphasis: prefer compact models (logistic or shallow trees), aggressive feature precomputation and caching, and model compression/quantization or distillation. Offline evaluation still uses the same calibration and CV techniques, but you must include an extra pillar: model-size vs performance tradeoff measured by latency budgets and memory. Also prioritize feature-store guarantees (low read latency, consistent feature versions) and design a fallback scoring path for cache misses. The core ML engineering thinking (label correctness, calibration, validation) stays identical — only the implementation constraints and optimization knobs change.
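A fallback scoring path for cache misses might look like the sketch below. Everything specific in it is hypothetical: the feature names, the weights in `WEIGHTS`, the prior values in `DEFAULTS`, and the dict standing in for a feature cache are invented for illustration only.

```python
import math
from typing import Dict, Tuple

# Hypothetical logistic-regression weights and fallback feature values.
WEIGHTS = {"user_ctr_7d": 4.0, "ad_ctr_7d": 6.0}
BIAS = -3.0
DEFAULTS = {"user_ctr_7d": 0.02, "ad_ctr_7d": 0.02}


def score_with_fallback(user_id: str,
                        cache: Dict[str, Dict[str, float]]) -> Tuple[float, bool]:
    """Return (predicted CTR, used_fallback) without blocking on a cache miss."""
    cached = cache.get(user_id, {})
    # Flag the request if any feature had to be replaced by its default.
    used_fallback = any(name not in cached for name in DEFAULTS)
    features = {name: cached.get(name, default)
                for name, default in DEFAULTS.items()}
    logit = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-logit)), used_fallback


cache = {"u1": {"user_ctr_7d": 0.10, "ad_ctr_7d": 0.05}}
print(score_with_fallback("u1", cache))
print(score_with_fallback("unknown_user", cache))
```

Returning the `used_fallback` flag alongside the score lets the serving layer log the miss rate, which is one way to check that the fallback path stays rare enough not to distort online metrics.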
Common pitfalls
Pitfall: Evaluating with random cross-validation on temporally-evolving data. This produces over-optimistic metrics because future information leaks into training; prefer forward-time validation and user-heldout folds.
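Forward-time validation can be sketched as an expanding-window split, optionally combined with a user holdout. The helper names `forward_time_splits` and `drop_seen_users` are assumptions for this example; scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window splitter for data that is already sorted by time.

```python
from typing import Iterator, List, Sequence, Tuple


def forward_time_splits(timestamps: Sequence[float],
                        n_folds: int = 3) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx); no validation row is earlier than a training row."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    block = len(order) // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(1, n_folds + 1):
        train = order[: k * block]
        # The last fold absorbs any remainder rows.
        end = (k + 1) * block if k < n_folds else len(order)
        yield train, order[k * block: end]


def drop_seen_users(train_idx: Sequence[int], valid_idx: Sequence[int],
                    user_ids: Sequence[str]) -> List[int]:
    """Keep only validation rows whose user never appears in training."""
    seen = {user_ids[i] for i in train_idx}
    return [i for i in valid_idx if user_ids[i] not in seen]


timestamps = [5, 1, 4, 2, 3, 8, 7, 6]
for train_idx, valid_idx in forward_time_splits(timestamps):
    print(train_idx, valid_idx)
```

Each fold trains on a prefix of the timeline and validates on the block that follows it, which mirrors how the model is used in production: fit on the past, score the future.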
Pitfall: Downsampling negatives for training but reporting unadjusted probabilities. Downsampling alters base rates; you must correct predicted probabilities via importance weighting or fit a calibration layer on properly re-weighted validation data.
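One standard closed-form correction follows from the fact that keeping a fraction w of negatives divides the odds of a click by w in the training data. Mapping a prediction p made in the downsampled space back to the original space gives q = p / (p + (1 - p) / w). The sketch below assumes negatives were dropped uniformly at random; the function and parameter names are illustrative.

```python
def correct_for_negative_downsampling(p_sampled: float, keep_rate: float) -> float:
    """Map a probability learned on downsampled negatives back to the true scale.

    keep_rate is the fraction of negatives retained for training (0 < keep_rate <= 1).
    Assumes negatives were dropped uniformly at random.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)


# Worked check: a true CTR of 1% with 10% of negatives kept.
true_ctr, keep_rate = 0.01, 0.1
sampled_ctr = true_ctr / (true_ctr + (1 - true_ctr) * keep_rate)
print(sampled_ctr)                                             # base rate the model sees
print(correct_for_negative_downsampling(sampled_ctr, keep_rate))  # recovers about 0.01
```

In the worked check, the model trained on the downsampled data sees a base rate near 9%, and the correction brings it back to the original 1%. If the sampling was not uniform (for example, stratified by placement), per-row importance weights are needed instead of a single keep rate.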
Pitfall: Ignoring online/offline parity. A model trained on richer offline features that are unavailable online will perform unreliably in production; always design and test against production feature availability and staleness.
Connections
Interviewers may pivot to ranking and recommender evaluation (NDCG, position bias) or to online experimentation infrastructure (bucket allocation, logging for counterfactual policy evaluation). They might also ask about feature-store architectures and how they affect real-time scoring tradeoffs.
Further reading
-
XGBoost: A Scalable Tree Boosting System (Chen & Guestrin, 2016) — implementation and tradeoffs for tree ensembles used widely in CTR tasks.
-
Predicting Good Probabilities with Supervised Learning (Niculescu-Mizil & Caruana, 2005) — empirical comparisons of calibration methods and scoring rules.
Related concepts
- CTR And Engagement Metrics (Analytics & Experimentation)
- Cost-Sensitive Threshold Optimization (Statistics & Math)
- Ads Ranking And Monetization Analytics (Analytics & Experimentation)
- Classifier Evaluation, Calibration, And Thresholding
- Classification Thresholds, Imbalanced Learning And Risk (Machine Learning)
- Recommender Systems And Feed Ranking (Machine Learning)