Click-Through Rate Prediction

What's being tested

You must demonstrate end-to-end machine learning engineering judgment for click-through rate (CTR) prediction: choosing appropriate probabilistic models, evaluation metrics, and cross-validation and calibration strategies, and handling production concerns (serving, monitoring, drift). Interviewers probe both modeling choices (loss, class imbalance, feature transforms) and engineering tradeoffs (online/offline parity, latency, feature freshness) that determine whether a CTR model is safe and useful in production at scale for a platform like Reddit.

Core knowledge

Log loss (cross-entropy): explicit probabilistic objective $\text{LogLoss}=-\frac{1}{N}\sum_{i}[y_i\log \hat p_i+(1-y_i)\log(1-\hat p_i)].$ It's a proper scoring rule and sensitive to calibration and extreme probabilities.
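Ranking vs probability objectives: AUC-ROC and AUC-PR measure ranking quality (AUC-PR is more informative under heavy class imbalance), while log loss and calibration metrics measure probability quality, which matters for downstream decisioning and auctions. Know when the two diverge: a model can rank well yet be badly miscalibrated, because AUC is unchanged by any strictly monotone transform of the scores.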
Calibration: measured by Expected Calibration Error (ECE) and Brier score; fixes include Platt scaling (sigmoid) and isotonic regression; calibration must be re-learned on production-distributed scores and monitored over time.
Class imbalance tactics: use sample weighting, downsampling negatives, or focal loss; downsampling negatives inflates predicted probabilities, so correct them (re-weighting or importance weighting) before evaluating calibration or serving.
Feature engineering: handle high-cardinality categoricals via feature hashing, learned embeddings, or target encoding with leakage controls; compute encodings using only data available before each example's timestamp, and guard against future-looking signals.
Cross-validation for temporal data: use time-based (forward) CV or user-heldout folds; avoid random CV when temporal drift or session correlation exists. For user-level leakage, partition by user id.
Delayed labels and censoring: clicks are observed with a delay after the impression; account for this via cut-off (attribution) windows, survival modeling, or label-delay correction, so that recent impressions whose clicks have not yet arrived are not mislabeled as negatives.
Model families & scale: linear/logistic or shallow XGBoost/LightGBM trees for interpretable, low-latency scoring; deep models (wide & deep, attention) capture richer signals but require embedding stores and add latency. As a rough rule of thumb, single-node XGBoost handles up to tens of millions of rows; plan for distributed training at 100M+.
Feature store & parity: use a feature store like Feast or in-house store to ensure online/offline parity, consistent joins, TTLs and freshness guarantees; cache precomputed features for p99 latency control.
Serving and latency tradeoffs: batching, quantization (e.g., 8-bit), model distillation, and feature caching reduce p99; track inference SLA (e.g., 10–50ms tail) and measure end-to-end parity between offline metrics and online results.
Monitoring & drift detection: monitor input feature distributions (PSI), label distribution, prediction distribution, calibration drift, and business-aligned metrics; alert on sustained PSI increases or calibration worsening.
Hyperparameter tuning & validation: use random/Bayesian search (Optuna), early stopping on validation log loss, and holdout that reflects serving distribution; guard against test-set peeking and multiple-hypothesis overfitting.
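
The probability metrics named above (log loss, Brier score, ECE) can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the clipping constant `eps`, and the choice of ten equal-width bins for ECE are assumptions made for the example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width-bin ECE: count-weighted mean of |observed rate - mean prediction|."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece
```

With CTR data most predictions fall in the lowest bins, so equal-width bins can hide miscalibration; equal-frequency bins are a common alternative worth mentioning.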

Worked example — Build and evaluate click prediction models

First 30 seconds: clarify the label definition (impression-level binary click within what time window?), the sampling strategy (are negatives downsampled?), and the production constraints (latency, model update cadence, target objective: calibrated probability vs ranking). A strong answer structures the solution around three pillars: data & labels (label windowing, handling delayed clicks, deduping impressions), modeling & evaluation (baseline logistic, tree ensembles like XGBoost, and loss choice: log loss for probabilistic accuracy, AUC-PR for ranking), and production deployment (feature store, online/offline parity, calibration re-fit). Call out a concrete tradeoff: optimizing log loss yields well-calibrated probabilities but might slightly reduce AUC-based ranking; choose based on whether downstream systems use probabilities directly (e.g., auction) or only relative ranking. For validation, use time-forward CV and a user-heldout fold to prevent leakage. Finish by saying what you'd do with more time: run an online A/B test with counterfactual logging, build continuous calibration pipelines, and add drift detectors that retrain or roll back models automatically.
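
The data & labels pillar (label windowing, delayed clicks, deduping impressions) can be made concrete with a small sketch. The record shapes, the function name `build_labels`, and the parameters `window_s` and `as_of_ts` are hypothetical; the idea is only that an impression receives a final label once its click window has fully elapsed.

```python
def build_labels(impressions, clicks, window_s, as_of_ts):
    """Build impression-level click labels.

    impressions: iterable of (impression_id, ts) pairs, possibly with duplicates.
    clicks: iterable of (impression_id, ts) pairs.
    Returns {impression_id: 0 or 1} only for impressions whose click window
    closed on or before as_of_ts; still-open impressions are left out.
    """
    first_seen = {}
    for imp_id, ts in impressions:
        # Dedupe: keep the earliest logged record per impression id.
        if imp_id not in first_seen or ts < first_seen[imp_id]:
            first_seen[imp_id] = ts

    labels = {}
    for imp_id, ts in first_seen.items():
        if ts + window_s <= as_of_ts:  # window closed, so the label is final
            labels[imp_id] = 0

    for imp_id, click_ts in clicks:
        if imp_id in labels:
            imp_ts = first_seen[imp_id]
            # Count only clicks that land inside the attribution window.
            if imp_ts <= click_ts <= imp_ts + window_s:
                labels[imp_id] = 1
    return labels
```

Excluding impressions whose window is still open trades data freshness for label correctness; survival modeling or delay correction are the alternatives when that trade is too costly.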

A second angle — serving-constrained CTR at low latency

If the constraint is a strict p99 latency (e.g., 20ms) or on-device scoring, you shift emphasis: prefer compact models (logistic or shallow trees), aggressive feature precomputation and caching, and model compression/quantization or distillation. Offline evaluation still uses the same calibration and CV techniques, but you must include an extra pillar: model-size vs performance tradeoff measured by latency budgets and memory. Also prioritize feature-store guarantees (low read latency, consistent feature versions) and design a fallback scoring path for cache misses. The core ML engineering thinking (label correctness, calibration, validation) stays identical — only the implementation constraints and optimization knobs change.
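
A fallback scoring path for cache misses might look like the sketch below. It assumes a logistic model with a flat weight dictionary and a dict-like feature cache keyed by user id; every name here (`score`, `cache`, `defaults`) is hypothetical and stands in for whatever the serving stack provides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(request_features, user_id, cache, weights, bias, defaults):
    """Logistic score from request-time features plus cached user features.

    On a cache miss the precomputed defaults are used instead of a slow
    lookup, so the request stays inside its latency budget.
    Returns (probability, used_fallback) so fallbacks can be monitored.
    """
    cached = cache.get(user_id)
    used_fallback = cached is None
    user_features = defaults if used_fallback else cached

    z = bias
    for name, w in weights.items():
        if name in request_features:
            x = request_features[name]
        else:
            # Missing cached feature: fall back to its default, then to 0.0.
            x = user_features.get(name, defaults.get(name, 0.0))
        z += w * x
    return sigmoid(z), used_fallback
```

Returning the fallback flag lets the fallback rate be tracked as a serving metric; a rising rate points to a cache or feature-freshness problem rather than a model problem.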

Common pitfalls

Pitfall: Evaluating with random cross-validation on temporally evolving data. This produces over-optimistic metrics because future information leaks into training; prefer forward-time validation and user-heldout folds.
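
A forward-time split can be sketched as an expanding window in which every training row is strictly earlier than the validation fold. The function name and the optional `gap` parameter (a buffer for delayed labels) are assumptions for illustration; user-heldout partitioning would be layered on top and is not shown.

```python
def forward_time_splits(timestamps, n_folds, gap=0):
    """Yield (train_idx, valid_idx) lists of row indices.

    Rows are ordered by timestamp and cut into n_folds + 1 chunks; fold k
    validates on chunk k and trains on rows strictly earlier than the
    validation start minus `gap`.
    """
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    fold_size = len(order) // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough rows for the requested number of folds")

    for k in range(1, n_folds + 1):
        start = k * fold_size
        # The last fold absorbs any leftover rows.
        end = len(order) if k == n_folds else start + fold_size
        valid = order[start:end]
        valid_start = timestamps[valid[0]]
        train = [i for i in order[:start] if timestamps[i] < valid_start - gap]
        yield train, valid
```

Rows that share a timestamp with the start of the validation fold are dropped from training rather than risk leakage across the boundary.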

Pitfall: Downsampling negatives for training but reporting unadjusted probabilities. Downsampling raises the positive base rate the model sees, so raw predictions overestimate CTR; you must correct predicted probabilities via importance weighting or fit a calibration layer on properly re-weighted validation data.
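
One common closed-form correction, assuming all positives are kept and negatives are kept uniformly at random with rate $r$: sampling multiplies the odds by $1/r$, so a prediction $p'$ from the downsampled model maps back to $p = \frac{p'}{p' + (1-p')/r}$. The sketch below applies that formula; the function name and the uniform-sampling assumption are for illustration only.

```python
def correct_downsampled(p_sampled, keep_rate):
    """Map a probability learned on negative-downsampled data back to the
    original base rate.

    Assumes every positive was kept and each negative was kept independently
    with probability keep_rate (0 < keep_rate <= 1).
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return p_sampled / (p_sampled + (1.0 - p_sampled) / keep_rate)
```

For example, with a true CTR of 1% and negatives kept at 10%, the model sees positives at about 9.2% of rows; the correction maps that rate back to 1%.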

Pitfall: Ignoring online/offline parity. A model trained on richer offline features that are unavailable or stale online will perform unreliably in production; always design and test with production feature availability and staleness.
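
One way to test parity is to compare the distribution of each feature in the offline training set against the values logged at serving time. The sketch below uses the Population Stability Index (PSI), the statistic named earlier for drift monitoring, as the comparison; applying it to offline/online skew, the fixed bin `edges` argument, and the `eps` floor for empty bins are all assumptions of this example.

```python
import math
from bisect import bisect_right

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    edges: sorted interior bin boundaries shared by both samples.
    Bin fractions are floored at eps so empty bins do not produce log(0).
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = fractions(expected)  # e.g. offline training values
    a = fractions(actual)    # e.g. values logged at serving time
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the binned distributions match; alert thresholds are a team decision and depend on bin count and sample size.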

Connections

Interviewers may pivot to ranking and recommender evaluation (NDCG, position bias) or to online experimentation infrastructure (bucket allocation, logging for counterfactual policy evaluation). They might also ask about feature-store architectures and how they affect real-time scoring tradeoffs.
