Data Leakage and Time-Aware Validation
Asked of: Data Scientist
Last updated

-
What it is
Data leakage occurs when information that would not be available at prediction time enters training or validation and inflates offline metrics. A classic case is label (target) leakage, where a feature directly or indirectly reveals the true label. (developers.google.com) Time-aware validation evaluates models on time-ordered splits (e.g., expanding or rolling windows), so each model is trained only on data that precedes its evaluation window and the metrics approximate real deployment. (sklearn.org)
-
Why interviewers ask about it
At companies like Meta, models power ads, feed, and integrity systems. If validation leaks future information, offline wins will not translate into online lift, which wastes experiment cycles and erodes user trust. Interviewers are testing whether you can design evaluations that hold up under delayed labels, backfills, and shifting data.
-
Core ideas to know
- Use only features available at inference time; compute every feature as of prediction time t, using only data timestamped at or before t. (developers.google.com)
- Never random-split time-ordered data; split so train < validation < test chronologically. (sklearn.org)
- Prefer walk-forward (rolling-origin) or expanding-window cross‑validation to simulate retraining and deployment. (robjhyndman.com)
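- Add a gap (embargo) between the train and test windows when labels or features span several time steps, so the end of the training window cannot overlap the test window. (sklearn.org)
- Fit feature engineering (e.g., scalers, encoders, rolling stats) on each fold's training portion only, after splitting, then apply it to that fold's validation portion.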
- Keep a final untouched, most‑recent holdout to estimate “go‑live” performance.
- Watch for proxies of the label (e.g., “delivered_at” when predicting “will_deliver”). (developers.google.com)
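
Several of the ideas above (chronological folds, a gap, per-fold preprocessing, a final recent holdout) can be sketched together with scikit-learn. This is a minimal illustration, not a recommended configuration: the data is synthetic, rows are assumed to be already sorted by time, and the fold count, test size, gap of 5 rows, and Ridge model are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: rows are assumed to be sorted by time already.
rng = np.random.default_rng(0)
n_rows = 300
X = rng.normal(size=(n_rows, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n_rows)

# Reserve the most recent rows as an untouched final holdout.
holdout_size = 50
X_dev, y_dev = X[:-holdout_size], y[:-holdout_size]
X_holdout, y_holdout = X[-holdout_size:], y[-holdout_size:]

# Expanding-window folds with a 5-row gap between train and validation.
# Pass max_train_size=... for a fixed-length rolling window instead.
tscv = TimeSeriesSplit(n_splits=4, test_size=40, gap=5)

fold_mae = []
fold_bounds = []
for train_idx, val_idx in tscv.split(X_dev):
    # A fresh pipeline per fold: the scaler is fit on the training rows only.
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X_dev[train_idx], y_dev[train_idx])
    preds = model.predict(X_dev[val_idx])
    fold_mae.append(mean_absolute_error(y_dev[val_idx], preds))
    fold_bounds.append((train_idx.max(), val_idx.min(), val_idx.max()))

# Each fold trains strictly before it validates, separated by the gap.
for last_train, first_val, last_val in fold_bounds:
    print(f"train ends at {last_train}, validation covers {first_val}-{last_val}")
print("mean CV MAE:", float(np.mean(fold_mae)))

# Only after model selection is finished: score once on the holdout.
final_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
final_model.fit(X_dev, y_dev)
holdout_mae = mean_absolute_error(y_holdout, final_model.predict(X_holdout))
print("holdout MAE:", float(holdout_mae))
```

Wrapping the scaler and model in one pipeline keeps the preprocessing fit tied to each fold's training rows, which is usually easier to audit than refitting transformers by hand.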
-
A common pitfall
Candidates often compute rolling means, target encodings, or normalizers on the full dataset before splitting. That contaminates earlier timestamps with future information and makes validation look unrealistically good. Others use scikit-learn's TimeSeriesSplit correctly but forget to refit preprocessing inside each fold, which recreates the same leak. Be explicit about when each feature is computed, and show that your split never trains on future data. (sklearn.org)
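
One way to make the "when is this feature computed" question concrete is to check whether changing a future value alters an earlier feature. The sketch below uses pandas on a small made-up series; the column names (`day`, `clicks`) and the window length are hypothetical, and the centered window stands in for any feature whose computation reaches past time t.

```python
import pandas as pd

# Hypothetical daily series; column names and values are illustrative.
df = pd.DataFrame(
    {
        "day": pd.date_range("2024-01-01", periods=8, freq="D"),
        "clicks": [10, 12, 9, 14, 20, 18, 25, 30],
    }
)

# Leaky: a centered window averages over a day that comes after each row.
df["leaky_mean"] = df["clicks"].rolling(window=3, center=True).mean()

# As-of safe: shift by one row first, so the window ends at day t-1.
df["safe_mean"] = df["clicks"].shift(1).rolling(window=3).mean()

# Check: changing the final (future) value must not change earlier features.
altered = df["clicks"].copy()
altered.iloc[-1] = 1000
safe_after_change = altered.shift(1).rolling(window=3).mean()
leaky_after_change = altered.rolling(window=3, center=True).mean()

safe_unchanged = df["safe_mean"].iloc[:-1].equals(safe_after_change.iloc[:-1])
leaky_unchanged = df["leaky_mean"].iloc[:-1].equals(leaky_after_change.iloc[:-1])
print(safe_unchanged, leaky_unchanged)  # prints: True False
```

The same perturbation check can be applied to target encodings or normalizers: if editing a later row changes the feature value of an earlier row, the feature is looking ahead.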
-
Further reading
- scikit-learn: TimeSeriesSplit — API and examples, including the gap parameter for avoiding lookahead overlap. https://sklearn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
- Google ML Crash Course: Monitoring pipelines — clear definition and examples of label leakage in production ML. https://developers.google.com/machine-learning/crash-course/production-ml-systems/monitoring
- Rob J Hyndman: Cross-validation for time series — explains rolling-origin (walk‑forward) evaluation and why it mirrors forecasting deployment. https://robjhyndman.com/hyndsight/tscv/