Detect Data Leakage in Supervised Learning Pipelines
Company: Boston Consulting Group
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
##### Scenario
The company screens ML engineers with a 90-minute CodeSignal test that combines conceptual multiple-choice questions with Python modeling tasks.
##### Question
1. State and interpret the bias and variance terms in the bias–variance decomposition.
2. Which regularization technique(s) can shrink linear-model coefficients exactly to zero, and why?
3. Name two practical approaches for detecting data leakage in a supervised learning pipeline.
4. Given a dataframe df(user_id, event_time, event_type, purchase), build a binary classifier that predicts whether a user will purchase within the next 7 days, and report AUC on a held-out set.
5. Implement logistic regression with gradient descent using only numpy, and provide convergence diagnostics.
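
For the numpy-only logistic regression task, a minimal sketch might look like the following. It assumes batch gradient descent on the mean log-loss with an optional L2 penalty. The names `sigmoid`, `log_loss`, `fit_logistic_gd` and the parameters `lr`, `tol`, `l2` are illustrative choices, not part of the original question. The convergence diagnostics are the per-iteration loss, the gradient norm, and a `converged` flag.

```python
import numpy as np

def sigmoid(z):
    # Stable logistic function via the identity sigma(z) = 0.5 * (1 + tanh(z / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def log_loss(y, p, eps=1e-12):
    # Mean negative log-likelihood; clipping avoids log(0).
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def fit_logistic_gd(X, y, lr=0.1, max_iter=5000, tol=1e-6, l2=0.0):
    """Batch gradient descent; returns weights, bias, and a diagnostics dict."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    gnorm = np.inf
    history = {"loss": [], "grad_norm": []}
    for _ in range(max_iter):
        p = sigmoid(X @ w + b)
        err = p - y
        grad_w = X.T @ err / n + l2 * w   # gradient of mean log-loss + L2 term
        grad_b = err.mean()               # bias is left unpenalized
        gnorm = float(np.sqrt(grad_w @ grad_w + grad_b ** 2))
        history["loss"].append(log_loss(y, p) + 0.5 * l2 * float(w @ w))
        history["grad_norm"].append(gnorm)
        if gnorm < tol:                   # stop once the gradient is (nearly) zero
            break
        w -= lr * grad_w
        b -= lr * grad_b
    history["converged"] = gnorm < tol
    return w, b, history

if __name__ == "__main__":
    # Synthetic, noisy (hence non-separable) data, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (rng.random(500) < sigmoid(X @ np.array([1.5, -2.0, 0.5]))).astype(float)
    w, b, hist = fit_logistic_gd(X, y, lr=0.5)
    print("iterations:", len(hist["loss"]), "converged:", hist["converged"])
    print("final loss:", hist["loss"][-1], "final grad norm:", hist["grad_norm"][-1])
```

Plotting `history["loss"]` and `history["grad_norm"]` against the iteration index gives a quick visual diagnostic. A loss that rises or oscillates usually points to a learning rate that is too large or to unscaled features.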
##### Hints
Discuss the bias–variance trade-off, L1 geometry, validation splits, and temporal leakage checks, and write clean, vectorized Python.
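
As a reference for the first two hints, the standard results can be written out under explicit assumptions. The notation is introduced here for illustration and does not come from the question itself. The first line assumes squared-error loss, data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ independent of the training set, and an estimator $\hat f_D$ fit on a random training set $D$. The second line assumes the lasso objective $\tfrac12\lVert y - Xw\rVert_2^2 + \lambda\lVert w\rVert_1$ with an orthonormal design ($X^\top X = I$).

```latex
\begin{align*}
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  &= \underbrace{\bigl(\mathbb{E}_D[\hat f_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{irreducible noise}} \\[6pt]
\hat w^{\text{lasso}}_j
  &= \operatorname{sign}\!\bigl(\hat w^{\text{OLS}}_j\bigr)\,
     \max\!\bigl(\lvert \hat w^{\text{OLS}}_j \rvert - \lambda,\; 0\bigr)
\end{align*}
```

Bias measures how far the average fitted model is from the true function, and variance measures how much the fit changes from one training set to another. The soft-thresholding form shows why L1 produces exact zeros: any coefficient whose least-squares value is at most $\lambda$ in magnitude is set to zero, whereas an L2 (ridge) penalty only rescales coefficients toward zero. Elastic net inherits the same sparsity through its L1 term.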
Quick Answer: This question tests core ML concepts, mathematical intuition, training and evaluation trade-offs, and awareness of practical failure modes. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how the result was validated.
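
One way to make the validation step concrete for the purchase-prediction task is a time-based snapshot split, sketched below with pandas. It assumes `event_time` is a datetime column and `purchase` is a 0/1 flag on each event row; the question does not specify either. The helper names (`build_snapshot`, `temporal_split`, `auc_score`, `single_feature_auc`) and the three example features are hypothetical. The sketch illustrates two leakage checks: a guard that the training label window closes before the test snapshot, and a per-feature AUC scan to flag features that look too good to be true.

```python
import numpy as np
import pandas as pd

HORIZON = pd.Timedelta(days=7)

def build_snapshot(df, cutoff, horizon=HORIZON):
    """Features use only events at or before `cutoff`; the label uses only
    events in the window (cutoff, cutoff + horizon]."""
    past = df[df["event_time"] <= cutoff]
    future = df[(df["event_time"] > cutoff) & (df["event_time"] <= cutoff + horizon)]

    feats = past.groupby("user_id").agg(
        n_events=("event_type", "size"),
        n_purchases=("purchase", "sum"),
        last_event=("event_time", "max"),
    )
    feats["days_since_last"] = (cutoff - feats["last_event"]).dt.total_seconds() / 86400.0
    feats = feats.drop(columns="last_event")

    buyers = future.loc[future["purchase"] == 1, "user_id"].unique()
    feats["label"] = feats.index.isin(buyers).astype(int)
    return feats

def temporal_split(df, train_cutoff, test_cutoff, horizon=HORIZON):
    # Leakage check 1: the training label window must close before the test snapshot.
    if train_cutoff + horizon > test_cutoff:
        raise ValueError("training label window overlaps the test period")
    return build_snapshot(df, train_cutoff, horizon), build_snapshot(df, test_cutoff, horizon)

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity, averaging ranks for ties."""
    y_true = np.asarray(y_true)
    ranks = pd.Series(np.asarray(scores, dtype=float)).rank(method="average").to_numpy()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class
    return float((ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def single_feature_auc(snapshot):
    # Leakage check 2: a lone feature with AUC near 0 or 1 deserves scrutiny.
    y = snapshot["label"].to_numpy()
    return {c: auc_score(y, snapshot[c]) for c in snapshot.columns if c != "label"}
```

The two snapshots returned by `temporal_split` can be passed to any classifier: fit on the earlier snapshot, score the later one, and report `auc_score` on the later snapshot's labels. Users with no events before a cutoff are absent from that snapshot, which is a design choice worth stating in an answer.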