Detect Data Leakage in Supervised Learning Pipelines
Company: Boston Consulting Group
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
##### Scenario
The company screens ML engineers with a 90-minute CodeSignal test that contains conceptual multiple-choice questions (MCQs) and Python modeling tasks.
##### Question
1. State and interpret the bias and variance terms in the bias–variance decomposition.
2. Which regularization technique(s) can shrink linear-model coefficients exactly to zero, and why?
3. Name two practical approaches for detecting data leakage in a supervised learning pipeline.
4. Given dataframe df(user_id, event_time, event_type, purchase), build a binary classifier predicting whether a user will purchase within the next 7 days and report AUC on a held-out set.
5. Implement logistic regression with gradient descent using only numpy; provide convergence diagnostics.
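For reference, the decomposition that the first part asks about is usually stated for squared-error loss. The sketch below assumes a data model $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, where $\hat f$ is fit on a random training set; these symbols are notation chosen here and do not come from the question.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over training sets and noise. The bias term measures how far the average fitted model is from the true function (systematic error from restrictive assumptions); the variance term measures how much the fitted model changes from one training sample to another.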
##### Hints
Discuss the bias–variance trade-off, the geometry of the L1 penalty, validation splits, and temporal leakage checks, and write clean, vectorized Python.
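One possible shape for the numpy-only logistic regression is sketched below. It is a minimal illustration, not a reference solution: the function names (`sigmoid`, `log_loss`, `fit_logreg_gd`), the default learning rate, the optional L2 penalty, and the gradient-norm stopping rule are all assumptions made here. The convergence diagnostics are the per-iteration loss and gradient norm.

```python
import numpy as np


def sigmoid(z):
    # tanh form of the logistic function; avoids overflow in exp for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def log_loss(y, p, eps=1e-12):
    # Mean binary cross-entropy, with probabilities clipped away from 0 and 1
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def fit_logreg_gd(X, y, lr=0.1, max_iter=5000, tol=1e-6, l2=0.0):
    """Batch gradient descent for logistic regression (hypothetical helper).

    Returns the weights (intercept first) and a history dict holding the
    loss and gradient norm at every iteration.
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    w = np.zeros(d + 1)
    history = {"loss": [], "grad_norm": [], "converged": False}

    for _ in range(max_iter):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / n          # vectorized gradient of the mean loss
        grad[1:] += l2 * w[1:]             # L2 penalty; intercept is not penalized
        loss = log_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)

        grad_norm = np.linalg.norm(grad)
        history["loss"].append(loss)
        history["grad_norm"].append(grad_norm)

        if grad_norm < tol:                # stop when the gradient is ~0
            history["converged"] = True
            break
        w -= lr * grad

    return w, history
```

With standardized features and a moderate learning rate, the recorded loss should decrease at every step and the gradient norm should shrink toward zero; a loss curve that oscillates or grows usually points to a learning rate that is too large or to unscaled features.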
Quick Answer: This question evaluates a candidate's competency in detecting and preventing data leakage in supervised learning pipelines, understanding bias–variance decomposition, recognizing regularization effects on sparsity, and implementing a temporally correct logistic regression with held‑out AUC evaluation.
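The temporal setup and the two leakage checks could be sketched roughly as follows. This assumes `purchase` is a 0/1 flag recorded per event and `event_time` is a datetime column, neither of which the question confirms. The helper names (`build_snapshot`, `auc_score`, `suspicious_features`), the feature choices, and the 0.95 threshold are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def build_snapshot(df, cutoff, horizon_days=7):
    """Features from events strictly before `cutoff`; label from purchases
    in [cutoff, cutoff + horizon_days). Assumes `purchase` is 0/1 per event."""
    cutoff = pd.Timestamp(cutoff)
    end = cutoff + pd.Timedelta(days=horizon_days)
    past = df[df["event_time"] < cutoff]
    future = df[(df["event_time"] >= cutoff) & (df["event_time"] < end)]

    feats = past.groupby("user_id").agg(
        n_events=("event_type", "size"),
        n_past_purchases=("purchase", "sum"),
        last_event=("event_time", "max"),
    )
    # Leakage check 1: no feature may use an event at or after the cutoff
    assert (feats["last_event"] < cutoff).all(), "feature window leaks past cutoff"
    feats["days_since_last"] = (
        (cutoff - feats["last_event"]).dt.total_seconds() / 86400.0
    )
    feats = feats.drop(columns="last_event")

    buyers = future.loc[future["purchase"] == 1, "user_id"].unique()
    feats["label"] = feats.index.isin(buyers).astype(int)
    return feats


def auc_score(y_true, scores):
    # Rank-based (Mann-Whitney U) AUC; ties receive average ranks
    y_true = np.asarray(y_true)
    ranks = pd.Series(np.asarray(scores, dtype=float)).rank(method="average").to_numpy()
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def suspicious_features(X, y, threshold=0.95):
    # Leakage check 2: flag any single feature that alone nearly separates the label
    flagged = {}
    for col in X.columns:
        a = auc_score(y, X[col].to_numpy())
        a = max(a, 1.0 - a)  # direction of the relationship does not matter
        if a >= threshold:
            flagged[col] = a
    return flagged
```

A held-out set that respects time can then be built by calling `build_snapshot` with a training cutoff and a test cutoff at least `horizon_days` later, so the training label window never overlaps the test feature window. The model is fit on the first snapshot and `auc_score` is reported on the second.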