ML Take‑home: Bias–Variance, Regularization, Leakage, and From‑scratch Logistic Regression
Context
You are given user event logs in a pandas DataFrame `df` with columns:
- `user_id`: unique user identifier
- `event_time`: timestamp of the event
- `event_type`: categorical event name (e.g., view, click, add_to_cart, purchase)
- `purchase`: indicator (0/1) marking whether the event is a purchase
Your goal is to build a leakage‑free binary classifier that predicts whether a user will purchase within the next 7 days, then evaluate AUC on a held‑out set.
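For concreteness, the 7-day forward-looking label can be built per user at an anchor time; a minimal pandas sketch (the function name and anchor value are illustrative, not part of the spec):

```python
import pandas as pd

def make_labels(df, anchor, horizon_days=7):
    """Label each user seen by `anchor` with 1 if they purchase within
    `horizon_days` after `anchor`, else 0. Features for these users must
    be built only from events at or before `anchor`."""
    horizon_end = anchor + pd.Timedelta(days=horizon_days)
    # Events strictly after the anchor, up to the end of the label window.
    future = df[(df["event_time"] > anchor) & (df["event_time"] <= horizon_end)]
    purchased = future.groupby("user_id")["purchase"].max()
    # Only users with history at or before the anchor are eligible examples.
    users = df.loc[df["event_time"] <= anchor, "user_id"].unique()
    return purchased.reindex(users, fill_value=0).astype(int)
```

Users who first appear after the anchor are deliberately excluded, since no leakage-free features exist for them.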
Tasks
- Bias–variance decomposition
  - State the bias and variance terms in the bias–variance decomposition and interpret each.
- Regularization and sparsity
  - Which regularization technique(s) can shrink linear-model coefficients exactly to zero, and why?
- Detecting data leakage
  - Name two practical approaches for detecting data leakage in a supervised learning pipeline.
- Modeling: logistic regression from scratch
  - Using `df` (columns `user_id`, `event_time`, `event_type`, `purchase`), build a binary classifier that predicts whether a user will purchase within the next 7 days.
  - Use a temporally correct split and report AUC on a held-out set.
  - Implement logistic regression with gradient descent, using only numpy for the model (pandas is allowed for data prep). Provide basic convergence diagnostics.
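For reference, the decomposition in question for squared error, with true function $f$, estimator $\hat f_D$ fit on training set $D$, and noise variance $\sigma^2$ (bias measures the systematic error of the average fit; variance measures sensitivity to the training sample):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```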
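The sparsity question can be made concrete: an L1 (lasso) penalty has soft-thresholding as its proximal operator, which sets small coefficients exactly to zero, whereas an L2 penalty only rescales them. A small numpy illustration (the threshold value is arbitrary):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrinks each coefficient
    toward zero and sets those with |w_j| <= lam exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6, 0.01])
print(soft_threshold(w, 0.1))  # coefficients below the threshold become exactly 0
```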
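A minimal numpy-only reference sketch of the expected shape of the model, with the loss history kept as a simple convergence diagnostic (function names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logreg(X, y, lr=0.1, n_iter=1000, tol=1e-8):
    """Batch gradient descent on the mean log loss.
    Returns weights, bias, and the per-iteration loss history."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    eps = 1e-12  # guards log(0)
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
        # Stop when the loss plateaus (one cheap convergence diagnostic).
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break
    return w, b, losses
```

Plotting `losses` (or checking that it is monotonically decreasing) is usually enough to diagnose a badly chosen learning rate.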
Implementation requirements
- Ensure no temporal leakage: features must use only data up to an anchor time; labels look forward 7 days after the anchor.
- Clean, vectorized Python; no sklearn for the model or metrics (implement AUC yourself).
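The no-future-data requirement can also be enforced defensively; a tiny guardrail sketch (the function name is illustrative) to run on whatever event slice feeds feature construction:

```python
import pandas as pd

def assert_no_future_events(feature_events, anchor):
    """Guardrail: every row used to build features must be at or before
    the anchor time; raise if any leaks in from the label window."""
    late = feature_events[feature_events["event_time"] > anchor]
    if not late.empty:
        raise ValueError(f"{len(late)} feature rows occur after the anchor time")
```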
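Since sklearn metrics are off-limits, AUC can be computed from the Mann–Whitney rank-sum identity; a numpy-only sketch with tie handling (the function name is an assumption):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney identity: the probability that a random
    positive example is scored above a random negative one.
    Tied scores receive their average rank."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s, kind="mergesort")
    # Average ranks within each group of tied scores.
    _, inverse, counts = np.unique(s[order], return_inverse=True, return_counts=True)
    csum = np.cumsum(counts)
    avg_rank = (csum - counts + csum + 1) / 2.0
    ranks = np.empty(len(s))
    ranks[order] = avg_rank[inverse]
    n_pos = (y == 1).sum()
    n_neg = len(y) - n_pos
    # Rank-sum of positives, minus its minimum possible value, normalized.
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```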