ML Take‑home: Bias–Variance, Regularization, Leakage, and From‑scratch Logistic Regression
Context
You are given user event logs in a pandas DataFrame `df` with columns:
- `user_id`: unique user identifier
- `event_time`: timestamp of the event
- `event_type`: categorical event name (e.g., view, click, add_to_cart, purchase)
- `purchase`: indicator (0/1) marking whether the event is a purchase
Your goal is to build a leakage‑free binary classifier that predicts whether a user will purchase within the next 7 days, then evaluate AUC on a held‑out set.
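For concreteness, the 7-day forward-looking label can be built per user at an anchor time; a minimal pandas sketch (the function name and anchor value are illustrative, not part of the spec):

```python
import pandas as pd

def make_labels(df, anchor, horizon_days=7):
    """Label each user seen by `anchor` with 1 if they purchase within
    `horizon_days` after `anchor`, else 0. Features for these users must
    be built only from events at or before `anchor`."""
    horizon_end = anchor + pd.Timedelta(days=horizon_days)
    # Events strictly after the anchor, up to the end of the label window.
    future = df[(df["event_time"] > anchor) & (df["event_time"] <= horizon_end)]
    purchased = future.groupby("user_id")["purchase"].max()
    # Only users with history at or before the anchor are eligible examples.
    users = df.loc[df["event_time"] <= anchor, "user_id"].unique()
    return purchased.reindex(users, fill_value=0).astype(int)
```

Users who first appear after the anchor are deliberately excluded, since no leakage-free features exist for them.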
Tasks
- Bias–variance decomposition
  - State the bias and variance terms in the bias–variance decomposition and interpret each.
- Regularization and sparsity
  - Which regularization technique(s) can shrink linear-model coefficients exactly to zero, and why?
- Detecting data leakage
  - Name two practical approaches for detecting data leakage in a supervised learning pipeline.
- Modeling: logistic regression from scratch
  - Using `df` (columns `user_id`, `event_time`, `event_type`, `purchase`), build a binary classifier that predicts whether a user will purchase within the next 7 days.
  - Use a temporally correct split and report AUC on a held-out set.
  - Implement logistic regression with gradient descent, using only numpy for the model (pandas is allowed for data prep). Provide basic convergence diagnostics.
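For reference, the decomposition in question for squared error, with true function $f$, estimator $\hat f_D$ fit on training set $D$, and noise variance $\sigma^2$ (bias measures the systematic error of the average fit; variance measures sensitivity to the training sample):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```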
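The sparsity question can be made concrete: an L1 (lasso) penalty has soft-thresholding as its proximal operator, which sets small coefficients exactly to zero, whereas an L2 penalty only rescales them. A small numpy illustration (the threshold value is arbitrary):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrinks each coefficient
    toward zero and sets those with |w_j| <= lam exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.6, 0.01])
print(soft_threshold(w, 0.1))  # coefficients below the threshold become exactly 0
```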
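A minimal numpy-only reference sketch of the expected shape of the model, with the loss history kept as a simple convergence diagnostic (function names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logreg(X, y, lr=0.1, n_iter=1000, tol=1e-8):
    """Batch gradient descent on the mean log loss.
    Returns weights, bias, and the per-iteration loss history."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    eps = 1e-12  # guards log(0)
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        losses.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
        # Stop when the loss plateaus (one cheap convergence diagnostic).
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break
    return w, b, losses
```

Plotting `losses` (or checking that it is monotonically decreasing) is usually enough to diagnose a badly chosen learning rate.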
Implementation requirements
- Ensure no temporal leakage: features must use only data up to an anchor time; labels look forward 7 days after the anchor.
- Clean, vectorized Python; no sklearn for the model or metrics (implement AUC yourself).
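The no-future-data requirement can also be enforced defensively; a tiny guardrail sketch (the function name is illustrative) to run on whatever event slice feeds feature construction:

```python
import pandas as pd

def assert_no_future_events(feature_events, anchor):
    """Guardrail: every row used to build features must be at or before
    the anchor time; raise if any leaks in from the label window."""
    late = feature_events[feature_events["event_time"] > anchor]
    if not late.empty:
        raise ValueError(f"{len(late)} feature rows occur after the anchor time")
```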
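Since sklearn metrics are off-limits, AUC can be computed from the Mann–Whitney rank-sum identity; a numpy-only sketch with tie handling (the function name is an assumption):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney identity: the probability that a random
    positive example is scored above a random negative one.
    Tied scores receive their average rank."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(s, kind="mergesort")
    # Average ranks within each group of tied scores.
    _, inverse, counts = np.unique(s[order], return_inverse=True, return_counts=True)
    csum = np.cumsum(counts)
    avg_rank = (csum - counts + csum + 1) / 2.0
    ranks = np.empty(len(s))
    ranks[order] = avg_rank[inverse]
    n_pos = (y == 1).sum()
    n_neg = len(y) - n_pos
    # Rank-sum of positives, minus its minimum possible value, normalized.
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```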