In a live notebook (e.g., Jupyter), you are given a messy, real-world tabular dataset for a binary classification problem.
Data characteristics
- Target label: y ∈ {0,1}
- Mix of numeric and categorical features
- Missing values, inconsistent strings (e.g., "NA", empty), and possible outliers
- Some columns may be identifiers (e.g., user_id, transaction_id) and should not be used as predictive features
- Dataset is “medium-sized” (fits in memory); you can train a simple model quickly
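
A minimal loading-and-cleaning sketch for data with these characteristics is shown below. The file name data.csv and the extra missing-value sentinels are assumptions for illustration; the target column y and the user_id/transaction_id identifiers come from the description above.

```python
import numpy as np
import pandas as pd

# Assumed file name for illustration; substitute the path provided in the session.
# "NA" and empty strings are treated as missing, per the data description.
df = pd.read_csv("data.csv", na_values=["NA", "N/A", ""])

# Turn empty or whitespace-only strings in object columns into proper NaN.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].replace(r"^\s*$", np.nan, regex=True)

# Drop identifier columns so they are never used as predictive features.
id_cols = [c for c in ("user_id", "transaction_id") if c in df.columns]
df = df.drop(columns=id_cols)

# Quick sanity checks: shape, dtypes, missingness, and class balance.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))
print(df["y"].value_counts(normalize=True))
```

The class-balance check at the end is what informs the metric choice asked for in the Task section below.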
Task
Within the session, produce a working end-to-end baseline that:
- Loads the data and performs minimal but correct cleaning.
- Splits data into train/validation (and optionally test) without leakage.
- Builds a simple model that can handle mixed feature types (or uses preprocessing to enable this).
- Evaluates performance with an appropriate metric (e.g., ROC-AUC / PR-AUC / F1, depending on class imbalance).
- Briefly explains your choices (feature selection, preprocessing, model choice, and how you’d improve it if given more time).
You may choose only a few features if that helps you deliver a robust, working solution quickly.
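
One possible end-to-end baseline is sketched below, assuming the cleaned frame df from the loading sketch above and the target column y. Logistic regression with median/most-frequent imputation and one-hot encoding is one reasonable default, not the required approach; the split fraction, random seed, and class_weight setting are illustrative choices.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features and target; df is the cleaned frame from the loading sketch above.
X = df.drop(columns=["y"])
y = df["y"]

# Stratified split preserves the class ratio; all fitting happens on train only,
# and imputers/encoders live inside the pipeline, which avoids leakage.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)

# Impute + scale numeric columns; impute + one-hot encode categorical columns.
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)

# A simple, fast baseline; class_weight="balanced" helps if classes are skewed.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)

# Threshold-free metrics; PR-AUC is usually more informative under imbalance.
val_scores = model.predict_proba(X_val)[:, 1]
print("ROC-AUC:", roc_auc_score(y_val, val_scores))
print("PR-AUC :", average_precision_score(y_val, val_scores))
```

If the class-balance check shows heavy skew, report PR-AUC (average precision) as the headline number; otherwise ROC-AUC is fine. Median imputation also keeps the numeric preprocessing reasonably robust to outliers. With more time, natural next steps are a gradient-boosted tree model such as HistGradientBoostingClassifier (which handles missing values natively), cross-validation instead of a single split, and a quick look at feature importance to prune weak columns.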