Build a baseline classification model from messy data
Company: Coinbase
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Onsite
In a live notebook (e.g., Jupyter), you are given a messy, real-world tabular dataset for a **binary classification** problem.
**Data characteristics**
- Target label: `y` ∈ {0,1}
- Mix of numeric and categorical features
- Missing values, inconsistently encoded missing markers (e.g., the string "NA" or empty strings), and possible outliers
- Some columns may be identifiers (e.g., `user_id`, `transaction_id`) and should not be used as predictive features
- Dataset is “medium-sized” (fits in memory); you can train a simple model quickly
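The cleaning implied by these characteristics can be sketched roughly as follows. This is a minimal illustration, not a prescribed solution: the `MISSING_TOKENS` set, the helper names `normalize_missing` and `identifier_like`, and the toy columns (`amount`, `country`) are assumptions made for the example, and the name-based identifier check is only a heuristic (a uniqueness check is another option).

```python
import pandas as pd

# Hypothetical set of strings that stand in for "missing"; extend after inspecting the data.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace in string columns, map missing-value tokens to NaN,
    and convert a column to numeric when every remaining value parses."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Only touch string-like columns; numeric columns are left unchanged.
        if not (pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s)):
            continue
        stripped = s.map(lambda v: v.strip() if isinstance(v, str) else v)
        is_token = stripped.map(
            lambda v: isinstance(v, str) and v.lower() in MISSING_TOKENS
        ).astype(bool)
        cleaned = stripped.where(~is_token)  # tokens become NaN
        as_num = pd.to_numeric(cleaned, errors="coerce")
        # Keep the numeric version only if no non-missing value was lost in parsing.
        if cleaned.notna().any() and as_num.notna().sum() == cleaned.notna().sum():
            cleaned = as_num
        out[col] = cleaned
    return out


def identifier_like(df: pd.DataFrame) -> list:
    """Name-based heuristic: columns called 'id' or ending in '_id'."""
    return [c for c in df.columns if c.lower() == "id" or c.lower().endswith("_id")]


# Toy data standing in for the interview dataset.
raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "amount": ["10.5", "NA", " 7 ", ""],
    "country": ["US", " US", "N/A", "DE"],
    "y": [0, 1, 0, 1],
})
clean = normalize_missing(raw)
features = clean.drop(columns=identifier_like(clean) + ["y"])
print(features)
```

Outliers are deliberately left alone here; for a first baseline, clipping or a log transform can be added later if a numeric column turns out to be heavily skewed.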
**Task**
Within the session, produce a working end-to-end baseline that:
1. Loads the data and performs minimal but correct cleaning.
2. Splits data into train/validation (and optionally test) without leakage, for example by fitting any imputation, scaling, or encoding on the training split only.
3. Builds a simple model that can handle mixed feature types (or uses preprocessing to enable this).
4. Evaluates performance with an appropriate metric (e.g., ROC-AUC / PR-AUC / F1, depending on class imbalance).
5. Comes with a brief explanation of your choices (feature selection, preprocessing, model choice, and how you’d improve it if given more time).
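One way the steps above could fit together is sketched below with scikit-learn. It is an illustrative baseline under stated assumptions, not the expected answer: the data is synthetic, the column names (`amount`, `account_age_days`, `country`) are hypothetical, and logistic regression is just one reasonable simple model. All preprocessing lives inside the pipeline so that it is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the interview dataset (all column names are hypothetical).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "amount": rng.lognormal(3.0, 1.0, n),
    "account_age_days": rng.integers(1, 2000, n).astype(float),
    "country": rng.choice(["US", "DE", "BR"], n),
})
logit = (0.8 * np.log(df["amount"]) - 0.001 * df["account_age_days"]
         + 1.0 * (df["country"] == "BR") - 3.0)
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Inject missing values after the label is generated.
df.loc[rng.random(n) < 0.1, "amount"] = np.nan
df.loc[rng.random(n) < 0.1, "country"] = np.nan

# Stratified split keeps the class ratio similar in both parts.
X, y = df.drop(columns=["y"]), df["y"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

numeric_cols = ["amount", "account_age_days"]
categorical_cols = ["country"]

# Imputers, scaler, and encoder are fitted on the training split only,
# because they are part of the pipeline that is fitted on X_train.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)

# Threshold-free metrics computed on held-out data.
val_scores = model.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, val_scores)
pr_auc = average_precision_score(y_val, val_scores)
print(f"positive rate={y_val.mean():.3f}  ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
```

Reporting PR-AUC next to the positive rate is useful when classes are imbalanced, since a random ranker's PR-AUC is roughly equal to the positive rate. A tree-based model would be a natural next step given more time.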
You may choose only a few features if that helps you deliver a robust, working solution quickly.
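A quick way to shortlist a few features is to rank columns by how cheap they are to model. The sketch below is one possible heuristic, not part of the question: the function name `shortlist_features` and its thresholds (`max_missing`, `max_cardinality`, `top_k`) are assumptions chosen for illustration.

```python
import pandas as pd


def shortlist_features(df, target="y", max_missing=0.4, max_cardinality=20, top_k=5):
    """Keep up to top_k columns that are not mostly missing, not constant,
    and (for non-numeric columns) not high-cardinality."""
    candidates = []
    for col in df.columns:
        if col == target:
            continue
        s = df[col]
        missing = float(s.isna().mean())
        n_unique = int(s.nunique(dropna=True))
        if missing > max_missing:
            continue  # too sparse to be worth imputing in a quick baseline
        if n_unique <= 1:
            continue  # constant columns carry no signal
        if not pd.api.types.is_numeric_dtype(s) and n_unique > max_cardinality:
            continue  # likely free text or an identifier
        candidates.append((missing, col))
    candidates.sort()  # least-missing first, ties broken by column name
    return [col for _, col in candidates[:top_k]]


# Toy frame with hypothetical columns.
demo = pd.DataFrame({
    "amount": [10.0, None, 7.0, 3.0, 8.0, 1.0],
    "country": ["US", "US", "DE", None, "DE", "BR"],
    "free_text": ["a", "b", "c", "d", "e", "f"],
    "mostly_missing": [None, None, None, None, 1.0, 2.0],
    "constant": [1, 1, 1, 1, 1, 1],
    "y": [0, 1, 0, 1, 0, 1],
})
selected = shortlist_features(demo, max_cardinality=3)
print(selected)
```

This filter never looks at the target, so it can be run on the full dataset without leaking label information; any target-aware selection should be done on the training split only.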
Quick Answer: This question evaluates skills in practical machine learning engineering, including data cleaning, preprocessing, feature selection, handling mixed numeric and categorical features, and baseline model construction for a binary classification task.