In a live notebook (e.g., Jupyter), you are given a messy, real-world tabular dataset for a binary classification problem.
Data characteristics
- Target label: y ∈ {0,1}
- Mix of numeric and categorical features
- Missing values, inconsistent strings (e.g., "NA", empty), and possible outliers
- Some columns may be identifiers (e.g., user_id, transaction_id) and should not be used as predictive features
- Dataset is “medium-sized” (fits in memory); you can train a simple model quickly
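
A minimal loading-and-cleaning sketch for data with these characteristics is shown below. The file name data.csv and the extra missing-value sentinels are assumptions for illustration; the target column y and the user_id/transaction_id identifiers come from the description above.

```python
import numpy as np
import pandas as pd

# Assumed file name for illustration; substitute the path provided in the session.
# "NA" and empty strings are treated as missing, per the data description.
df = pd.read_csv("data.csv", na_values=["NA", "N/A", ""])

# Turn empty or whitespace-only strings in object columns into proper NaN.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].replace(r"^\s*$", np.nan, regex=True)

# Drop identifier columns so they are never used as predictive features.
id_cols = [c for c in ("user_id", "transaction_id") if c in df.columns]
df = df.drop(columns=id_cols)

# Quick sanity checks: shape, dtypes, missingness, and class balance.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))
print(df["y"].value_counts(normalize=True))
```

The class-balance check at the end is what informs the metric choice asked for in the Task section below.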
Task
Within the session, produce a working end-to-end baseline that:
- Loads the data and performs minimal but correct cleaning.
- Splits data into train/validation (and optionally test) without leakage.
- Builds a simple model that can handle mixed feature types (or uses preprocessing to enable this).
- Evaluates performance with an appropriate metric (e.g., ROC-AUC / PR-AUC / F1, depending on class imbalance).
- Briefly explains your choices (feature selection, preprocessing, model choice, and how you’d improve it if given more time).
You may choose only a few features if that helps you deliver a robust, working solution quickly.
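
One possible end-to-end baseline is sketched below, assuming the cleaned frame df from the loading sketch above and the target column y. Logistic regression with median/most-frequent imputation and one-hot encoding is one reasonable default, not the required approach; the split fraction, random seed, and class_weight setting are illustrative choices.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Features and target; df is the cleaned frame from the loading sketch above.
X = df.drop(columns=["y"])
y = df["y"]

# Stratified split preserves the class ratio; all fitting happens on train only,
# and imputers/encoders live inside the pipeline, which avoids leakage.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)

# Impute + scale numeric columns; impute + one-hot encode categorical columns.
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)

# A simple, fast baseline; class_weight="balanced" helps if classes are skewed.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
model.fit(X_train, y_train)

# Threshold-free metrics; PR-AUC is usually more informative under imbalance.
val_scores = model.predict_proba(X_val)[:, 1]
print("ROC-AUC:", roc_auc_score(y_val, val_scores))
print("PR-AUC :", average_precision_score(y_val, val_scores))
```

If the class-balance check shows heavy skew, report PR-AUC (average precision) as the headline number; otherwise ROC-AUC is fine. Median imputation also keeps the numeric preprocessing reasonably robust to outliers. With more time, natural next steps are a gradient-boosted tree model such as HistGradientBoostingClassifier (which handles missing values natively), cross-validation instead of a single split, and a quick look at feature importance to prune weak columns.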