How do I approach Machine Learning interview questions?

Machine Learning questions require both an understanding of core concepts and hands-on practice. PracHub provides solutions with explanations to help you prepare for machine learning interviews.

What difficulty level is this interview question?

This is a medium difficulty Machine Learning question, commonly asked during Onsite rounds at Coinbase.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at Coinbase during technical interviews.

Build a baseline classification model from messy data

Quick Overview

This question evaluates skills in practical machine learning engineering, including data cleaning, preprocessing, feature selection, handling mixed numeric and categorical features, and baseline model construction for a binary classification task.

In a live notebook (e.g., Jupyter), you are given a messy, real-world tabular dataset for a binary classification problem.

Data characteristics

Target label: y ∈ {0,1}
Mix of numeric and categorical features
Missing values, inconsistent strings (e.g., "NA", empty), and possible outliers
Some columns may be identifiers (e.g., user_id, transaction_id) and should not be used as predictive features
Dataset is “medium-sized” (fits in memory); you can train a simple model quickly
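
The data characteristics above suggest a small cleaning pass before any modeling. The following is a minimal sketch, not a prescribed solution: the placeholder tokens in MISSING_TOKENS and the function name basic_clean are assumptions made for illustration, and the identifier column names are taken from the examples in the prompt. It maps placeholder strings to NaN, drops identifier columns, and converts a text column to numeric only when every non-missing value parses as a number.

```python
import numpy as np
import pandas as pd

# Assumed placeholder tokens (lowercased); extend after inspecting the real data.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}
# Identifier columns named as examples in the prompt; adjust to the real schema.
ID_COLUMNS = ["user_id", "transaction_id"]


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier columns and normalize missing-value placeholders."""
    out = df.drop(columns=[c for c in ID_COLUMNS if c in df.columns])
    for col in out.columns:
        s = out[col]
        if not (pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s)):
            continue  # leave numeric, boolean, and datetime columns untouched
        # Work on plain Python objects so mixed-type columns are handled too.
        text = s.astype(object).map(lambda v: v.strip() if isinstance(v, str) else v)
        is_missing = text.map(
            lambda v: isinstance(v, str) and v.lower() in MISSING_TOKENS
        ).astype(bool)
        text = text.mask(is_missing, np.nan)
        # Convert to numeric only if no non-missing value would be lost.
        numeric = pd.to_numeric(text, errors="coerce")
        if text.notna().any() and numeric.notna().sum() == text.notna().sum():
            out[col] = numeric
        else:
            out[col] = text
    return out
```

Outliers are deliberately left alone here; for a baseline it is usually enough to note them and revisit clipping or log transforms later.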

Task

Within the session, produce a working end-to-end baseline that:

Loads the data and performs minimal but correct cleaning.
Splits data into train/validation (and optionally test) without leakage.
Builds a simple model that can handle mixed feature types (or uses preprocessing to enable this).
Evaluates performance with an appropriate metric (e.g., ROC-AUC / PR-AUC / F1, depending on class imbalance).
Briefly explains your choices (feature selection, preprocessing, model choice, and how you’d improve it if given more time).
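
One way to cover the steps above is a scikit-learn pipeline, sketched below under stated assumptions. The dataset is synthetic (make_toy_data and its amount and country columns are invented for illustration), and logistic regression is one reasonable baseline choice rather than the expected answer. Because imputation, scaling, and encoding live inside the pipeline, they are fit on the training split only, which avoids leaking validation statistics into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_data(n=600, seed=0):
    """Synthetic stand-in for the interview dataset (entirely made up)."""
    rng = np.random.default_rng(seed)
    amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
    country = rng.choice(["US", "GB", "DE"], size=n)
    logit = 1.5 * (np.log(amount) - 3.0) + 1.0 * (country == "DE") - 1.0
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    df = pd.DataFrame({"amount": amount, "country": country, "y": y})
    # Blank out roughly 10% of each feature to mimic missing data.
    df.loc[rng.random(n) < 0.1, "amount"] = np.nan
    df.loc[rng.random(n) < 0.1, "country"] = np.nan
    return df


def build_baseline(numeric_cols, categorical_cols):
    """Preprocessing plus logistic regression in a single pipeline."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        # Treat missing categories as their own level.
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    return Pipeline([("pre", pre), ("clf", clf)])


df = make_toy_data()
X, y = df.drop(columns="y"), df["y"]
# Stratify so both splits keep the same class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = build_baseline(["amount"], ["country"])
model.fit(X_train, y_train)  # preprocessing statistics come from train only

val_scores = model.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, val_scores)
pr_auc = average_precision_score(y_val, val_scores)
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")
```

Both metrics are threshold-free; with heavy class imbalance, PR-AUC is typically the more informative of the two. If rows from the same user or time period can land in both splits, a grouped or time-based split would be a safer choice than the random split used here.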

You may choose only a few features if that helps you deliver a robust, working solution quickly.
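
If time is short, a quick heuristic screen can narrow the columns down to a safe subset. The sketch below is one possible approach, not part of the original question: the function name pick_baseline_features and the thresholds max_missing and max_cardinality are assumptions chosen for illustration. It skips columns that are mostly missing, constant, or ID-like (all-unique integers or very high-cardinality strings).

```python
import pandas as pd


def pick_baseline_features(df, target="y", max_missing=0.5, max_cardinality=20):
    """Return (numeric_cols, categorical_cols) that look safe for a baseline."""
    numeric_cols, categorical_cols = [], []
    n_rows = len(df)
    for col in df.columns:
        if col == target:
            continue
        s = df[col]
        n_unique = s.nunique(dropna=True)
        if s.isna().mean() > max_missing or n_unique <= 1:
            continue  # mostly missing or constant: little signal
        if pd.api.types.is_numeric_dtype(s):
            # All-unique integer columns usually behave like identifiers.
            if pd.api.types.is_integer_dtype(s) and n_unique == n_rows:
                continue
            numeric_cols.append(col)
        elif n_unique <= max_cardinality:
            categorical_cols.append(col)
        # High-cardinality text columns (IDs, free text) are skipped.
    return numeric_cols, categorical_cols
```

The thresholds are rough defaults; any column dropped this way is worth mentioning when explaining your choices, along with how you would revisit it given more time.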
