Practical ML fundamentals (Python + scikit-learn)
You are given a small toy binary-classification dataset (e.g., arrays/dataframes X_train, y_train, X_valid, y_valid or a single dataset you must split). Your task is to:
- Train a baseline binary classifier using scikit-learn.
  - Choose a reasonable model (e.g., logistic regression, linear SVM, random forest, gradient boosting).
  - Fit it on the training set.
- Evaluate the model on the validation set using one or more evaluation metrics.
  - Common choices: accuracy, precision/recall, F1, ROC-AUC, PR-AUC, confusion matrix.
- After you see the initial metric(s), improve the evaluation metric(s).
  - You may change the model, tune hyperparameters, adjust preprocessing, address class imbalance, change decision thresholds, or revise the validation approach.
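
The first two steps (fit a baseline, then score it on the validation set) can be sketched as follows. This is a minimal illustration, not a reference solution. The data is an assumption: `make_classification` generates a synthetic, mildly imbalanced stand-in for the toy dataset, and the split sizes, the choice of logistic regression, and the variable names `baseline`, `baseline_metrics`, and `cm` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Assumption: synthetic data standing in for the toy dataset in the task.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.8, 0.2],  # roughly 80/20 class balance
    random_state=0,
)

# Stratify so both splits keep the same class ratio.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: plain logistic regression with default regularization.
baseline = LogisticRegression(max_iter=1000, random_state=0)
baseline.fit(X_train, y_train)

# Hard labels use the default 0.5 threshold; scores feed ranking metrics.
y_pred = baseline.predict(X_valid)
y_score = baseline.predict_proba(X_valid)[:, 1]

baseline_metrics = {
    "accuracy": accuracy_score(y_valid, y_pred),
    "f1": f1_score(y_valid, y_pred),
    "roc_auc": roc_auc_score(y_valid, y_score),
}
cm = confusion_matrix(y_valid, y_pred)  # rows: true class, columns: predicted

print(baseline_metrics)
print(cm)
```

On imbalanced data, accuracy alone can look good while the minority class is poorly detected, which is why the sketch reports F1 and ROC-AUC alongside it.
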
Constraints / expectations
- Use Python and scikit-learn APIs.
- Keep the solution clean and reproducible (e.g., use Pipeline, set random_state, avoid data leakage).
- Explain your choices and how each change is expected to affect the metric.
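
One possible way to meet these expectations while improving on a baseline is sketched below. It is a hedged example under stated assumptions: the data is again synthetic (`make_classification`), F1 is assumed to be the target metric, and the parameter grid and names such as `pipe`, `search`, and `best_threshold` are illustrative. Scaling lives inside the `Pipeline` so it is refit on each training fold, and the decision threshold is chosen from out-of-fold training predictions rather than from the validation set, so the validation score stays an honest estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data standing in for the toy dataset in the task.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing inside the pipeline is fit only on training folds (no leakage).
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=0)),
])

# Expected effects: C trades bias against variance; class_weight="balanced"
# typically raises minority-class recall, often at some cost in precision.
param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_train, y_train)
best_model = search.best_estimator_  # refit on the full training set

# Validation F1 at the default 0.5 threshold.
f1_default = f1_score(y_valid, best_model.predict(X_valid))

# Pick the threshold on out-of-fold training probabilities, not on X_valid.
oof_proba = cross_val_predict(
    best_model, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)
# precision/recall carry one extra trailing point with no threshold; drop it.
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1_curve = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1_curve)]

# Apply the chosen threshold once to the validation set.
valid_proba = best_model.predict_proba(X_valid)[:, 1]
y_pred_tuned = (valid_proba >= best_threshold).astype(int)
f1_tuned = f1_score(y_valid, y_pred_tuned)

print(search.best_params_)
print({"f1_default": f1_default, "f1_tuned": f1_tuned,
       "threshold": float(best_threshold)})
```

Neither change is guaranteed to help on a given dataset. Comparing `f1_default` and `f1_tuned` against the baseline shows whether each step moved the metric in the expected direction.
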