Train and improve a scikit-learn binary classifier
Company: Perplexity
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
## Practical ML fundamentals (Python + scikit-learn)
You are given a small **toy binary-classification dataset** (e.g., arrays/dataframes `X_train`, `y_train`, `X_valid`, `y_valid`, or a single dataset you must split). Your task is to:
1. **Train a baseline binary classifier** using **scikit-learn**.
   - Choose a reasonable model (e.g., logistic regression, linear SVM, random forest, gradient boosting).
   - Fit it on the training set.
2. **Evaluate the model** on the validation set using one or more evaluation metrics.
   - Common choices: accuracy, precision/recall, F1, ROC-AUC, PR-AUC, confusion matrix.
3. After you see the initial metric(s), **improve the evaluation metric(s)**.
   - You may change the model, tune hyperparameters, adjust preprocessing, address class imbalance, change decision thresholds, or revise the validation approach.
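
The three steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the dataset is a synthetic stand-in generated with `make_classification` (an assumption, since the question does not specify the data), logistic regression is just one of the suggested baselines, and F1 is picked arbitrarily as the metric to improve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the toy dataset: imbalanced, roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: baseline. The scaler sits inside the pipeline, so its statistics
# are learned from the training split only.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Step 2: evaluate with a threshold-dependent metric (F1) and a ranking
# metric (ROC-AUC) computed from predicted probabilities.
proba_valid = baseline.predict_proba(X_valid)[:, 1]
baseline_f1 = f1_score(y_valid, baseline.predict(X_valid))
baseline_auc = roc_auc_score(y_valid, proba_valid)

# Step 3a: address class imbalance by reweighting the loss.
weighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
weighted.fit(X_train, y_train)
weighted_f1 = f1_score(y_valid, weighted.predict(X_valid))

# Step 3b: keep the baseline model but move the decision threshold.
# precision[:-1] and recall[:-1] line up with the returned thresholds.
precision, recall, thresholds = precision_recall_curve(y_valid, proba_valid)
denom = np.maximum(precision[:-1] + recall[:-1], 1e-12)  # avoid 0/0
f1_per_threshold = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1_per_threshold)]
tuned_f1 = f1_score(y_valid, (proba_valid >= best_threshold).astype(int))

print(f"baseline  F1={baseline_f1:.3f}  ROC-AUC={baseline_auc:.3f}")
print(f"weighted  F1={weighted_f1:.3f}")
print(f"threshold={best_threshold:.3f}  F1={tuned_f1:.3f}")
```

Note that the threshold here is selected and scored on the same validation split, so the tuned F1 is optimistic. In an interview setting it is worth saying that the threshold should be chosen on a separate fold (or via cross-validation) before reporting a final number.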
### Constraints / expectations
- Use **Python** and **scikit-learn** APIs.
- Keep the solution clean and reproducible (e.g., use `Pipeline`, set `random_state`, avoid data leakage).
- Explain your choices and how each change is expected to affect the metric.
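
One way to meet these expectations is to wrap preprocessing and the model in a single `Pipeline` and tune it with cross-validation, so that the scaler is refit inside each fold and never sees held-out rows. The sketch below makes several assumptions: the helper name `tune_logistic_regression`, the parameter grid, the `average_precision` (PR-AUC) scoring choice, and the synthetic data are all illustrative rather than part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # one seed reused everywhere so reruns give the same result


def tune_logistic_regression(X, y, seed=SEED):
    """Hypothetical helper: grid-search a scaler + logistic regression pipeline."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Step-name prefixes ("clf__") route each parameter to the right step.
    param_grid = {
        "clf__C": [0.01, 0.1, 1.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    }
    # Stratified folds keep the class ratio stable; a fixed random_state
    # makes the fold assignment reproducible.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=cv)
    return search.fit(X, y)


# Synthetic stand-in for the training data (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=SEED,
)
search = tune_logistic_regression(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the whole pipeline is the estimator passed to `GridSearchCV`, each candidate is scored on folds the scaler was not fit on, which is the main leakage risk in this kind of exercise. A final held-out set, untouched during tuning, would still be needed for an unbiased estimate.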
Quick Answer: This question evaluates the ability to train, evaluate, and iteratively improve a scikit-learn binary classifier, covering model selection, preprocessing, validation practices, class-imbalance handling, and the interpretation of performance metrics.