End-to-End Binary Classifier Workflow (EDA → Modeling → Fairness → Report)
You are given a labeled tabular dataset and asked to implement a reproducible, end-to-end workflow in Python to analyze the data and train a classifier suitable for deployment.
Assumptions (adapt as needed):
- Input: a CSV file with a binary target column (e.g., target ∈ {0,1}).
- Optional columns: a timestamp column for time-based splits; group columns for fairness checks; an ID column to drop.
- Output: code, metrics, saved model artifact, and a concise text report with a recommended model and expected performance.
Requirements:
- Data access and schema validation
  - Load data; verify required columns exist; basic type checks and duplicate rows/IDs.
  - Summarize numeric/categorical feature counts and missingness.
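A minimal sketch of the loading and schema checks, assuming the CSV has a binary `target` column and an optional `id` column (all names here are placeholders):

```python
import pandas as pd

REQUIRED_COLUMNS = {"target"}  # placeholder; extend with the columns your schema requires

def load_and_validate(path: str, id_col: str = "id") -> pd.DataFrame:
    df = pd.read_csv(path)

    # Schema check: required columns present and a binary target.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if not set(df["target"].dropna().unique()) <= {0, 1}:
        raise ValueError("target must be binary (0/1)")

    # Duplicate IDs and exact duplicate rows.
    if id_col in df.columns:
        n_dup_ids = int(df[id_col].duplicated().sum())
        if n_dup_ids:
            print(f"warning: {n_dup_ids} duplicate IDs")
        df = df.drop(columns=[id_col])
    n_dup_rows = int(df.duplicated().sum())
    if n_dup_rows:
        print(f"warning: {n_dup_rows} exact duplicate rows")
    return df
```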
- Exploratory data analysis (EDA)
  - Missing values: counts, percentages, imputation plan.
  - Target leakage checks: suspicious feature names, extremely high target correlation/MI.
  - Class imbalance: distribution and imbalance ratio.
  - Feature distributions: univariate summaries (hist/value counts) and basic outlier flags.
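A sketch of the EDA pieces that feed later decisions: a missingness table, the imbalance ratio, and a crude mutual-information leakage screen (the 0.5 MI cutoff is an arbitrary placeholder):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def eda_summary(df: pd.DataFrame, target: str = "target"):
    # Missingness: counts and percentages per column.
    missing = df.isna().sum().to_frame("n_missing")
    missing["pct_missing"] = 100 * missing["n_missing"] / len(df)

    # Class imbalance ratio (majority count / minority count).
    counts = df[target].value_counts()
    imbalance_ratio = counts.max() / counts.min()

    # Crude leakage screen on numeric features: very high mutual information is suspicious.
    numeric = df.select_dtypes(include=np.number).drop(columns=[target])
    mi = mutual_info_classif(numeric.fillna(numeric.median()), df[target], random_state=0)
    leakage_suspects = numeric.columns[mi > 0.5].tolist()

    return missing, imbalance_ratio, leakage_suspects
```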
- Splitting strategy
  - If a timestamp is present: time-based split (train/validation/test in chronological order).
  - Else: stratified split to preserve the class ratio.
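A sketch of the split logic covering both cases; column names, split fractions, and the seed are placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df: pd.DataFrame, target: str = "target", ts_col: str | None = None,
               val_size: float = 0.2, test_size: float = 0.2, seed: int = 42):
    if ts_col and ts_col in df.columns:
        # Time-based split: oldest rows train, newest rows test.
        df = df.sort_values(ts_col)
        n = len(df)
        n_train = int(n * (1 - val_size - test_size))
        n_val = int(n * (1 - test_size))
        return df.iloc[:n_train], df.iloc[n_train:n_val], df.iloc[n_val:]

    # Stratified split preserves the class ratio in every partition.
    train, test = train_test_split(df, test_size=test_size, stratify=df[target], random_state=seed)
    train, val = train_test_split(train, test_size=val_size / (1 - test_size),
                                  stratify=train[target], random_state=seed)
    return train, val, test
```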
- Baselines
  - Majority-class baseline and a simple model baseline (e.g., Logistic Regression with minimal tuning).
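A sketch of the two baselines, assuming already-encoded numeric feature matrices (`X_train`, `X_val`) and label vectors; a majority-class dummy scores ROC-AUC ≈ 0.5 and sets the floor any real model must beat:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def run_baselines(X_train, y_train, X_val, y_val):
    majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    logreg = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
    return {
        "majority_roc_auc": roc_auc_score(y_val, majority.predict_proba(X_val)[:, 1]),
        "logreg_roc_auc": roc_auc_score(y_val, logreg.predict_proba(X_val)[:, 1]),
    }
```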
- Preprocessing
  - Numeric: impute (median), scale (standard).
  - Categorical: impute (most frequent), one-hot encode (handle_unknown="ignore"); consider rare-category handling.
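A sketch of this preprocessing as a scikit-learn ColumnTransformer; the min_frequency rare-category grouping assumes scikit-learn ≥ 1.1, and the threshold of 10 is a placeholder:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numeric_cols: list[str], categorical_cols: list[str]) -> ColumnTransformer:
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # handle_unknown="ignore" covers categories unseen at fit time;
        # min_frequency groups rare categories into an infrequent bucket.
        ("onehot", OneHotEncoder(handle_unknown="ignore", min_frequency=10)),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
```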
- Imbalance handling
  - Use class weights and/or sample weighting; optionally resampling (SMOTE/undersampling) if justified.
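A minimal sketch of the weighting route (resampling such as SMOTE would come from imbalanced-learn and is omitted here); `X_train`/`y_train` are assumed to come from the split step:

```python
from sklearn.utils.class_weight import compute_sample_weight

def balanced_sample_weight(y_train):
    # "balanced" weights are inversely proportional to class frequencies,
    # so minority-class errors count more during training.
    return compute_sample_weight(class_weight="balanced", y=y_train)

# Usage: linear models can take class_weight="balanced" directly; boosting and most
# other estimators accept sample weights in fit(), e.g.
#   model.fit(X_train, y_train, sample_weight=balanced_sample_weight(y_train))
```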
- Model training
  - Train at least two model families (e.g., Logistic Regression and Gradient Boosting).
  - Use cross-validation with hyperparameter tuning (RandomizedSearchCV or equivalent).
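A sketch of the tuning loop over two model families, reusing the preprocessor above inside a single Pipeline so preprocessing is fit only on training folds; the search spaces and n_iter are placeholders:

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

def tune_models(preprocessor, X_train, y_train, seed: int = 42) -> dict:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = {
        "logreg": (
            LogisticRegression(max_iter=1000, class_weight="balanced"),
            {"clf__C": loguniform(1e-3, 1e2)},
        ),
        "gboost": (
            GradientBoostingClassifier(random_state=seed),
            {"clf__learning_rate": loguniform(1e-2, 3e-1),
             "clf__max_depth": randint(2, 6),
             "clf__n_estimators": randint(100, 500)},
        ),
    }
    results = {}
    for name, (estimator, space) in candidates.items():
        pipe = Pipeline([("prep", preprocessor), ("clf", estimator)])
        search = RandomizedSearchCV(pipe, space, n_iter=20, scoring="average_precision",
                                    cv=cv, random_state=seed, n_jobs=-1)
        results[name] = search.fit(X_train, y_train)
    return results
```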
- Metrics aligned to business goal
  - Compute ROC-AUC and PR-AUC; report F1/precision/recall at a chosen threshold.
  - If FP/FN costs are provided, include a cost-sensitive evaluation.
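A sketch of the threshold-free and threshold-based metrics plus a simple expected-cost summary; the 0.5 threshold and unit costs are placeholders to be replaced by business inputs:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

def evaluate(y_true, proba, threshold: float = 0.5,
             fp_cost: float = 1.0, fn_cost: float = 1.0) -> dict:
    pred = (np.asarray(proba) >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, pred, average="binary", zero_division=0)
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    return {
        "roc_auc": roc_auc_score(y_true, proba),
        "pr_auc": average_precision_score(y_true, proba),   # PR-AUC via average precision
        "precision": precision, "recall": recall, "f1": f1,
        "expected_cost": fp * fp_cost + fn * fn_cost,        # cost-sensitive view when costs are known
    }
```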
- Calibration (if needed)
  - Assess calibration; calibrate probabilities (Platt scaling or isotonic regression) if poorly calibrated.
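A sketch of the calibration check, using the Brier score as a rough trigger; the 0.15 cutoff and the use of a dedicated held-out calibration set are assumptions, and in practice the reliability curve should be inspected as well:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss

def maybe_calibrate(model, X_cal, y_cal, brier_cutoff: float = 0.15):
    proba = model.predict_proba(X_cal)[:, 1]
    brier = brier_score_loss(y_cal, proba)
    # frac_positive vs mean_predicted can be plotted as a reliability diagram.
    frac_positive, mean_predicted = calibration_curve(y_cal, proba, n_bins=10)

    if brier > brier_cutoff:
        # Isotonic calibration; use method="sigmoid" (Platt) for small calibration sets.
        calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
        return calibrated.fit(X_cal, y_cal), brier
    return model, brier
```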
- Error analysis
  - Confusion matrix and per-slice analysis (e.g., by key categorical features or numeric bins).
  - Feature importances (tree-based) and/or coefficients (linear); optionally SHAP.
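A sketch of per-slice error analysis plus a model-agnostic importance check; it assumes validation-set pandas objects with aligned indexes, and `slice_col` naming whichever column is worth slicing on:

```python
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.metrics import precision_score, recall_score

def slice_metrics(df_val: pd.DataFrame, y_true: pd.Series, y_pred: pd.Series,
                  slice_col: str) -> pd.DataFrame:
    rows = []
    for value, idx in df_val.groupby(slice_col).groups.items():
        rows.append({
            "slice": value,
            "n": len(idx),
            "precision": precision_score(y_true.loc[idx], y_pred.loc[idx], zero_division=0),
            "recall": recall_score(y_true.loc[idx], y_pred.loc[idx], zero_division=0),
        })
    return pd.DataFrame(rows)

# Permutation importance works for both model families (linear and boosted):
# result = permutation_importance(fitted_pipeline, X_val, y_val, n_repeats=5, random_state=0)
```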
- Fairness checks
  - Report metrics by key groups; highlight disparities (e.g., demographic parity, equal opportunity).
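A minimal sketch of the per-group rates behind the two criteria named above: selection rate per group (demographic parity) and true-positive rate per group (equal opportunity); dedicated tooling such as Fairlearn could replace this:

```python
import pandas as pd

def fairness_report(y_true: pd.Series, y_pred: pd.Series, groups: pd.Series) -> pd.DataFrame:
    rows = []
    for g in groups.dropna().unique():
        mask = groups == g
        positives = y_true[mask] == 1
        tp = int(((y_pred[mask] == 1) & positives).sum())
        p = int(positives.sum())
        rows.append({
            "group": g,
            "n": int(mask.sum()),
            "selection_rate": float((y_pred[mask] == 1).mean()),  # demographic parity compares these
            "tpr": tp / p if p else float("nan"),                  # equal opportunity compares these
        })
    return pd.DataFrame(rows)
```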
- Reproducibility and report
  - Save the fitted pipeline, metrics JSON, and environment info.
  - Produce a concise recommendation: chosen model, expected performance, and deployment notes.
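A sketch of the artifact-saving step with joblib plus a small environment record; file and directory names are placeholders:

```python
import json
import platform
from pathlib import Path

import joblib
import sklearn

def save_artifacts(fitted_pipeline, metrics: dict, out_dir: str = "artifacts") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(fitted_pipeline, out / "model.joblib")  # fitted preprocessing + model pipeline
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2))
    (out / "environment.json").write_text(json.dumps({
        "python": platform.python_version(),
        "scikit_learn": sklearn.__version__,
    }, indent=2))
```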
Deliverables:
- Python code implementing the above.
- Saved model artifact and metrics.
- Short written recommendation with expected performance and guardrails.