Build a Reproducible End-to-End Binary Classification Pipeline for a CSV Dataset
Context
You are given a single CSV file that fits in memory. The dataset contains:
- Mixed numeric and categorical feature columns
- A binary label/target column
- Optional ID columns to drop
Your task is to implement a modular, reproducible, and command-line runnable training pipeline that prevents data leakage and produces a test-set report. Use Python with common ML tooling.
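A minimal sketch of what "modular and command-line runnable" could look like is shown below; the file name `train.py`, the flag names, and the default seed are illustrative assumptions, not part of the task:

```python
# train.py -- illustrative entry-point skeleton; file, flag, and function names are assumptions
import argparse
import random

import numpy as np
import pandas as pd


def set_seed(seed: int) -> None:
    # Seed every random number generator the pipeline touches so runs repeat exactly.
    random.seed(seed)
    np.random.seed(seed)


def main() -> None:
    parser = argparse.ArgumentParser(description="Train a binary classifier on a CSV dataset.")
    parser.add_argument("--csv", required=True, help="Path to the input CSV file")
    parser.add_argument("--label", required=True, help="Name of the binary label column")
    parser.add_argument("--drop-cols", nargs="*", default=[], help="Optional ID columns to drop")
    parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
    args = parser.parse_args()

    set_seed(args.seed)
    df = pd.read_csv(args.csv)
    df = df.drop(columns=args.drop_cols, errors="ignore")
    # Schema validation, splitting, training, tuning, and reporting would follow here.


if __name__ == "__main__":
    main()
```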
Requirements
- Load data from CSV
- Validate basic schema (label present and binary, dtypes make sense, optional ID columns dropped)
- Handle missing values appropriately for numeric and categorical features
- Split into train/validation/test with stratification on the label (see the split sketch after this list)
- Encode categorical variables and scale/normalize numeric features
- Train a reasonable baseline classifier
- Choose an appropriate loss and optimizer (justify the selection)
- Track metrics (at least AUC and accuracy)
- Prevent leakage by ensuring preprocessing is fit only on training folds within a pipeline (see the pipeline sketch below)
- Evaluate on the validation set (see the metrics sketch below)
- Perform simple hyperparameter tuning (e.g., grid or random search)
- Retrain the best model on train+validation and generate a test-set report (see the tuning sketch below)
- Make the pipeline reproducible (random seeds, configuration)
- Structure the code to be modular and runnable from the command line
- Explain how you would detect and fix common bugs (e.g., an optimizer that is not stepping) and how to monitor training and overfitting
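The split requirement can be met with two chained `train_test_split` calls. A minimal sketch, assuming `df` from the loading step above, an assumed label column name, and an illustrative 60/20/20 ratio:

```python
from sklearn.model_selection import train_test_split

seed = 42                 # would come from the CLI/config in practice
label_col = "target"      # assumed label column name; supplied via --label in practice
X = df.drop(columns=[label_col])
y = df[label_col]

# Carve off the test set first, then split the remainder into train/validation.
# stratify=y keeps the class ratio roughly constant across all three sets.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=seed
)
# 0.25 of the remaining 80% is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=seed
)
```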
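For missing values, encoding, scaling, the baseline model, and leakage prevention, one common shape is a scikit-learn `Pipeline` wrapping a `ColumnTransformer`, so every preprocessing step is fit only on the data the pipeline itself is fit on. Selecting columns by dtype and using logistic regression (whose log-loss objective and default LBFGS solver are an easily justified pairing for a probabilistic binary baseline) are choices assumed here for illustration, not mandated by the task:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = X_train.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X_train.columns if c not in numeric_cols]

# Numeric features: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: most-frequent imputation, then one-hot encoding.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Because preprocessing lives inside the pipeline, cross-validation and grid
# search refit the imputers/scaler/encoder on each training fold only,
# which is exactly what prevents leakage.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
])
model.fit(X_train, y_train)
```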
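Metric tracking, validation-set evaluation, and a basic overfitting check can share one sketch: compute AUC and accuracy on both training and validation data and watch the gap. With a gradient-based framework you would additionally confirm that the loss decreases and that parameters actually change between steps (the classic symptom of an optimizer that is not stepping); with scikit-learn the train/validation gap is the main monitor. Assumes the fitted `model` and the splits from the sketches above:

```python
from sklearn.metrics import accuracy_score, roc_auc_score


def report_metrics(name, X_part, y_part):
    # AUC uses probability scores; accuracy uses hard class predictions.
    proba = model.predict_proba(X_part)[:, 1]
    pred = model.predict(X_part)
    auc = roc_auc_score(y_part, proba)
    acc = accuracy_score(y_part, pred)
    print(f"{name:<10s} AUC={auc:.4f}  accuracy={acc:.4f}")
    return auc


train_auc = report_metrics("train", X_train, y_train)
val_auc = report_metrics("validation", X_val, y_val)

# A large gap (train AUC near 1.0, validation AUC much lower) suggests
# overfitting; consider stronger regularization or simpler features.
print(f"train/validation AUC gap: {train_auc - val_auc:.4f}")
```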
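Hyperparameter tuning, retraining on train+validation, and the test-set report could then look like the sketch below. The grid over the regularization strength `C` and the use of `GridSearchCV` (which refits the whole pipeline inside each fold) are assumptions; random search would work the same way:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV

# Parameters are addressed through the pipeline step name ("clf").
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# Refit the best configuration on train + validation, then touch the test set
# exactly once for the final report.
best = search.best_estimator_
best.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

test_proba = best.predict_proba(X_test)[:, 1]
test_pred = best.predict(X_test)
print(f"test AUC:      {roc_auc_score(y_test, test_proba):.4f}")
print(f"test accuracy: {accuracy_score(y_test, test_pred):.4f}")
print(classification_report(y_test, test_pred))
```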
Assumptions
- The CSV fits in memory
- The label column name is provided via the CLI
- If the label is not numeric 0/1, you may encode it to 0/1 (see the sketch after this list)
- Use Python, pandas, and scikit-learn; keep dependencies minimal
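If the label arrives as something other than 0/1 (e.g., "yes"/"no"), a small validation-and-encoding helper is enough. This sketch, including the function name and the sorted-order mapping, is one assumed way to satisfy the requirement, not the only one:

```python
import pandas as pd


def validate_and_encode_label(df: pd.DataFrame, label_col: str):
    # Fail fast if the label column is missing or not binary.
    if label_col not in df.columns:
        raise ValueError(f"label column {label_col!r} not found in CSV")
    values = df[label_col].dropna().unique()
    if len(values) != 2:
        raise ValueError(f"label column must be binary, found values: {list(values)}")
    # Map the two observed values to 0/1 deterministically (sorted order)
    # so repeated runs produce the same encoding.
    mapping = {v: i for i, v in enumerate(sorted(values))}
    df[label_col] = df[label_col].map(mapping)
    return df, mapping
```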