End-to-End Tabular Classification Pipeline (Python)
Context
You are given a tabular dataset in a CSV file and asked to build an end-to-end machine learning pipeline for a classification problem. Assume the dataset contains a column named target that holds the class label (binary by default). You may extend the pipeline to multiclass targets if desired.
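If the multiclass extension is attempted, the main change on the evaluation side is that precision, recall, and ROC-AUC need an averaging strategy. The sketch below is one possible way to do that with scikit-learn, assuming macro averaging and a one-vs-rest ROC-AUC; the function name multiclass_report is hypothetical and not part of the task statement.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)


def multiclass_report(y_true, y_pred, y_proba):
    """Macro-averaged metrics for a multiclass target.

    y_proba is assumed to have one column per class, ordered by sorted
    class label, with each row summing to 1 (as predict_proba returns).
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # average="macro" weights every class equally (an assumed choice)
        "precision": precision_score(y_true, y_pred, average="macro",
                                     zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro",
                               zero_division=0),
        # one-vs-rest ROC-AUC, averaged over classes
        "roc_auc": roc_auc_score(y_true, y_proba, multi_class="ovr",
                                 average="macro"),
    }
```

Macro averaging treats rare and common classes alike; weighted averaging may be a better fit when class frequencies should count.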
Requirements
- Load the data from CSV.
- Create stratified train/validation/test splits (e.g., 60/20/20).
- Handle missing values and encode categorical features.
- Standardize numeric features.
- Train a simple baseline model (e.g., Logistic Regression) and at least one stronger model (e.g., Gradient Boosting or a small neural network).
- Tune key hyperparameters with cross-validation.
- Report accuracy, precision, recall, and ROC-AUC on validation and test sets.
- Persist the trained model and preprocessing steps.
- Implement batch inference via a predict(input_csv_path, output_csv_path) function or CLI.
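The requirements above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: the target is assumed to be encoded as 0/1, non-numeric columns are treated as categorical, and the helper names (split_data, make_preprocessor, tune, report, train), the hyperparameter grids, the imputation strategies, the use of ROC-AUC for model selection, and the extra model_path parameter on predict are all illustrative choices.

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

TARGET = "target"  # column name given in the task statement
SEED = 42          # assumption: any fixed seed, for reproducibility


def split_data(df):
    """Stratified 60/20/20 train/validation/test split."""
    X, y = df.drop(columns=[TARGET]), df[TARGET]
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=SEED)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=SEED)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def make_preprocessor(X):
    """Impute and scale numeric columns; impute and one-hot encode the rest."""
    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore"))])
    return ColumnTransformer([("num", numeric, num_cols),
                              ("cat", categorical, cat_cols)])


def tune(model, grid, X_train, y_train):
    """5-fold grid search over a preprocessing + model pipeline."""
    pipe = Pipeline([("prep", make_preprocessor(X_train)), ("model", model)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
    return search.fit(X_train, y_train).best_estimator_


def report(pipe, X, y):
    """Accuracy, precision, recall and ROC-AUC for a 0/1 target."""
    pred, proba = pipe.predict(X), pipe.predict_proba(X)[:, 1]
    return {"accuracy": accuracy_score(y, pred),
            "precision": precision_score(y, pred, zero_division=0),
            "recall": recall_score(y, pred, zero_division=0),
            "roc_auc": roc_auc_score(y, proba)}


def train(csv_path, model_path):
    """Tune both models, pick the better one on validation, persist it."""
    train_set, val_set, test_set = split_data(pd.read_csv(csv_path))
    candidates = {
        "logreg": (LogisticRegression(max_iter=1000),
                   {"model__C": [0.1, 1.0, 10.0]}),
        "gboost": (GradientBoostingClassifier(random_state=SEED),
                   {"model__n_estimators": [50, 100],
                    "model__max_depth": [2, 3]}),
    }
    fitted = {name: tune(model, grid, *train_set)
              for name, (model, grid) in candidates.items()}
    val_scores = {name: report(pipe, *val_set) for name, pipe in fitted.items()}
    best = max(val_scores, key=lambda name: val_scores[name]["roc_auc"])
    # The saved object is the whole pipeline, so preprocessing travels with it.
    joblib.dump(fitted[best], model_path)
    return {"best": best, "validation": val_scores,
            "test": report(fitted[best], *test_set)}


def predict(input_csv_path, output_csv_path, model_path="model.joblib"):
    """Batch inference: score each input row and write the result as CSV."""
    pipe = joblib.load(model_path)
    df = pd.read_csv(input_csv_path)
    X = df.drop(columns=[TARGET], errors="ignore")  # target may be absent
    out = df.assign(prediction=pipe.predict(X),
                    probability=pipe.predict_proba(X)[:, 1])
    out.to_csv(output_csv_path, index=False)
    return out
```

With hypothetical file names, a run might look like train("data.csv", "model.joblib") followed by predict("new_rows.csv", "scored.csv"). Wrapping preprocessing and model in a single Pipeline means the imputer, scaler, and encoder are fitted only on training folds during cross-validation, which avoids leaking validation statistics into training.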
If you choose a neural network, include a correct training loop with optimizer initialization, forward pass, loss computation, backward pass, and optimizer step.
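One way such a training loop could look in PyTorch is sketched below. The framework choice, the function name train_mlp, the network shape, and all default hyperparameters are assumptions for illustration; the inputs are assumed to be already preprocessed numeric features with a 0/1 target.

```python
import torch
from torch import nn


def train_mlp(X, y, hidden=16, epochs=50, lr=1e-2, batch_size=32, seed=0):
    """Train a small MLP for binary classification.

    Returns the trained model and the mean training loss per epoch.
    """
    torch.manual_seed(seed)
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32).reshape(-1, 1)

    model = nn.Sequential(nn.Linear(X.shape[1], hidden),
                          nn.ReLU(),
                          nn.Linear(hidden, 1))
    loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits, not probabilities
    # Optimizer initialization: it must be given the model's parameters.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    losses = []
    for _ in range(epochs):
        model.train()
        perm = torch.randperm(X.shape[0])  # reshuffle rows every epoch
        epoch_loss = 0.0
        for start in range(0, X.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()           # clear gradients from the last step
            logits = model(X[idx])          # forward pass
            loss = loss_fn(logits, y[idx])  # loss computation
            loss.backward()                 # backward pass
            optimizer.step()                # optimizer step
            epoch_loss += loss.item() * len(idx)
        losses.append(epoch_loss / X.shape[0])
    return model, losses
```

Calling optimizer.zero_grad() before each backward pass matters because PyTorch accumulates gradients by default; omitting it is a common source of silently wrong training. For evaluation, switch to model.eval() and wrap the forward pass in torch.no_grad().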
Deliverables
- Clear, well-structured Python code (preferably using scikit-learn for classical models) with docstrings/comments.
- A short explanation of design choices and how you would productionize this pipeline.