End-to-End Tabular Classification Workflow in Google Colab
You are asked to design and implement a complete classification workflow for a tabular dataset in Google Colab.
Include the following:
- Data loading and basic setup (Colab specifics, package installs, reproducibility seed).
- Exploratory Data Analysis (EDA): schema, missingness, target distribution, and quick sanity checks.
- Feature preprocessing: handling missing values, scaling numeric features, encoding categoricals, handling rare categories, and guarding against leakage.
- Data splitting strategy: train/validation/test with stratification; justify your choices (e.g., time-based splits if time features exist).
- Baselines and model selection: build a naive baseline and a simple linear model, then consider stronger non-linear models. Discuss algorithm trade-offs.
- Cross-validation and hyperparameter tuning: use an appropriate CV strategy (e.g., StratifiedKFold), choose a scoring metric, and tune hyperparameters.
- Class imbalance: diagnose and mitigate it (class weights, resampling such as SMOTE, thresholding strategies). Explain when and why to use each.
- Evaluation: select and justify metrics (accuracy, precision/recall, F1, ROC-AUC, PR-AUC); show threshold selection for operational goals.
- Confidence intervals: report uncertainty for key metrics using a sound method (e.g., the bootstrap).
- Leakage prevention: show how your pipeline avoids leakage across preprocessing, resampling, tuning, and evaluation.
- Interpretation and iteration: interpret the model (feature importance, coefficients, permutation importance), perform error analysis, and outline next iteration steps.
Provide code or clear pseudocode illustrating the structure and key steps. Explain trade-offs and how you would interpret results and iterate.