Take‑Home: Two End‑to‑End ML Workflows on Tabular Data
Objective
Design and implement two complete machine learning workflows on tabular data (typical of common Kaggle datasets):
- Regression: predict a continuous target.
- Classification: predict a binary or multiclass label.
Assume you have a generic CSV dataset with a mix of numeric and categorical features and a clear target column. If the data are time‑ordered, note time‑series‑specific caveats.
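The main time-series caveat is that validation rows must come strictly after the rows used for training; a random shuffle would let the model see the future. A minimal, dependency-free sketch of forward chaining follows. The helper name `forward_chaining_splits` and the block-sizing rule are assumptions made for illustration, and rows are assumed to be sorted by time already. scikit-learn's `TimeSeriesSplit` provides a comparable expanding-window splitter.

```python
from typing import Iterator, List, Tuple


def forward_chaining_splits(
    n_samples: int, n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs with an expanding training window.

    Hypothetical helper: rows are assumed to be in chronological order, and
    each validation block starts right after the last training row.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)  # size of each validation block
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The last fold absorbs any leftover rows at the end of the series.
        valid_end = block * (k + 1) if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, valid_end))


splits = list(forward_chaining_splits(n_samples=10, n_folds=3))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
```

Every fold trains only on the past, so the validation score approximates how the model would behave when deployed on future rows.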
Requirements (for each task)
- Data cleaning and preprocessing
  - Detect and handle missing values.
  - Detect and handle outliers.
  - Encode categorical features appropriately.
  - Scale features where appropriate.
- Train/validation/test protocol
  - Random shuffling and splits that avoid leakage.
  - If time‑series or grouped data, use proper split strategies (e.g., forward chaining, GroupKFold).
- Models
  - A simple baseline (e.g., dummy predictor or regularized linear model).
  - At least one stronger model (e.g., tree‑based, boosted trees).
- Evaluation
  - Regression: RMSE/MAE (and why).
  - Classification: accuracy, ROC‑AUC, F1 (and why). Use PR‑AUC for heavy class imbalance.
- Model selection
  - Cross‑validation strategy and hyperparameter tuning.
- Reproducibility
  - Random seeds, environment pinning, data versioning. Persist splits, models, and configs.
- Interpretability and reliability
  - Feature importance and partial dependence (or SHAP if available).
  - Calibration checks for classification.
- Deliverables
  - Pseudocode or code‑level steps for both workflows.
  - Discussion of expected pitfalls and how you would debug underperformance.
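As one possible shape for the regression workflow, the sketch below wires imputation, scaling, and one-hot encoding into a scikit-learn `Pipeline`, so that all preprocessing is fitted on the training split only, and then compares a dummy baseline, a ridge model, and a random forest on RMSE and MAE. The synthetic DataFrame, the column names (`area`, `age`, `zone`, `price`), and the model settings are assumptions that stand in for a real CSV; outlier handling and hyperparameter tuning are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 0
rng = np.random.default_rng(SEED)

# Synthetic stand-in for a CSV (hypothetical columns): two numeric, one categorical.
n = 400
df = pd.DataFrame({
    "area": rng.normal(100, 20, n),
    "age": rng.integers(0, 50, n).astype(float),
    "zone": rng.choice(["a", "b", "c"], n),
})
zone_effect = df["zone"].map({"a": 0.0, "b": 20.0, "c": 40.0})
df["price"] = 3.0 * df["area"] - 1.5 * df["age"] + zone_effect + rng.normal(0, 10, n)
# Inject missing values into one numeric feature after the target is computed.
df.loc[rng.choice(n, 30, replace=False), "area"] = np.nan

numeric, categorical, target = ["area", "age"], ["zone"], "price"
X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df[target], test_size=0.2, random_state=SEED
)


def make_preprocessor() -> ColumnTransformer:
    """Median-impute and scale numeric columns; one-hot encode categoricals."""
    num = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    cat = OneHotEncoder(handle_unknown="ignore")  # unseen categories become all zeros
    return ColumnTransformer([("num", num, numeric), ("cat", cat, categorical)])


models = {
    "dummy_mean": DummyRegressor(strategy="mean"),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=SEED),
}

results = {}
for name, model in models.items():
    # Preprocessing lives inside the pipeline, so it is fitted on X_train only.
    pipe = Pipeline([("prep", make_preprocessor()), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }
    print(name, results[name])
```

Reporting both metrics is informative because RMSE penalizes large errors more heavily than MAE; a wide gap between the two usually points at a few badly predicted rows, which is a natural place to start looking for outliers.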
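The classification workflow could follow the same pattern with stratified splits, a small hyperparameter search, and probability-based metrics. The sketch below assumes an imbalanced binary problem generated with `make_classification` (roughly 10% positives) in place of real data; the parameter grids, the 0.5 decision threshold, and the choice of average precision (a PR-AUC estimate) as the tuning metric are illustrative assumptions, not prescribed settings.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0
# Synthetic, imbalanced stand-in for a real dataset (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=SEED,
)
# Stratify so the held-out test set keeps the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
candidates = {
    "logreg": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "boosted_trees": (
        GradientBoostingClassifier(random_state=SEED),
        {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
    ),
}

report = {}
for name, (estimator, grid) in candidates.items():
    # Tune on the training folds only; the test set is touched once, afterwards.
    search = GridSearchCV(estimator, grid, scoring="average_precision", cv=cv)
    search.fit(X_train, y_train)
    proba = search.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)
    report[name] = {
        "accuracy": float(accuracy_score(y_test, pred)),
        "f1": float(f1_score(y_test, pred)),
        "roc_auc": float(roc_auc_score(y_test, proba)),
        "pr_auc": float(average_precision_score(y_test, proba)),
        "brier": float(brier_score_loss(y_test, proba)),  # lower is better
        "best_params": search.best_params_,
    }
    # Calibration check: observed positive rate vs. mean predicted probability per bin.
    frac_pos, mean_pred = calibration_curve(
        y_test, proba, n_bins=5, strategy="quantile"
    )
    print(name, report[name])
    print("  calibration (predicted -> observed):",
          list(zip(mean_pred.round(2), frac_pos.round(2))))
```

With about 90% negatives, a classifier that always predicts the majority class already reaches roughly 0.9 accuracy, which is why F1, ROC-AUC, and PR-AUC are reported alongside it.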
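For the reproducibility requirement, one lightweight option is to store the seed, the configuration, a hash of the raw data, and the exact split indices in a single run record. The sketch below uses only the standard library; the helper names (`make_split`, `fingerprint`), the config keys, and the placeholder CSV bytes are all hypothetical. Model persistence and environment pinning are not shown.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def make_split(n_rows: int, test_fraction: float, seed: int) -> dict:
    """Deterministically shuffle row indices and cut off a held-out test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG; global state is untouched
    n_test = int(round(n_rows * test_fraction))
    return {"test": sorted(indices[:n_test]), "train": sorted(indices[n_test:])}


def fingerprint(payload: bytes) -> str:
    """Short content hash of the raw data, used to detect silent data changes."""
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical run configuration and placeholder bytes for the real CSV file.
config = {"seed": 42, "test_fraction": 0.2, "model": "ridge", "alpha": 1.0}
raw_csv = b"x,y\n1,2\n3,4\n"

run_record = {
    "config": config,
    "data_sha256_prefix": fingerprint(raw_csv),
    "split": make_split(
        n_rows=50, test_fraction=config["test_fraction"], seed=config["seed"]
    ),
}

# Persist the record, then reload it as a later run (or a reviewer) would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run_record.json"
    path.write_text(json.dumps(run_record, indent=2, sort_keys=True))
    reloaded = json.loads(path.read_text())

same_split = make_split(50, 0.2, 42) == reloaded["split"]
print("split reproducible:", same_split)
```

Saving the indices themselves, and not only the seed, guards against a library upgrade changing how the shuffle is generated.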