Take‑Home: Build a strong LinearSVC pipeline that beats a baseline and generalizes
Problem
You are given training features X_train and labels y_train and must implement:
- train(X_train, y_train): trains a LinearSVC-based pipeline that beats a provided baseline accuracy, improving only preprocessing, feature engineering, and hyperparameters. It must be reproducible and robust to mixed data types (numeric, categorical, text).
- test(X_test): loads the trained artifact and produces predictions for X_test without any data leakage (a minimal interface sketch follows this list).
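A minimal sketch of the required interface, assuming joblib is available for persisting the trained artifact; the path model.joblib and the placeholder scaler (which assumes all-numeric input) are illustrative, and the full mixed-type preprocessing is sketched under Constraints:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

ARTIFACT_PATH = "model.joblib"  # assumed artifact location

def train(X_train, y_train):
    # The classifier family is fixed to LinearSVC; only preprocessing and
    # hyperparameters may vary. random_state keeps the run reproducible.
    pipe = Pipeline([
        ("scale", StandardScaler()),  # placeholder: assumes all-numeric X
        ("clf", LinearSVC(C=1.0, random_state=42)),
    ])
    pipe.fit(X_train, y_train)
    joblib.dump(pipe, ARTIFACT_PATH)
    return pipe

def test(X_test):
    # Load the persisted pipeline and predict; no refitting on test data,
    # so there is no test-time leakage.
    pipe = joblib.load(ARTIFACT_PATH)
    return pipe.predict(X_test)
```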
Constraints:
- The model class is fixed: LinearSVC. You may not change the final classifier family.
- You may change preprocessing and feature engineering (e.g., standardization/normalization, tokenization + TF-IDF or hashing for text, dimensionality reduction, outlier handling, class-imbalance strategies, feature crosses, data cleaning); a preprocessing sketch follows this list.
- You must explain and implement a validation strategy that tunes hyperparameters without consulting test accuracy once the baseline is exceeded (e.g., k-fold CV, nested CV, or a holdout set) while preventing data leakage.
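One way to satisfy these constraints is a ColumnTransformer that scales numerics, one-hot encodes categoricals, and TF-IDF-vectorizes text, all inside a single Pipeline ending in LinearSVC. This is a sketch, not a prescribed solution; the column lists NUM_COLS, CAT_COLS, and TEXT_COL are hypothetical (the Assumptions section sketches one way to auto-detect them):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC

NUM_COLS = ["age", "income"]   # hypothetical numeric columns
CAT_COLS = ["city"]            # hypothetical low-cardinality columns
TEXT_COL = "description"       # hypothetical free-form text column

preprocess = ColumnTransformer([
    # Impute then scale numerics; LinearSVC is sensitive to feature scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUM_COLS),
    # One-hot categoricals; ignore categories unseen at training time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    # TF-IDF for text; sublinear_tf often helps linear models.
    ("txt", TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)), TEXT_COL),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LinearSVC(class_weight="balanced", random_state=42)),
])
```

Keeping every fitted transformer inside the pipeline is itself the leakage guard: nothing is fit on data outside the current training fold.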
Deliverables:
- Reproducible code for train(X_train, y_train) and test(X_test) using LinearSVC, with robust preprocessing for typical tabular/text tasks.
- A justification of the data-centric improvements you applied.
- A tuning plan that does not consult test accuracy once the baseline is exceeded, with a clear guard against leakage (see the tuning sketch after this list).
- An experiment log that records the baseline, CV scores, chosen hyperparameters, and final training details.
- A plan for measuring out-of-sample generalization.
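A sketch of one such tuning plan, assuming the `model` pipeline from the preprocessing sketch above and the given X_train/y_train. Because all preprocessing lives inside the pipeline, each CV fold fits its own scaler and vectorizer, and the test set is never consulted:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],           # regularization strength
    "prep__txt__ngram_range": [(1, 1), (1, 2)],  # text feature richness
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

# Experiment-log entries: mean CV score and chosen hyperparameters.
print("best CV accuracy:", search.best_score_)
print("best params:", search.best_params_)
```

The CV scores stand in for out-of-sample accuracy during tuning; a held-out split touched only once at the very end can serve as the final generalization measurement.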
Assumptions (for completeness):
- X_train/X_test are pandas DataFrames with a mix of numeric, categorical (string/low-cardinality), and possibly text columns (free-form strings). y_train is a 1D array-like of class labels (binary or multi-class). If your dataset is purely tabular or purely text, the pipeline adapts by auto-detecting column types (one heuristic is sketched below).
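One heuristic for that auto-detection, assuming string columns split into categorical vs. free text by unique-value count; the cardinality threshold of 30 is an illustrative assumption, not a universal rule:

```python
import pandas as pd

def split_columns(X: pd.DataFrame, max_card: int = 30):
    # Partition columns into numeric, categorical, and text groups.
    num_cols, cat_cols, text_cols = [], [], []
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            num_cols.append(col)
        elif X[col].nunique() <= max_card:
            cat_cols.append(col)   # low-cardinality strings -> categorical
        else:
            text_cols.append(col)  # high-cardinality strings -> free text
    return num_cols, cat_cols, text_cols
```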