Task: Train and Evaluate a LinearSVC to Beat a Baseline
Context
You are given a binary or multi-class classification dataset split into train and hidden test sets. An accuracy baseline (e.g., a previous model or a heuristic) is provided. Your goal is to implement train() and test() functions to:
- Train a LinearSVC model that beats the supplied baseline on the hidden test set.
- Use sound validation (e.g., cross-validation) without peeking at the hidden test set.
- Experiment with both model-level and data-level adjustments; document what you tried and which changes helped. Emphasize data-side tweaks where they prove effective.
Assume input features may be numeric (dense or sparse) and/or text. You may implement a flexible preprocessing pipeline or parameterize your functions to handle either case.
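For the mixed-input case, one minimal sketch of such a flexible preprocessing step is shown below; the make_preprocessor helper, its "kind" switch, and the column arguments are illustrative assumptions rather than part of the task.

```python
# Sketch of a flexible preprocessing step. The make_preprocessor helper, its
# "kind" switch, and the column arguments are assumptions for illustration.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


def make_preprocessor(kind, text_column=None, numeric_columns=None):
    """Return a preprocessing transformer for text, numeric, or mixed input."""
    if kind == "text":
        # TF-IDF works directly on an iterable of raw strings.
        return TfidfVectorizer(sublinear_tf=True)
    if kind == "numeric":
        # with_mean=False keeps sparse matrices sparse while scaling.
        return StandardScaler(with_mean=False)
    if kind == "mixed":
        # Route text and numeric columns of a DataFrame to separate transformers.
        return ColumnTransformer([
            ("text", TfidfVectorizer(sublinear_tf=True), text_column),
            ("num", StandardScaler(with_mean=False), numeric_columns),
        ])
    raise ValueError(f"unknown feature kind: {kind}")
```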
Requirements
- Implement train() (a combined sketch of train() and test() follows this list) that:
  - Accepts training data (X_train, y_train) and a baseline_accuracy.
  - Builds a scikit-learn Pipeline with preprocessing and a LinearSVC classifier.
  - Tunes key hyperparameters via cross-validation (report CV scores and best params).
  - Returns a fitted pipeline and training diagnostics (e.g., CV accuracy).
- Implement test() that:
  - Takes the fitted pipeline and (X_test, y_test).
  - Returns test accuracy (and a brief report) to confirm the baseline was beaten.
- Experiments and reporting:
  - Try model-level tweaks (e.g., C, class_weight, feature selection) and data-side tweaks (e.g., scaling, TF-IDF, stop-word removal, deduplication); an illustrative data-side example follows the sketch below.
  - Briefly summarize which changes improved accuracy. Note that data-side tweaks were especially effective.
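As a point of reference, here is a minimal sketch of train() and test() under these requirements, assuming a preprocessor like the one sketched above is passed in; the exact signatures, the hyperparameter grid, and the diagnostics dictionary are assumptions to adapt, not a prescribed interface.

```python
# Minimal sketch of train()/test(). The preprocessor argument, the parameter
# grid, and the returned diagnostics are illustrative assumptions.
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def train(X_train, y_train, baseline_accuracy, preprocessor):
    pipe = Pipeline([
        ("prep", preprocessor),                 # fitted only on training folds
        ("clf", LinearSVC(max_iter=10000)),
    ])
    # Model-level tweaks explored with cross-validation on the training set only.
    param_grid = {
        "clf__C": [0.01, 0.1, 1.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    }
    search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    diagnostics = {
        "cv_accuracy": search.best_score_,
        "best_params": search.best_params_,
        "baseline_accuracy": baseline_accuracy,
        "beats_baseline_on_cv": search.best_score_ > baseline_accuracy,
    }
    return search.best_estimator_, diagnostics


def test(model, X_test, y_test, baseline_accuracy):
    # The hidden test set is evaluated exactly once, here.
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return {
        "test_accuracy": acc,
        "beats_baseline": acc > baseline_accuracy,
        "report": classification_report(y_test, y_pred),
    }
```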
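Data-side tweaks can be compared with the same cross-validation loop; the snippet below (the helper name, the assumption that X is raw text, and the specific TF-IDF variants are all illustrative) shows two typical candidates: deduplicating training rows and toggling TF-IDF options.

```python
# Illustrative data-side tweaks; the helper name and the assumption that X is
# a list of raw text strings are assumptions, not part of the task.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate(X_train, y_train):
    """Drop exact duplicate samples so repeats do not dominate CV folds."""
    df = pd.DataFrame({"x": list(X_train), "y": list(y_train)})
    df = df.drop_duplicates(subset="x")
    return df["x"].tolist(), df["y"].to_numpy()


# TF-IDF variants to swap into the pipeline and compare via cross-validation.
tfidf_variants = {
    "plain": TfidfVectorizer(),
    "stop_words": TfidfVectorizer(stop_words="english"),
    "sublinear_bigrams": TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
}
```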
Constraints
- Classifier must be sklearn.svm.LinearSVC.
- Use accuracy as the metric.
- Avoid data leakage: fit all preprocessing only on training folds.
- Use cross-validation on the training set; touch the hidden test set only once, at the very end.
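One way to satisfy both leakage constraints at once is to keep every preprocessing step inside the Pipeline and score the whole pipeline with cross_val_score, so each fold refits the preprocessor on its own training portion. A brief sketch follows; the toy data is only there to make the snippet self-contained.

```python
# Preprocessing lives inside the Pipeline, so cross_val_score refits it on each
# fold's training portion only; nothing is learned from validation folds.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data so the snippet runs standalone; substitute the real train/test split.
texts = ["good movie", "great film", "bad plot", "terrible acting",
         "loved it", "hated it", "fantastic story", "awful pacing"] * 5
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 5
X_train, X_test = texts[:32], texts[32:]
y_train, y_test = labels[:32], labels[32:]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(max_iter=10000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, scoring="accuracy", cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# The hidden test set is scored exactly once, after all model selection is done.
pipe.fit(X_train, y_train)
print("Test accuracy: %.3f" % pipe.score(X_test, y_test))
```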