This question evaluates a machine learning engineer's competence in building and validating a production-style classification pipeline with scikit-learn's LinearSVC, including preprocessing, hyperparameter tuning, cross-validation, and safeguarding against data leakage.
You are given a binary or multi-class classification dataset split into a training set and a hidden test set. An accuracy baseline (e.g., a previous model or a heuristic) is provided. Your goal is to implement train() and test() functions as described below.
Assume input features may be numeric (dense or sparse) and/or text. You may implement a flexible preprocessing pipeline or parameterize your functions to handle either case.
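One way to parameterize the preprocessing is a small factory that picks the first pipeline step from the feature type. The sketch below is illustrative only: the build_pipeline name and the feature_kind switch are assumptions, not part of the question, and the toy data is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC


def build_pipeline(feature_kind: str) -> Pipeline:
    """Return an unfitted preprocessing + LinearSVC pipeline.

    feature_kind is a hypothetical switch: "numeric" or "text".
    """
    if feature_kind == "numeric":
        # MaxAbsScaler accepts dense and sparse input and preserves sparsity.
        prep = MaxAbsScaler()
    elif feature_kind == "text":
        # TF-IDF turns raw strings into a sparse feature matrix.
        prep = TfidfVectorizer()
    else:
        raise ValueError(f"unknown feature_kind: {feature_kind!r}")
    return Pipeline([("prep", prep), ("clf", LinearSVC(random_state=0))])


# Toy data, made up for illustration.
numeric_model = build_pipeline("numeric").fit(
    [[0.0, 10.0], [1.0, 20.0], [5.0, 90.0], [6.0, 80.0]], [0, 0, 1, 1]
)
text_model = build_pipeline("text").fit(
    ["good great", "great fine", "bad awful", "awful poor"], [1, 1, 0, 0]
)
```

Keeping the preprocessing step inside the Pipeline means it is fitted only on whatever data is passed to fit(), which is what makes the same object safe to use under cross-validation.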
A train() function that takes (X_train, y_train) and a baseline_accuracy, and fits a LinearSVC classifier.
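A minimal sketch of what such a train() could look like for numeric features follows. The return value (a dict with the fitted model, the cross-validated accuracy, and a baseline comparison), the cv parameter, and the synthetic demo data are assumptions made for illustration; the question does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC


def train(X_train, y_train, baseline_accuracy, cv=5):
    """Fit a scaler + LinearSVC pipeline and compare CV accuracy to the baseline.

    The dict return shape is an assumption; the question does not define it.
    """
    model = Pipeline([
        ("scale", MaxAbsScaler()),
        ("clf", LinearSVC(C=1.0, random_state=0)),
    ])
    # The scaler lives inside the pipeline, so each CV fold fits it on that
    # fold's training portion only (no leakage from the validation portion).
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    cv_accuracy = float(np.mean(scores))
    # Refit on all training data before handing the model to test().
    model.fit(X_train, y_train)
    return {
        "model": model,
        "cv_accuracy": cv_accuracy,
        "beats_baseline": cv_accuracy > baseline_accuracy,
    }


# Synthetic data standing in for the real training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
result = train(X_demo, y_demo, baseline_accuracy=0.5)
```

Comparing the cross-validated estimate, rather than training accuracy, against baseline_accuracy gives a less optimistic signal of whether the hidden test set is likely to clear the bar.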
A test() function that evaluates the trained classifier on (X_test, y_test).
Consider model-side options (e.g., C, class_weight, feature selection) and data-side tweaks (e.g., scaling, TF–IDF, stop-word removal, deduplication).
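These options can be searched together with a grid over a single pipeline. The following is a sketch under stated assumptions: the toy corpus is invented, the grid values are arbitrary, and chi-squared SelectKBest is just one possible feature-selection step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, made up for illustration; the last two rows are exact duplicates.
texts = [
    "great movie with a wonderful cast", "wonderful acting and a great story",
    "I loved the brilliant soundtrack", "brilliant direction and lovely scenes",
    "a lovely, enjoyable film", "enjoyable plot and great pacing",
    "terrible movie with a boring cast", "boring acting and a terrible story",
    "I hated the awful soundtrack", "awful direction and dull scenes",
    "a dull, tedious film", "tedious plot and terrible pacing",
    "great movie with a wonderful cast", "terrible movie with a boring cast",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Data-side tweak: drop exact duplicate (text, label) rows before any split,
# so the same row cannot land in both a training and a validation fold.
unique_rows = list(dict.fromkeys(zip(texts, labels)))
texts_dedup = [t for t, _ in unique_rows]
labels_dedup = [y for _, y in unique_rows]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2)),
    ("clf", LinearSVC(random_state=0)),
])
param_grid = {
    "tfidf__stop_words": [None, "english"],   # stop-word removal on/off
    "select__k": [5, "all"],                  # feature selection
    "clf__C": [0.1, 1.0, 10.0],               # regularization strength
    "clf__class_weight": [None, "balanced"],  # class imbalance handling
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
search.fit(texts_dedup, labels_dedup)
```

After fitting, search.best_params_ and search.best_score_ report the winning combination and its mean cross-validated accuracy. Since the vectorizer and selector are pipeline steps, they are refitted inside every fold rather than on the full data.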
Use sklearn.svm.LinearSVC as the classifier.