Task: Train and Evaluate a LinearSVC to Beat a Baseline
Context
You are given a binary or multi-class classification dataset split into train and hidden test sets. An accuracy baseline (e.g., a previous model or a heuristic) is provided. Your goal is to implement train() and test() functions to:
- Train a LinearSVC model that beats the supplied baseline on the hidden test set.
- Use sound validation (e.g., cross-validation) without peeking at the hidden test set.
- Experiment with both model-level and data-level adjustments; document what you tried and which changes helped. Emphasize data-side tweaks where they prove effective.
Assume input features may be numeric (dense or sparse) and/or text. You may implement a flexible preprocessing pipeline or parameterize your functions to handle either case.
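For the mixed-input case, one minimal sketch of such a flexible preprocessing step is shown below; the make_preprocessor helper, its "kind" switch, and the column arguments are illustrative assumptions rather than part of the task.

```python
# Sketch of a flexible preprocessing step. The make_preprocessor helper, its
# "kind" switch, and the column arguments are assumptions for illustration.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler


def make_preprocessor(kind, text_column=None, numeric_columns=None):
    """Return a preprocessing transformer for text, numeric, or mixed input."""
    if kind == "text":
        # TF-IDF works directly on an iterable of raw strings.
        return TfidfVectorizer(sublinear_tf=True)
    if kind == "numeric":
        # with_mean=False keeps sparse matrices sparse while scaling.
        return StandardScaler(with_mean=False)
    if kind == "mixed":
        # Route text and numeric columns of a DataFrame to separate transformers.
        return ColumnTransformer([
            ("text", TfidfVectorizer(sublinear_tf=True), text_column),
            ("num", StandardScaler(with_mean=False), numeric_columns),
        ])
    raise ValueError(f"unknown feature kind: {kind}")
```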
Requirements
- Implement train() (a combined sketch of train() and test() follows this list) that:
  - Accepts training data (X_train, y_train) and a baseline_accuracy.
  - Builds a scikit-learn Pipeline with preprocessing and a LinearSVC classifier.
  - Tunes key hyperparameters via cross-validation (report CV scores and best params).
  - Returns a fitted pipeline and training diagnostics (e.g., CV accuracy).
- Implement test() that:
  - Takes the fitted pipeline and (X_test, y_test).
  - Returns test accuracy (and a brief report) to confirm the baseline was beaten.
- Experiments and reporting:
  - Try model-level tweaks (e.g., C, class_weight, feature selection) and data-side tweaks (e.g., scaling, TF-IDF, stop-word removal, deduplication); an illustrative data-side example follows the sketch below.
  - Briefly summarize which changes improved accuracy. Note that data-side tweaks were especially effective.
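As a point of reference, here is a minimal sketch of train() and test() under these requirements, assuming a preprocessor like the one sketched above is passed in; the exact signatures, the hyperparameter grid, and the diagnostics dictionary are assumptions to adapt, not a prescribed interface.

```python
# Minimal sketch of train()/test(). The preprocessor argument, the parameter
# grid, and the returned diagnostics are illustrative assumptions.
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def train(X_train, y_train, baseline_accuracy, preprocessor):
    pipe = Pipeline([
        ("prep", preprocessor),                 # fitted only on training folds
        ("clf", LinearSVC(max_iter=10000)),
    ])
    # Model-level tweaks explored with cross-validation on the training set only.
    param_grid = {
        "clf__C": [0.01, 0.1, 1.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    }
    search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    diagnostics = {
        "cv_accuracy": search.best_score_,
        "best_params": search.best_params_,
        "baseline_accuracy": baseline_accuracy,
        "beats_baseline_on_cv": search.best_score_ > baseline_accuracy,
    }
    return search.best_estimator_, diagnostics


def test(model, X_test, y_test, baseline_accuracy):
    # The hidden test set is evaluated exactly once, here.
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    return {
        "test_accuracy": acc,
        "beats_baseline": acc > baseline_accuracy,
        "report": classification_report(y_test, y_pred),
    }
```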
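Data-side tweaks can be compared with the same cross-validation loop; the snippet below (the helper name, the assumption that X is raw text, and the specific TF-IDF variants are all illustrative) shows two typical candidates: deduplicating training rows and toggling TF-IDF options.

```python
# Illustrative data-side tweaks; the helper name and the assumption that X is
# a list of raw text strings are assumptions, not part of the task.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def deduplicate(X_train, y_train):
    """Drop exact duplicate samples so repeats do not dominate CV folds."""
    df = pd.DataFrame({"x": list(X_train), "y": list(y_train)})
    df = df.drop_duplicates(subset="x")
    return df["x"].tolist(), df["y"].to_numpy()


# TF-IDF variants to swap into the pipeline and compare via cross-validation.
tfidf_variants = {
    "plain": TfidfVectorizer(),
    "stop_words": TfidfVectorizer(stop_words="english"),
    "sublinear_bigrams": TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
}
```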
Constraints
- Classifier must be sklearn.svm.LinearSVC.
- Use accuracy as the metric.
- Avoid data leakage: fit all preprocessing only on training folds.
- Use cross-validation on the training set; touch the hidden test set only once, at the very end.
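One way to satisfy both leakage constraints at once is to keep every preprocessing step inside the Pipeline and score the whole pipeline with cross_val_score, so each fold refits the preprocessor on its own training portion. A brief sketch follows; the toy data is only there to make the snippet self-contained.

```python
# Preprocessing lives inside the Pipeline, so cross_val_score refits it on each
# fold's training portion only; nothing is learned from validation folds.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data so the snippet runs standalone; substitute the real train/test split.
texts = ["good movie", "great film", "bad plot", "terrible acting",
         "loved it", "hated it", "fantastic story", "awful pacing"] * 5
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 5
X_train, X_test = texts[:32], texts[32:]
y_train, y_test = labels[:32], labels[32:]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(max_iter=10000)),
])
cv_scores = cross_val_score(pipe, X_train, y_train, scoring="accuracy", cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# The hidden test set is scored exactly once, after all model selection is done.
pipe.fit(X_train, y_train)
print("Test accuracy: %.3f" % pipe.score(X_test, y_test))
```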