Take‑Home: Build a strong LinearSVC pipeline that beats a baseline and generalizes
Problem
You are given training features X_train and labels y_train and must implement:
- train(X_train, y_train): trains a LinearSVC-based pipeline that beats a provided baseline accuracy, improving only preprocessing, feature engineering, and hyperparameters. It must be reproducible and robust to mixed data types (numeric, categorical, text).
- test(X_test): loads the trained artifact and produces predictions for X_test without any data leakage (a minimal interface sketch follows this list).
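A minimal sketch of the required interface, assuming joblib is available for persisting the trained artifact; the path model.joblib and the placeholder scaler (which assumes all-numeric input) are illustrative, and the full mixed-type preprocessing is sketched under Constraints:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

ARTIFACT_PATH = "model.joblib"  # assumed artifact location

def train(X_train, y_train):
    # The classifier family is fixed to LinearSVC; only preprocessing and
    # hyperparameters may vary. random_state keeps the run reproducible.
    pipe = Pipeline([
        ("scale", StandardScaler()),  # placeholder: assumes all-numeric X
        ("clf", LinearSVC(C=1.0, random_state=42)),
    ])
    pipe.fit(X_train, y_train)
    joblib.dump(pipe, ARTIFACT_PATH)
    return pipe

def test(X_test):
    # Load the persisted pipeline and predict; no refitting on test data,
    # so there is no test-time leakage.
    pipe = joblib.load(ARTIFACT_PATH)
    return pipe.predict(X_test)
```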
Constraints:
- The model class is fixed: LinearSVC. You may not change the final classifier family.
- You may change preprocessing and feature engineering (e.g., standardization/normalization, tokenization + TF-IDF or hashing for text, dimensionality reduction, outlier handling, class-imbalance strategies, feature crosses, data cleaning); a preprocessing sketch follows this list.
- You must explain and implement a validation strategy that tunes hyperparameters without consulting test accuracy once the baseline is exceeded (e.g., k-fold CV, nested CV, or a holdout set) while preventing data leakage.
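One way to satisfy these constraints is a ColumnTransformer that scales numerics, one-hot encodes categoricals, and TF-IDF-vectorizes text, all inside a single Pipeline ending in LinearSVC. This is a sketch, not a prescribed solution; the column lists NUM_COLS, CAT_COLS, and TEXT_COL are hypothetical (the Assumptions section sketches one way to auto-detect them):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC

NUM_COLS = ["age", "income"]   # hypothetical numeric columns
CAT_COLS = ["city"]            # hypothetical low-cardinality columns
TEXT_COL = "description"       # hypothetical free-form text column

preprocess = ColumnTransformer([
    # Impute then scale numerics; LinearSVC is sensitive to feature scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), NUM_COLS),
    # One-hot categoricals; ignore categories unseen at training time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    # TF-IDF for text; sublinear_tf often helps linear models.
    ("txt", TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)), TEXT_COL),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LinearSVC(class_weight="balanced", random_state=42)),
])
```

Keeping every fitted transformer inside the pipeline is itself the leakage guard: nothing is fit on data outside the current training fold.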
Deliverables:
- Reproducible code for train(X_train, y_train) and test(X_test) using LinearSVC, with robust preprocessing for typical tabular/text tasks.
- A justification of the data-centric improvements you applied.
- A tuning plan that does not consult test accuracy once the baseline is exceeded, with a clear guard against leakage (see the tuning sketch after this list).
- An experiment log that records the baseline, CV scores, chosen hyperparameters, and final training details.
- A plan for measuring out-of-sample generalization.
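A sketch of one such tuning plan, assuming the `model` pipeline from the preprocessing sketch above and the given X_train/y_train. Because all preprocessing lives inside the pipeline, each CV fold fits its own scaler and vectorizer, and the test set is never consulted:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],           # regularization strength
    "prep__txt__ngram_range": [(1, 1), (1, 2)],  # text feature richness
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

# Experiment-log entries: mean CV score and chosen hyperparameters.
print("best CV accuracy:", search.best_score_)
print("best params:", search.best_params_)
```

The CV scores stand in for out-of-sample accuracy during tuning; a held-out split touched only once at the very end can serve as the final generalization measurement.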
Assumptions (for completeness):
- X_train/X_test are pandas DataFrames with a mix of numeric, categorical (string/low-cardinality), and possibly text columns (free-form strings). y_train is a 1D array-like of class labels (binary or multi-class). If your dataset is purely tabular or purely text, the pipeline adapts by auto-detecting column types (one heuristic is sketched below).
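One heuristic for that auto-detection, assuming string columns split into categorical vs. free text by unique-value count; the cardinality threshold of 30 is an illustrative assumption, not a universal rule:

```python
import pandas as pd

def split_columns(X: pd.DataFrame, max_card: int = 30):
    # Partition columns into numeric, categorical, and text groups.
    num_cols, cat_cols, text_cols = [], [], []
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            num_cols.append(col)
        elif X[col].nunique() <= max_card:
            cat_cols.append(col)   # low-cardinality strings -> categorical
        else:
            text_cols.append(col)  # high-cardinality strings -> free text
    return num_cols, cat_cols, text_cols
```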