Train LinearSVC to beat a hidden baseline
Company: DRW
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Take-home Project
##### Question
You are given a dataset and a fixed model class: `LinearSVC`. Implement `train(X_train, y_train)` and `test(X_test)` so that the model's accuracy on a held-out (hidden) test set beats a provided baseline accuracy. The model class is fixed — you may only modify preprocessing, feature engineering, and training hyperparameters; you may not switch to a different estimator.
Address the following:
1. **Implement `train()` and `test()`.** `train(X_train, y_train)` fits the pipeline; `test(X_test)` returns predictions (and a way to report accuracy) on data it has never seen. Keep the final classifier strictly `LinearSVC`.
2. **Propose and justify data-centric improvements.** Experiment with adjustments and explain why each helps a linear large-margin model — e.g. standardization/normalization, robust scaling, outlier handling, TF–IDF or hashing for text, tokenization/n-grams, one-hot or frequency encoding for categoricals, dimensionality reduction / feature selection, feature crosses, deduplication and label-noise cleanup, and class-imbalance strategies (`class_weight='balanced'`, threshold tuning on margins). Note which side — data tweaks vs. model tweaks — moved accuracy more.
3. **Handle mixed-type data.** Build one pipeline that correctly routes numeric, categorical, and free-text columns.
4. **Tune without peeking at the test set.** Describe a robust validation strategy (stratified k-fold, nested CV, or a held-out set) that lets you tune and decide you've beaten the baseline using only the training data, then freeze the configuration. Prevent data leakage by fitting all preprocessing inside the training folds only.
5. **Make it reproducible and measurable.** Provide reproducible code (fixed seeds, saved artifacts), an experiment log, and a plan for estimating generalization (CV mean ± std, confidence intervals, robustness checks).
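Items 3 and 4 above can be combined in one sketch: a `ColumnTransformer` routes each column type to its own preprocessing, and a stratified grid search tunes `LinearSVC` with every transformer refit inside each training fold. The helper names `build_pipeline` and `tune`, the column names passed to them, and the grid values are hypothetical. The sketch also assumes the text column contains no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC


def build_pipeline(numeric_cols, categorical_cols, text_col):
    """Route numeric, categorical, and free-text columns into LinearSVC."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    routes = ColumnTransformer([
        ("num", numeric, numeric_cols),
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # A single column name (not a list) hands the vectorizer 1-D text.
        ("txt", TfidfVectorizer(ngram_range=(1, 2)), text_col),
    ])
    return Pipeline([
        ("features", routes),
        ("clf", LinearSVC(random_state=0, max_iter=10000)),
    ])


def tune(X, y, numeric_cols, categorical_cols, text_col, n_splits=5):
    """Grid-search C and class_weight using stratified k-fold on training data."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    grid = {
        "clf__C": [0.01, 0.1, 1.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    }
    search = GridSearchCV(
        build_pipeline(numeric_cols, categorical_cols, text_col),
        grid,
        cv=cv,
        scoring="accuracy",
    )
    # Each fold refits the whole pipeline, so no statistics leak across folds.
    search.fit(X, y)
    return search
```

After the search, `search.best_params_` is the configuration to freeze, and `search.best_estimator_` is the pipeline refit on all training data. Note that `best_score_` is optimistically biased because it was used for selection; nested CV or a separate held-out split gives a cleaner estimate.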
Quick Answer: This question evaluates ML product requirements, data/labeling, modeling, serving architecture, evaluation, monitoring, and trade-offs in a realistic interview setting. A strong answer states its assumptions, handles edge cases, explains trade-offs, and shows clearly how the result was validated.
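One way to report generalization as "CV mean ± std" and compare it with the baseline is sketched below. The helper name `cv_report` and its `baseline_acc` argument are hypothetical, and the interval is a rough normal approximation: fold scores share training data and are correlated, so it should be read as a sanity check rather than a strict confidence interval.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def cv_report(model, X, y, baseline_acc, n_splits=5, n_repeats=3, seed=0):
    """Summarize repeated stratified CV accuracy against a baseline."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean = float(scores.mean())
    std = float(scores.std(ddof=1))
    # Approximate 95% half-width; optimistic because folds overlap.
    half_width = 1.96 * std / np.sqrt(len(scores))
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
        # Conservative rule: the lower bound must clear the baseline.
        "beats_baseline": bool(mean - half_width > baseline_acc),
    }
```

Logging the returned dictionary together with the seed and the pipeline parameters for each run gives a simple experiment log. Requiring the lower bound, rather than the mean, to exceed the baseline is a deliberately cautious design choice.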