
Train LinearSVC to beat a hidden baseline

Last updated: Jun 15, 2026

Quick Overview

Train LinearSVC to beat a hidden baseline evaluates ML product requirements, data/labeling, modeling, serving architecture, evaluation, monitoring, and trade-offs in a realistic interview setting. A strong answer states assumptions, handles edge cases, explains trade-offs, and shows how to validate the result clearly.



Company: DRW

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Take-home Project



Jul 29, 2025, 12:00 AM

Train LinearSVC to beat a hidden baseline

You are given a dataset and a fixed model class: `LinearSVC`. Implement `train(X_train, y_train)` and `test(X_test)` so that the model's accuracy on a held-out (hidden) test set beats a provided baseline accuracy. The model class is fixed: you may only modify preprocessing, feature engineering, and training hyperparameters; you may not switch to a different estimator.

Address the following:

  1. Implement `train()` and `test()`. `train(X_train, y_train)` fits the pipeline; `test(X_test)` returns predictions (and a way to report accuracy) on data it has never seen. Keep the final classifier strictly `LinearSVC`.
  2. Propose and justify data-centric improvements. Experiment with adjustments and explain why each helps a linear large-margin model, e.g. standardization/normalization, robust scaling, outlier handling, TF-IDF or hashing for text, tokenization/n-grams, one-hot or frequency encoding for categoricals, dimensionality reduction / feature selection, feature crosses, deduplication and label-noise cleanup, and class-imbalance strategies (`class_weight='balanced'`, threshold tuning on margins). Note which side, data tweaks or model tweaks, moved accuracy more.
  3. Handle mixed-type data. Build one pipeline that correctly routes numeric, categorical, and free-text columns.
  4. Tune without peeking at the test set. Describe a robust validation strategy (stratified k-fold, nested CV, or a held-out set) that lets you tune and decide you've beaten the baseline using only the training data, then freeze the configuration. Prevent data leakage by fitting all preprocessing inside the training folds only.
  5. Make it reproducible and measurable. Provide reproducible code (fixed seeds, saved artifacts), an experiment log, and a plan for estimating generalization (CV mean ± std, confidence intervals, robustness checks).

Constraints & Assumptions

  • Preserve the scope, facts, inputs, and requested outputs from the prompt above.
  • If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
  • Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.

Clarifying Questions to Ask

  • Clarify users, core use cases, read/write patterns, scale, latency, availability, and data retention.
  • State explicit assumptions before making sizing or architecture decisions.
  • Prioritize the functional path first, then address reliability, security, observability, and rollout.

What a Strong Answer Covers

  • A scoped requirements summary with concrete non-goals and success metrics.
  • ML-specific data, model, evaluation, serving, and monitoring choices.
  • Reasoned trade-offs among simple and scalable designs, including bottlenecks and failure modes.
  • A validation, monitoring, migration, and launch plan appropriate for the risk level.
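
For this question, the evaluation and validation items above could be made concrete with nested, stratified cross-validation. The sketch below is a hedged illustration: the synthetic data from `make_classification` stands in for the real training set, `BASELINE_ACC` is a hypothetical placeholder for the provided baseline, and the "mean minus one standard deviation" decision rule is one conservative heuristic, not something the prompt prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

SEED = 0             # fixed seed for reproducibility
BASELINE_ACC = 0.80  # hypothetical baseline; use the provided value

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=SEED)

# Scaling sits inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LinearSVC(random_state=SEED, max_iter=5000)),
])
grid = {
    "clf__C": [0.01, 0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED + 1)

# Inner loop picks hyperparameters; outer loop scores the whole
# tuning procedure on folds it never tuned on.
search = GridSearchCV(pipe, grid, scoring="accuracy", cv=inner)
nested_scores = cross_val_score(search, X, y, scoring="accuracy", cv=outer)

mean, std = nested_scores.mean(), nested_scores.std(ddof=1)
print(f"nested CV accuracy: {mean:.3f} ± {std:.3f}")

# Conservative, hypothetical rule for "beats the baseline".
beats_baseline = (mean - std) > BASELINE_ACC

# Freeze the configuration: refit the search on all training data.
final_model = search.fit(X, y).best_estimator_
```

The frozen `final_model` can then be persisted (for example with `joblib.dump`) and logged alongside the seed, the grid, and the per-fold scores, so the experiment can be rerun and audited.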

Follow-up Questions

  • What breaks first at 10x traffic or data volume?
  • How would you degrade gracefully during dependency failures?
  • What metrics and alerts would prove the design is healthy after launch?
