Design regression and classification ML pipelines
Company: Citadel
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Take-home Project
Design and implement two end-to-end machine learning workflows on tabular data similar to common Kaggle datasets:
(1) a regression task predicting a continuous target, and
(2) a classification task predicting a binary or multiclass label.
For each task, describe and execute: data cleaning (detect/handle missing values and outliers; encode categorical features; scale where appropriate); random shuffling with proper train/validation/test splits that avoid leakage (note time-series caveats if applicable); selection of a simple baseline and at least one stronger model (e.g., regularized linear models, tree-based methods); evaluation metrics (e.g., RMSE/MAE for regression; accuracy/ROC-AUC/F1 for classification) and why they fit the objective; cross-validation and hyperparameter tuning strategy; steps to ensure reproducibility (seeds, environment, data versioning) and interpretability (feature importance, partial dependence, calibration).
Provide pseudocode or code-level steps and discuss expected pitfalls and how you would debug underperformance.
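The steps listed above could be sketched roughly as follows with scikit-learn. This is a minimal illustration, not a reference solution: the synthetic table, its column names (`num_a`, `num_b`, `cat`), the helper names, and the hyperparameter grids are all assumptions made for the example. The key design point is that imputation, scaling, and encoding live inside a `Pipeline`, so they are refit on each training fold and never see validation or test rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 0  # fixed seed so splits and synthetic data are reproducible


def make_toy_frame(n=600, seed=SEED):
    """Synthetic stand-in for a Kaggle-style table (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "num_a": rng.normal(size=n),
        "num_b": rng.normal(size=n),
        "cat": rng.choice(["x", "y", "z"], size=n),
    })
    signal = 2.0 * df["num_a"] - df["num_b"] + 1.5 * (df["cat"] == "x")
    df["y_reg"] = signal + rng.normal(scale=0.5, size=n)
    df["y_clf"] = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
    # Inject ~10% missing values into one numeric feature.
    df.loc[rng.random(n) < 0.1, "num_a"] = np.nan
    return df


def make_preprocessor():
    """Impute + scale numeric columns, one-hot encode the categorical one."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", numeric, ["num_a", "num_b"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ])


def run(task):
    """Fit a baseline and a tuned model; return held-out test scores."""
    df = make_toy_frame()
    X = df[["num_a", "num_b", "cat"]]
    is_reg = task == "regression"
    y = df["y_reg"] if is_reg else df["y_clf"]
    # Shuffle and hold out a test set; stratify only for classification.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=SEED,
        stratify=None if is_reg else y,
    )
    if is_reg:
        dummy, model = DummyRegressor(strategy="mean"), Ridge()
        grid, scoring = {"model__alpha": [0.1, 1.0, 10.0]}, "neg_root_mean_squared_error"
    else:
        dummy, model = DummyClassifier(strategy="prior"), LogisticRegression(max_iter=1000)
        grid, scoring = {"model__C": [0.1, 1.0, 10.0]}, "roc_auc"

    baseline = Pipeline([("prep", make_preprocessor()), ("model", dummy)])
    baseline.fit(X_tr, y_tr)
    # 5-fold CV on the training split only; preprocessing is refit per fold.
    search = GridSearchCV(
        Pipeline([("prep", make_preprocessor()), ("model", model)]),
        grid, cv=5, scoring=scoring,
    )
    search.fit(X_tr, y_tr)

    def score(est):
        if is_reg:  # RMSE, lower is better
            return float(np.sqrt(mean_squared_error(y_te, est.predict(X_te))))
        # ROC-AUC, higher is better
        return float(roc_auc_score(y_te, est.predict_proba(X_te)[:, 1]))

    return {"baseline": score(baseline), "model": score(search)}


if __name__ == "__main__":
    print("regression RMSE:", run("regression"))
    print("classification ROC-AUC:", run("classification"))
```

If the tuned model does not clearly beat the dummy baseline, that is usually the first debugging signal: check the target definition, the split, and whether preprocessing is silently dropping signal. For time-ordered data, the shuffled `train_test_split` here would leak future information, and a chronological split (for example scikit-learn's `TimeSeriesSplit`) would be the safer choice.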
Quick Answer: This task evaluates proficiency in designing and implementing end-to-end machine learning pipelines for tabular regression and classification, encompassing data cleaning, feature engineering, model selection, evaluation metrics, reproducibility, and interpretability.
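The reproducibility and interpretability parts of the task can be illustrated with a small sketch. It assumes a synthetic three-feature dataset in which only the first feature carries signal, and a hypothetical helper `fit_and_explain`; a real submission would apply the same checks to the actual fitted pipeline. It uses permutation importance on held-out data as the feature-importance measure and the Brier score as a simple calibration summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

SEED = 0  # one seed threaded through data, split, model, and permutation


def fit_and_explain(seed=SEED):
    """Fit a forest on toy data; return importances, Brier score, and probabilities."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(500, 3))
    # Only feature 0 drives the label (assumption of this toy setup).
    y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    # Importance measured on held-out rows, not on the training data.
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=seed)
    proba = model.predict_proba(X_te)[:, 1]
    return imp.importances_mean, brier_score_loss(y_te, proba), proba


if __name__ == "__main__":
    importances, brier, _ = fit_and_explain()
    print("permutation importances:", importances)
    print("Brier score:", brier)
```

Running the function twice with the same seed and comparing the predicted probabilities is a cheap reproducibility check; any difference points to an unseeded source of randomness.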