Design regression and classification ML pipelines

This is a Machine Learning interview question from Citadel for Data Scientist roles.

Question

Take‑Home: Two End‑to‑End ML Workflows on Tabular Data

Objective

Design and implement two complete machine learning workflows on tabular data (typical of common Kaggle datasets):

Regression: predict a continuous target.
Classification: predict a binary or multiclass label.

Assume you have a generic CSV dataset with a mix of numeric and categorical features and a clear target column. If the data are time‑ordered, note time‑series‑specific caveats.

Requirements (for each task)

Data cleaning and preprocessing
- Detect and handle missing values.
- Detect and handle outliers.
- Encode categorical features appropriately.
- Scale features where appropriate.
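A minimal sketch of the preprocessing steps above, written in plain Python so that the logic is visible. It assumes missing values are represented as None, uses a 1.5 × IQR rule for outliers, and learns every statistic from the training split only. The function names (fit_numeric, transform_numeric, fit_categories, one_hot) are hypothetical; in practice a library pipeline would usually replace them.

```python
from statistics import mean, median, pstdev, quantiles


def fit_numeric(train_values, iqr_factor=1.5):
    """Learn imputation, clipping, and scaling parameters from training values only."""
    observed = [v for v in train_values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low = q1 - iqr_factor * (q3 - q1)
    high = q3 + iqr_factor * (q3 - q1)
    fill = median(observed)
    # Impute, then clip, so the scaling statistics match what transform sees.
    prepared = [min(max(fill if v is None else v, low), high) for v in train_values]
    return {
        "fill": fill,
        "low": low,
        "high": high,
        "mean": mean(prepared),
        "std": pstdev(prepared) or 1.0,  # guard against constant columns
    }


def transform_numeric(values, params):
    """Apply the learned parameters to any split (train, validation, or test)."""
    out = []
    for v in values:
        v = params["fill"] if v is None else v
        v = min(max(v, params["low"]), params["high"])
        out.append((v - params["mean"]) / params["std"])
    return out


def fit_categories(train_values):
    """Collect the category vocabulary seen in training."""
    return sorted({v for v in train_values if v is not None})


def one_hot(values, categories):
    """One-hot encode; unseen or missing categories become an all-zero row."""
    return [[1 if v == c else 0 for c in categories] for v in values]
```

Fitting on the training split and only transforming the validation and test splits is what keeps test-set statistics from leaking into the model.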
Train/validation/test protocol
- Random shuffling and splits that avoid leakage (fit all preprocessing on the training split only).
- If time‑series or grouped data, use proper split strategies (e.g., forward chaining, GroupKFold).
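The two split strategies named above can be sketched in plain Python as follows. This assumes rows are already sorted by time for forward chaining, and it assigns groups to folds round-robin, which is simpler than the size balancing a library implementation may perform. The names forward_chaining_splits and group_folds are hypothetical.

```python
def forward_chaining_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, val_idx) pairs where training rows always precede validation rows."""
    fold_size = (n_samples - min_train) // n_folds
    if min_train < 1 or fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        # The last fold absorbs any remainder rows.
        val_end = n_samples if k == n_folds - 1 else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


def group_folds(groups, n_folds):
    """Yield (train_idx, val_idx) pairs where no group appears on both sides."""
    unique_groups = sorted(set(groups))
    # Round-robin assignment of whole groups to folds (an illustrative choice).
    fold_of = {g: i % n_folds for i, g in enumerate(unique_groups)}
    for k in range(n_folds):
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != k]
        val_idx = [i for i, g in enumerate(groups) if fold_of[g] == k]
        yield train_idx, val_idx


if __name__ == "__main__":
    for train_idx, val_idx in forward_chaining_splits(10, 3, min_train=4):
        print("train", train_idx, "validate", val_idx)
```

In both cases the validation rows are ones the model could not have seen: later in time for forward chaining, or from entirely held-out groups for the grouped split.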
Models
- A simple baseline (e.g., dummy predictor or regularized linear model).
- At least one stronger model (e.g., tree‑based, boosted trees).
Evaluation
- Regression: RMSE/MAE (and why).
- Classification: accuracy, ROC‑AUC, F1 (and why). Use PR‑AUC for heavy class imbalance.
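To make the "and why" concrete, here is a small sketch of these metrics in plain Python. It assumes binary labels with 1 as the positive class, and it computes ROC-AUC directly from its pairwise definition (the fraction of positive/negative pairs ranked correctly, ties counted as half), which is fine for illustration but quadratic in the number of rows. The function names are illustrative rather than any library's API.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Same MAE, different RMSE: one large error versus four small ones.
    print(mae([0, 0, 0, 0], [1, 1, 1, 1]), rmse([0, 0, 0, 0], [1, 1, 1, 1]))
    print(mae([0, 0, 0, 0], [0, 0, 0, 4]), rmse([0, 0, 0, 0], [0, 0, 0, 4]))
```

Two contrasts are worth noting. Concentrating the same total absolute error in one prediction leaves MAE unchanged but raises RMSE. On a dataset with 5% positives, predicting the majority class everywhere yields high accuracy and an F1 of zero, which is why accuracy alone is not enough under imbalance.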
Model selection
- Cross‑validation strategy and hyperparameter tuning.
Reproducibility
- Random seeds, environment pinning, data versioning. Persist splits, models, and configs.
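One possible way to make splits and configs reproducible, sketched with the Python standard library only. It assumes rows are identified by their position in the CSV, that the split is stored as JSON next to the model artifacts, and that a SHA-256 fingerprint of the config is an acceptable version tag. The names make_split and fingerprint are hypothetical.

```python
import hashlib
import json
import random


def make_split(n_samples, seed, test_fraction=0.2):
    """Create a seeded train/test split of row indices that can be persisted."""
    rng = random.Random(seed)  # local generator: does not touch global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return {
        "seed": seed,
        "test": sorted(indices[:n_test]),
        "train": sorted(indices[n_test:]),
    }


def fingerprint(obj):
    """Stable SHA-256 hash of a JSON-serializable object (key order is ignored)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    split = make_split(n_samples=100, seed=42)
    config = {"model": "baseline", "seed": 42, "test_fraction": 0.2}
    record = {"config": config, "config_hash": fingerprint(config), "split": split}
    # Writing this record to disk lets a later run reload the exact same rows.
    print(json.dumps(record)[:80], "...")
```

Persisting the indices themselves, rather than only the seed, protects against changes in library shuffling behavior between versions.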
Interpretability and reliability
- Feature importance and partial dependence (or SHAP if available).
- Calibration checks for classification.
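A calibration check can be sketched as follows: bin the predicted probabilities, compare the mean prediction in each bin to the observed positive rate, and summarize with the Brier score and an expected calibration error. This is an illustrative version assuming binary labels in {0, 1} and equal-width bins; the function names are hypothetical.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def reliability_table(y_true, probs, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            table.append((mean_p, rate, len(members)))
    return table


def expected_calibration_error(table):
    """Count-weighted average gap between predicted and observed rates."""
    total = sum(count for _, _, count in table)
    return sum(count / total * abs(mean_p - rate) for mean_p, rate, count in table)


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    p = [0.1, 0.1, 0.9, 0.9]
    table = reliability_table(y, p, n_bins=2)
    print(table, brier_score(y, p), expected_calibration_error(table))
```

A well-calibrated classifier has rows where the mean predicted probability is close to the observed rate; large gaps in specific bins suggest recalibrating on a held-out split.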
Deliverables
- Pseudocode or code‑level steps for both workflows.
- Discussion of expected pitfalls and how you would debug underperformance.
