PracHub

Design regression and classification ML pipelines

Last updated: Mar 29, 2026

Quick Overview

This task evaluates proficiency in designing and implementing end-to-end machine learning pipelines for tabular regression and classification, encompassing data cleaning, feature engineering, model selection, evaluation metrics, reproducibility, and interpretability.



Company: Citadel

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Take-home Project

Design and implement two end-to-end machine learning workflows on tabular data similar to common Kaggle datasets: (1) a regression task predicting a continuous target, and (2) a classification task predicting a binary or multiclass label. For each task, describe and execute: data cleaning (detect/handle missing values and outliers; encode categorical features; scale where appropriate); random shuffling with proper train/validation/test splits that avoid leakage (note time-series caveats if applicable); selection of a simple baseline and at least one stronger model (e.g., regularized linear models, tree-based methods); evaluation metrics (e.g., RMSE/MAE for regression; accuracy/ROC-AUC/F1 for classification) and why they fit the objective; cross-validation and hyperparameter tuning strategy; steps to ensure reproducibility (seeds, environment, data versioning) and interpretability (feature importance, partial dependence, calibration). Provide pseudocode or code-level steps and discuss expected pitfalls and how you would debug underperformance.


Sep 6, 2025, 12:00 AM

Take‑Home: Two End‑to‑End ML Workflows on Tabular Data

Objective

Design and implement two complete machine learning workflows on tabular data (typical of common Kaggle datasets):

  • Regression: predict a continuous target.
  • Classification: predict a binary or multiclass label.

Assume you have a generic CSV dataset with a mix of numeric and categorical features and a clear target column. If the data are time‑ordered, note time‑series‑specific caveats.
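For the time-ordered case, one way to make the caveat concrete is a forward-chaining splitter: each fold trains on an expanding prefix of the rows and validates on the block that follows it. The sketch below is illustrative only; the function name `forward_chaining_splits` and its `gap` parameter are assumptions introduced here, and it presumes the rows are already sorted by time.

```python
from typing import Iterator, List, Tuple


def forward_chaining_splits(
    n_samples: int, n_folds: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) pairs over rows already sorted by time.

    Each fold trains on an expanding prefix and validates on the block that
    follows it, optionally skipping `gap` rows to limit look-ahead leakage.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)  # size of each validation block
    for k in range(1, n_folds + 1):
        train_end = k * block
        val_start = train_end + gap
        # The last fold absorbs any leftover rows at the end of the series.
        val_end = n_samples if k == n_folds else (k + 1) * block
        if val_start >= val_end:
            raise ValueError("gap leaves no validation rows")
        yield list(range(train_end)), list(range(val_start, val_end))
```

With 10 rows and 3 folds, the folds train on rows 0-1, 0-3, and 0-5 and validate on rows 2-3, 4-5, and 6-9, so no validation row ever precedes a training row. scikit-learn's `TimeSeriesSplit` provides a comparable splitter if a library implementation is preferred.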

Requirements (for each task)

  1. Data cleaning and preprocessing
    • Detect and handle missing values.
    • Detect and handle outliers.
    • Encode categorical features appropriately.
    • Scale features where appropriate.
  2. Train/validation/test protocol
    • Random shuffling and splits that avoid leakage.
    • If time‑series or grouped data, use proper split strategies (e.g., forward chaining, GroupKFold).
  3. Models
    • A simple baseline (e.g., dummy predictor or regularized linear model).
    • At least one stronger model (e.g., tree‑based, boosted trees).
  4. Evaluation
    • Regression: RMSE and MAE, with a justification for why each fits the objective.
    • Classification: accuracy, ROC‑AUC, and F1, with a justification for each. Use PR‑AUC under heavy class imbalance.
  5. Model selection
    • Cross‑validation strategy and hyperparameter tuning.
  6. Reproducibility
    • Random seeds, environment pinning, data versioning. Persist splits, models, and configs.
  7. Interpretability and reliability
    • Feature importance and partial dependence (or SHAP if available).
    • Calibration checks for classification.
  8. Deliverables
    • Pseudocode or code‑level steps for both workflows.
    • Discussion of expected pitfalls and how you would debug underperformance.
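
As a starting point for the code-level deliverable, the requirements above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not a reference solution: the toy frame, the column names (`x1`, `x2`, `cat`, `y_reg`, `y_clf`), and the helpers `make_toy_frame`, `preprocessor`, and `run_task` are all hypothetical stand-ins for a real CSV and project layout. It covers median imputation, scaling, one-hot encoding, a seeded hold-out split, a dummy baseline, a regularized linear model, and a cross-validated random forest, and it omits outlier handling, calibration, and interpretability.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import f1_score, mean_absolute_error, mean_squared_error, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 0  # one fixed seed for data generation, splits, and models
FEATURES = ["x1", "x2", "cat"]  # hypothetical column names


def make_toy_frame(n=400, seed=SEED):
    """Hypothetical stand-in for the CSV: two numeric columns, one categorical."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "cat": rng.choice(["a", "b", "c"], size=n),
    })
    signal = 2.0 * df["x1"] - df["x2"] + (df["cat"] == "a") * 1.5
    df["y_reg"] = signal + rng.normal(scale=0.5, size=n)
    df["y_clf"] = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
    df.loc[rng.choice(n, size=20, replace=False), "x1"] = np.nan  # inject missing values
    return df


def preprocessor():
    """Impute and scale numeric columns; one-hot encode the categorical column."""
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    return ColumnTransformer([
        ("num", numeric, ["x1", "x2"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ])


def run_task(df, target, task):
    """Fit a baseline, a linear model, and a tuned forest; return test metrics."""
    X, y = df[FEATURES], df[target]
    is_clf = task == "classification"
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=SEED, stratify=y if is_clf else None)
    if is_clf:
        models = {"baseline": DummyClassifier(strategy="prior"),
                  "linear": LogisticRegression(max_iter=1000),
                  "forest": RandomForestClassifier(n_estimators=50, random_state=SEED)}
        scoring = "roc_auc"
    else:
        models = {"baseline": DummyRegressor(strategy="mean"),
                  "linear": Ridge(alpha=1.0),
                  "forest": RandomForestRegressor(n_estimators=50, random_state=SEED)}
        scoring = "neg_root_mean_squared_error"
    results = {}
    for name, model in models.items():
        # Preprocessing lives inside the pipeline, so it is fit on training data only.
        pipe = Pipeline([("prep", preprocessor()), ("model", model)])
        if name == "forest":
            # Tune on the training split; each CV fold refits the preprocessing.
            pipe = GridSearchCV(pipe, {"model__max_depth": [3, None]},
                                scoring=scoring, cv=3)
        pipe.fit(X_tr, y_tr)
        pred = pipe.predict(X_te)
        if is_clf:
            proba = pipe.predict_proba(X_te)[:, 1]
            results[name] = {"roc_auc": roc_auc_score(y_te, proba),
                             "f1": f1_score(y_te, pred)}
        else:
            results[name] = {"rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
                             "mae": mean_absolute_error(y_te, pred)}
    return results
```

Calling `run_task(make_toy_frame(), "y_reg", "regression")` or `run_task(make_toy_frame(), "y_clf", "classification")` returns a dictionary of test metrics per model. Keeping the imputer, scaler, and encoder inside the `Pipeline` is the design choice that matters most here: statistics are learned from training folds only, which avoids the most common form of leakage. Comparing every model against the dummy baseline is also a quick first check when debugging underperformance.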

