PracHub
QuestionsPremiumLearningGuidesInterview PrepCoaches
|Home/Machine Learning/Citadel

Build a baseline linear regression pipeline

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's ability to construct a leakage-safe preprocessing and modeling pipeline for tabular data, including imputation, categorical encoding, feature scaling, ordinary least squares linear regression, and interpretation of coefficients and performance metrics.

  • medium
  • Citadel
  • Machine Learning
  • Data Scientist

Build a baseline linear regression pipeline

Company: Citadel

Role: Data Scientist

Category: Machine Learning

Difficulty: medium

Interview Round: Technical Screen

Build a baseline linear regression pipeline in Python without code-completion tools: split the data into train/validation sets with a fixed random seed, impute missing values, one-hot encode categorical features, scale numeric features, train an ordinary least squares model, and evaluate it using RMSE and R^2 on the validation set while avoiding data leakage (fit transformers only on training data). Output the fitted coefficients, key metrics, and a brief interpretation of results.

Quick Answer: This question evaluates a candidate's ability to construct a leakage-safe preprocessing and modeling pipeline for tabular data, including imputation, categorical encoding, feature scaling, ordinary least squares linear regression, and interpretation of coefficients and performance metrics.

Related Interview Questions

  • Analyze Correlations and Generate Gaussians - Citadel (medium)
  • Determine When a Quadratic Has Finite Minimum - Citadel (medium)
  • Choose models for trading tasks - Citadel (hard)
  • Estimate OLS via streaming sufficient statistics - Citadel (hard)
  • Design city home-price prediction system - Citadel (hard)
Citadel logo
Citadel
Aug 8, 2025, 12:00 AM
Data Scientist
Technical Screen
Machine Learning
5
0

Task: Baseline Linear Regression Pipeline (Python)

Context

You are given a tabular dataset in a pandas DataFrame df. The goal is to predict a continuous target column target. Build a leakage-safe baseline linear regression pipeline and report performance and coefficients.

Requirements

  1. Split the data into train/validation sets (e.g., 80/20) using a fixed random seed.
  2. Avoid data leakage: fit imputers/encoders/scalers only on training data by using a single sklearn Pipeline.
  3. Preprocess features:
    • Impute missing values: median for numeric, most_frequent for categorical.
    • One-hot encode categorical features (drop one level to avoid multicollinearity; ignore unknown categories at validation).
    • Scale numeric features (standardization).
  4. Train an ordinary least squares linear regression model.
  5. Evaluate on the validation set using RMSE and R^2.
  6. Output:
    • Fitted coefficients with feature names (and intercept).
    • Key metrics (RMSE, R^2).
    • A brief interpretation of results.

Assumptions

  • DataFrame: df with target column named target; all remaining columns are features.

Deliverable

Provide clean, runnable Python code plus a short interpretation of the metrics and coefficients.

Solution

Show

Comments (0)

Sign in to leave a comment

Loading comments...

Browse More Questions

More Machine Learning•More Citadel•More Data Scientist•Citadel Data Scientist•Citadel Machine Learning•Data Scientist Machine Learning
PracHub

Master your tech interviews with 7,500+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.