Build a baseline linear regression pipeline

This is a Machine Learning interview question from Citadel for Data Scientist roles.

Question

Task: Baseline Linear Regression Pipeline (Python)

Context

You are given a tabular dataset in a pandas DataFrame named df. The goal is to predict a continuous column named target. Build a leakage-safe baseline linear regression pipeline and report its validation performance and fitted coefficients.

Requirements

Split the data into train/validation sets (e.g., 80/20) using a fixed random seed.
Avoid data leakage: fit imputers/encoders/scalers only on training data by using a single sklearn Pipeline.
Preprocess features:
- Impute missing values: median for numeric, most_frequent for categorical.
- One-hot encode categorical features (drop one level to avoid multicollinearity; ignore unknown categories at validation, which then encode as all zeros and are indistinguishable from the dropped level).
- Scale numeric features (standardization).
Train an ordinary least squares linear regression model.
Evaluate on the validation set using RMSE and R^2.
Output:
- Fitted coefficients with feature names (and intercept).
- Key metrics (RMSE, R^2).
- A brief interpretation of results.

Assumptions

DataFrame: df with target column named target; all remaining columns are features.
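
Under this assumption, the feature columns and their types can be derived from df itself. The helper below is a hypothetical sketch: the name split_features is invented here, treating every non-numeric column as categorical is a simplification, and dropping rows with a missing target is an extra assumption the task does not state.

```python
import pandas as pd


def split_features(df, target="target"):
    """Return X, y, and the numeric / categorical column lists."""
    if target not in df.columns:
        raise KeyError(f"expected a column named {target!r}")

    # Assumption: rows without a target value cannot be used for training.
    df = df.dropna(subset=[target])

    X = df.drop(columns=[target])
    y = df[target]

    # Numeric dtypes go to the numeric branch; everything else is
    # treated as categorical (a simplification for a baseline).
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]
    return X, y, numeric_cols, categorical_cols
```

Integer-coded categories (for example a numeric store ID) would land in the numeric list under this rule, so in practice the inferred lists are worth checking by eye before fitting.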

Deliverable

Provide clean, runnable Python code plus a short interpretation of the metrics and coefficients.
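
For the interpretation part, one option is to generate a short text summary from the metrics and coefficients. The sketch below is illustrative only: the function name interpret, its arguments, and the choice to rank coefficients by absolute size are assumptions, and the metrics dict is assumed to hold "rmse" and "r2" keys. Comparing RMSE with the standard deviation of the validation target gives a rough sense of how much the model improves on always predicting the mean.

```python
import pandas as pd


def interpret(metrics, coefs, target_std, top_k=5):
    """Build a short plain-text reading of validation metrics and coefficients."""
    # Rank coefficients by absolute size; this is only meaningful across
    # features on comparable scales (e.g. standardized numeric features).
    order = coefs.abs().sort_values(ascending=False).index
    ranked = coefs.reindex(order).head(top_k)

    ratio = metrics["rmse"] / target_std
    lines = [
        f"RMSE = {metrics['rmse']:.4g} vs. target std = {target_std:.4g} "
        f"(ratio {ratio:.2f}; below 1 beats predicting the mean)",
        f"R^2 = {metrics['r2']:.3f} on the validation set",
        f"Largest {len(ranked)} coefficients by absolute size:",
    ]
    for name, value in ranked.items():
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"  {name}: {value:+.4g} ({direction} the prediction)")
    return "\n".join(lines)
```

When reading the output, a coefficient on a standardized numeric feature is the change in the predicted target per one standard deviation of that feature, and a coefficient on a one-hot column is the difference from the dropped reference level, both with the other features held fixed.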