Task: Baseline Linear Regression Pipeline (Python)
Context
You are given a tabular dataset in a pandas DataFrame df. The goal is to predict a continuous target column target. Build a leakage-safe baseline linear regression pipeline and report performance and coefficients.
Requirements
- Split the data into train/validation sets (e.g., 80/20) using a fixed random seed.
- Avoid data leakage: fit imputers/encoders/scalers only on training data by using a single sklearn Pipeline.
- Preprocess features:
  - Impute missing values: median for numeric, most_frequent for categorical.
  - One-hot encode categorical features (drop one level to avoid multicollinearity; ignore unknown categories at validation).
  - Scale numeric features (standardization).
- Train an ordinary least squares linear regression model.
- Evaluate on the validation set using RMSE and R^2.
- Output:
  - Fitted coefficients with feature names (and intercept).
  - Key metrics (RMSE, R^2).
  - A brief interpretation of results.
Assumptions
- DataFrame: df with target column named target; all remaining columns are features.
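
It can help to check this assumption before modeling. The helper below is a hypothetical sketch (the name split_feature_types and the specific checks are not part of the task statement): it confirms that the target column exists, is numeric, and has no missing values, then splits the remaining columns into numeric and categorical by dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def split_feature_types(df, target="target"):
    """Validate the assumptions and return (numeric_cols, categorical_cols)."""
    if target not in df.columns:
        raise KeyError(f"expected a column named {target!r}")
    if not is_numeric_dtype(df[target]):
        raise TypeError("target must be numeric for regression")
    if df[target].isna().any():
        raise ValueError("target contains missing values; drop those rows first")

    # All remaining columns are features; dtype decides the type (an assumption).
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in features.columns if c not in numeric_cols]
    return numeric_cols, categorical_cols
```

Integer-coded categories (for example a zip code stored as int) would land in the numeric list under this rule, so the result is worth a quick manual review.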
Deliverable
Provide clean, runnable Python code plus a short interpretation of the metrics and coefficients.