Build a regularized regression pipeline

This question evaluates a candidate's ability to build an end‑to‑end regularized regression pipeline in scikit‑learn. It covers feature engineering and polynomial expansion, categorical encoding, target‑transform diagnostics, selection and tuning of penalized linear models (Ridge, Lasso, ElasticNet), cross‑validation strategies including TimeSeriesSplit, leakage prevention, nested evaluation, feature effect ranking, and model persistence. It is commonly asked in Machine Learning interviews for Data Scientist roles and requires both hands‑on implementation skill and conceptual understanding of validation design, target transformations, and interpretability.

Question

Context

You are given a cleaned tabular dataset with marketing and product metrics. Your goal is to predict daily signups using a robust, leakage‑free modeling workflow in scikit‑learn.

Assume the dataset is a pandas DataFrame with at least these columns:

spend (float)
clicks (float)
cpc (float)
region (categorical string)
date (datetime or string parsable to datetime)
signups (non‑negative integer/float target)

Task

Build a model to predict signups from spend, clicks, cpc, region, and a time trend extracted from date. Implement the following in scikit‑learn:

ColumnTransformer
- Numeric features: spend, clicks, cpc, time_index
  - Apply PolynomialFeatures(degree=2, include_bias=False)
  - Then StandardScaler()
- Categorical feature: region → OneHotEncoder(handle_unknown='ignore')
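
One possible way to wire up the preprocessing described above, as a minimal sketch. The `TimeIndexAdder` class is a hypothetical helper (not part of scikit‑learn) that derives `time_index` as days since the earliest date seen during `fit`, so the reference date is learned from training data only. The demo data is synthetic and exists only to show the output shape.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler


class TimeIndexAdder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: adds time_index = days since the earliest fitted date."""

    def __init__(self, date_col="date"):
        self.date_col = date_col

    def fit(self, X, y=None):
        # The origin is learned from the data passed to fit (the training fold).
        self.origin_ = pd.to_datetime(X[self.date_col]).min()
        return self

    def transform(self, X):
        X = X.copy()
        dates = pd.to_datetime(X[self.date_col])
        X["time_index"] = (dates - self.origin_).dt.days.astype(float)
        return X


numeric_cols = ["spend", "clicks", "cpc", "time_index"]
numeric_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])
preprocess = Pipeline([
    ("time_index", TimeIndexAdder()),
    ("columns", ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])),
])

# Synthetic demo data (an assumption for illustration, not the real dataset).
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, n),
    "clicks": rng.poisson(200, n).astype(float),
    "cpc": rng.uniform(0.2, 2.0, n),
    "region": np.tile(["NA", "EU", "APAC"], n // 3),
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
})
Xt = preprocess.fit_transform(demo)
print(Xt.shape)  # 14 polynomial terms from 4 numeric columns + 3 one-hot columns
```

Because the date handling lives in a pipeline step, each cross‑validation fold recomputes the origin from its own training portion.
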
Target transform decision
- Make a quick diagnostic to decide whether to apply log1p to the target (e.g., via cross‑validated RMSE comparison on a simple baseline). Use TransformedTargetRegressor if log1p is selected.
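
The diagnostic could look roughly like the sketch below. It compares cross‑validated RMSE for a plain Ridge baseline against the same baseline wrapped in `TransformedTargetRegressor` with `log1p`/`expm1`. Both scores are on the original signups scale, because the wrapper inverts predictions before scoring. The baseline is deliberately simplified to three numeric columns, and the data is synthetic; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with multiplicative noise (illustrative assumption).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, n),
    "clicks": rng.poisson(200, n).astype(float),
    "cpc": rng.uniform(0.2, 2.0, n),
})
y = np.round(np.exp(2.0 + 0.004 * X["spend"] + rng.normal(0, 0.3, n)))

baseline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
logged = TransformedTargetRegressor(
    regressor=baseline, func=np.log1p, inverse_func=np.expm1
)


def cv_rmse(model, X, y, cv):
    """Mean cross-validated RMSE on the original target scale."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


cv = KFold(n_splits=5, shuffle=True, random_state=0)
rmse_raw = cv_rmse(baseline, X, y, cv)
rmse_log = cv_rmse(logged, X, y, cv)
use_log1p = rmse_log < rmse_raw  # keep the transform only if it helps
print(f"raw={rmse_raw:.2f} log1p={rmse_log:.2f} use_log1p={use_log1p}")
```

If temporal dependence is suspected, the `KFold` splitter here would be swapped for `TimeSeriesSplit` so the diagnostic uses the same validation design as the final model.
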
Model selection and tuning
- Build a Pipeline with preprocessing and a linear penalized model.
- Tune Ridge, Lasso, and ElasticNet via GridSearchCV (5‑fold CV by default).
- If temporal dependence is detected, switch both inner and outer CV to TimeSeriesSplit.
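
A sketch of the tuning step follows. It swaps the final estimator through a list of parameter grids, which lets one `GridSearchCV` compare Ridge, Lasso, and ElasticNet. The lag‑1 autocorrelation check and its 0.3 threshold are assumptions standing in for whatever temporal‑dependence test is preferred, the alpha grids are arbitrary, and preprocessing is abbreviated to the raw numeric columns plus `region`. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic, date-ordered data with a linear trend (illustrative assumption).
rng = np.random.default_rng(0)
n = 150
X = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, n),
    "clicks": rng.poisson(200, n).astype(float),
    "cpc": rng.uniform(0.2, 2.0, n),
    "region": np.tile(["NA", "EU", "APAC"], n // 3),
})
y = (20 + 0.1 * X["spend"] + 0.05 * X["clicks"]
     + 0.2 * np.arange(n) + rng.normal(0, 5, n))

# Crude temporal-dependence check; the 0.3 threshold is an arbitrary assumption.
temporal = abs(y.autocorr(lag=1)) > 0.3
cv = TimeSeriesSplit(n_splits=5) if temporal else KFold(5, shuffle=True, random_state=0)

numeric = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", numeric, ["spend", "clicks", "cpc"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])),
    ("model", Ridge()),  # placeholder, replaced by the grids below
])
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [Lasso(max_iter=10_000)], "model__alpha": [0.01, 0.1, 1.0]},
    {"model": [ElasticNet(max_iter=10_000)],
     "model__alpha": [0.01, 0.1, 1.0],
     "model__l1_ratio": [0.2, 0.5, 0.8]},
]
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_rmse = -search.best_score_
print(type(search.best_estimator_.named_steps["model"]).__name__, round(best_rmse, 2))
```
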
Evaluation and leakage control
- Create a held‑out test split before any CV (chronological split if temporal dependence, otherwise random 80/20).
- Ensure all preprocessing (including time_index creation) is inside the pipeline to prevent leakage.
- Use RMSE as the primary metric; also report R^2 on the test set.
- Use nested CV to estimate generalization RMSE.
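
The evaluation design above might be sketched as follows, assuming temporal dependence was detected. The data is split chronologically before any CV, nested CV is run on the training portion only (a `GridSearchCV` with an inner `TimeSeriesSplit`, scored by an outer `TimeSeriesSplit`), and RMSE and R^2 are then reported on the held‑out test set. The model is reduced to a scaled Ridge and the data is synthetic; both are simplifying assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, n),
    "clicks": rng.poisson(200, n).astype(float),
    "cpc": rng.uniform(0.2, 2.0, n),
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
})
df["signups"] = 20 + 0.1 * df["spend"] + 0.05 * df["clicks"] + rng.normal(0, 5, n)

# Chronological 80/20 split made before any cross-validation.
df = df.sort_values("date").reset_index(drop=True)
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]
features = ["spend", "clicks", "cpc"]

inner = TimeSeriesSplit(n_splits=3)   # tunes hyperparameters
outer = TimeSeriesSplit(n_splits=4)   # estimates generalization error
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    {"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=inner,
    scoring="neg_root_mean_squared_error",
)

# Nested CV: each outer fold reruns the full inner search on its training part.
nested_rmse = -cross_val_score(
    search, train[features], train["signups"],
    cv=outer, scoring="neg_root_mean_squared_error",
)

# Final fit on all training data, then a single look at the test set.
search.fit(train[features], train["signups"])
pred = search.predict(test[features])
test_rmse = float(np.sqrt(mean_squared_error(test["signups"], pred)))
test_r2 = r2_score(test["signups"], pred)
print(f"nested RMSE={nested_rmse.mean():.2f} test RMSE={test_rmse:.2f} R^2={test_r2:.3f}")
```
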
Feature effects
- After selecting the best model, extract and rank the top 10 features by absolute standardized effect size.
- Map one‑hot feature names back to human‑readable form (e.g., region=NA; interaction terms readable).
Persistence
- Save the trained model to disk and show code to reload and score new data.
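
Persistence could be handled with `joblib`, as in this sketch. The file name `signup_model.joblib` is a hypothetical choice, a temporary directory stands in for a real model store, and the fitted pipeline is a simplified stand‑in trained on synthetic data. The new rows include a region not seen in training to show that `handle_unknown='ignore'` lets scoring proceed.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training data (illustrative assumption).
rng = np.random.default_rng(0)
n = 90
X = pd.DataFrame({
    "spend": rng.gamma(2.0, 50.0, n),
    "clicks": rng.poisson(200, n).astype(float),
    "cpc": rng.uniform(0.2, 2.0, n),
    "region": np.tile(["NA", "EU", "APAC"], n // 3),
})
y = 20 + 0.1 * X["spend"] + 0.05 * X["clicks"] + rng.normal(0, 5, n)

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["spend", "clicks", "cpc"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ])),
    ("model", Ridge(alpha=1.0)),
]).fit(X, y)

# New rows to score; "LATAM" never appeared during training.
new_data = pd.DataFrame({
    "spend": [120.0, 80.0],
    "clicks": [210.0, 190.0],
    "cpc": [0.9, 1.1],
    "region": ["EU", "LATAM"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "signup_model.joblib"  # hypothetical file name
    joblib.dump(model, path)        # save the whole fitted pipeline
    reloaded = joblib.load(path)    # reload it as a ready-to-use estimator

preds = reloaded.predict(new_data)
print(preds)
```

Saving the full pipeline rather than only the final estimator keeps preprocessing and model in one artifact. A joblib file should generally be loaded with the same scikit‑learn version that produced it.
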
Rationale
- Provide a concise rationale for each major modeling choice.
