Improve Model Generalization with Cross-Validation and Feature Engineering

This question evaluates a data scientist's practical competence in supervised machine learning, covering stratified train/test splitting, reproducible preprocessing pipelines that standardize numeric features and robustly encode categoricals, training gradient-boosted models, and assessing discrimination with ROC AUC.

Question

Predict Next-Month Orders: Train/Test Split, Pipeline, and AUC

Context

You are given a cleaned tabular retail dataset as a pandas DataFrame df. The binary target column will_order_next_month indicates whether a customer will place an order in the following month (1 = yes, 0 = no).

Tasks

1. Split the data into 80/20 train–test sets with stratification on the target.
2. Build a reproducible scikit-learn pipeline that:
   - Standardizes numeric features.
   - One-hot encodes categorical features (robust to unseen categories at test time).
3. Train a gradient-boosted tree model (e.g., XGBoost or LightGBM).
4. Report ROC AUC on the held-out test set.
5. If AUC is low, list two techniques you would use to improve model generalization.
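
One possible way to work through the split, pipeline, training, and AUC tasks is sketched below. It is a minimal illustration under stated assumptions, not a reference solution: the toy DataFrame and its feature columns (orders_last_90d, avg_basket_value, region) are hypothetical stand-ins for df, and scikit-learn's GradientBoostingClassifier is swapped in for XGBoost or LightGBM so the sketch depends only on scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42
TARGET = "will_order_next_month"

# Hypothetical stand-in for the cleaned retail DataFrame `df`.
rng = np.random.default_rng(SEED)
n = 1000
orders = rng.poisson(3, n)
df = pd.DataFrame({
    "orders_last_90d": orders,
    "avg_basket_value": rng.gamma(2.0, 25.0, n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
# Toy target: more recent orders -> higher chance of ordering again.
prob = 1.0 / (1.0 + np.exp(-0.6 * (orders - 3)))
df[TARGET] = rng.binomial(1, prob)

X = df.drop(columns=[TARGET])
y = df[TARGET]
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# 80/20 split, stratified on the target, with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Scale numerics; one-hot encode categoricals, encoding unseen categories as all zeros.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("gbt", GradientBoostingClassifier(random_state=SEED)),
])
model.fit(X_train, y_train)

# ROC AUC needs scores, so use the predicted probability of class 1.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test ROC AUC: {test_auc:.3f}")
```

Because the scaler and encoder are fitted inside the pipeline, their statistics come from the training split only, which avoids leaking test-set information into preprocessing.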

Hints

Demonstrate scikit-learn pipelines and proper evaluation.
Use ColumnTransformer to preprocess numerics and categoricals in one pipeline.
Ensure reproducibility with fixed random seeds.
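
For the final task, the two techniques named in the title (cross-validation and feature engineering) could be combined roughly as follows. This is only a sketch: the toy data, the column names total_spend and n_orders, the derived spend_per_order feature, and the hyperparameter values are all hypothetical, and GradientBoostingClassifier again stands in for XGBoost or LightGBM.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42
TARGET = "will_order_next_month"

# Hypothetical toy data; the real columns of `df` are not specified.
rng = np.random.default_rng(SEED)
n = 600
toy = pd.DataFrame({
    "total_spend": rng.gamma(2.0, 50.0, n),
    "n_orders": rng.integers(1, 10, n),
})
# Toy target driven by spend per order plus noise.
ratio = toy["total_spend"] / toy["n_orders"]
toy[TARGET] = (ratio + rng.normal(0, 10, n) > ratio.median()).astype(int)


def cv_auc(X, y, seed=SEED):
    """Mean and std of ROC AUC over 5 stratified, shuffled folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Shallow trees, a small learning rate, and row subsampling act as regularization.
    model = GradientBoostingClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.05,
        subsample=0.8, random_state=seed,
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()


y = toy[TARGET]
base = toy[["total_spend", "n_orders"]]
# Feature engineering: add a ratio feature computed row by row (no leakage).
engineered = base.assign(spend_per_order=base["total_spend"] / base["n_orders"])

base_mean, base_std = cv_auc(base, y)
eng_mean, eng_std = cv_auc(engineered, y)
print(f"Baseline   CV AUC: {base_mean:.3f} +/- {base_std:.3f}")
print(f"Engineered CV AUC: {eng_mean:.3f} +/- {eng_std:.3f}")
```

Comparing fold-averaged AUC rather than a single test score gives a steadier basis for deciding whether a new feature or setting helps. In a fuller version, the preprocessing pipeline would be passed to cross_val_score in place of the bare model so that scaling and encoding are refitted within each fold.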
