Improve Model Generalization with Cross-Validation and Feature Engineering
Predict Next-Month Orders: Train/Test Split, Pipeline, and AUC
Context
You are given a cleaned tabular retail dataset as a pandas DataFrame df. The binary target column will_order_next_month indicates whether a customer will place an order in the following month (1 = yes, 0 = no).
Tasks
- Split the data into 80/20 train–test sets with stratification on the target.
- Build a reproducible scikit-learn pipeline that:
  - Standardizes numeric features.
  - One-hot encodes categorical features (robust to unseen categories at test time).
- Train a gradient-boosted tree model (e.g., XGBoost or LightGBM).
- Report ROC AUC on the held-out test set.
- If AUC is low, list two techniques you would use to improve model generalization.
Hints
- Demonstrate scikit-learn pipelines and proper evaluation.
- Use ColumnTransformer to preprocess numerics and categoricals in one pipeline.
- Ensure reproducibility with fixed random seeds.
Constraints & Assumptions
- Preserve the scope, facts, inputs, and requested outputs from the prompt above.
- If the prompt leaves a detail unspecified, state a reasonable assumption before relying on it.
- Keep the answer interview-ready: concise enough to present, but concrete enough to implement or evaluate.
Clarifying Questions to Ask
- Clarify the task, data shape, labels, constraints, and evaluation metric.
- State assumptions behind the math or modeling technique you choose.
- Connect theory to practical training, debugging, and deployment implications.
What a Strong Answer Covers
- Correct definitions and formulas where the prompt requires them.
- A practical explanation of how the method behaves on real data.
- Trade-offs, failure modes, diagnostics, and mitigation strategies.
- Evaluation choices that match the product or modeling objective.
Follow-up Questions
- How would noisy labels, class imbalance, or distribution shift affect the answer?
- What would you monitor after deployment?
- Which baseline would you compare against first?