This question evaluates competence in statistical modeling and causal inference, covering regression diagnostics, feature engineering and interactions, appropriate error distributions and link functions, leakage detection, model selection and validation, and the trade-off between predictive accuracy and valid effect estimation.
You are modeling contribution per order (a continuous per-order outcome such as margin or profit contribution) with linear regression. The current model achieves R² = 0.07, so it explains only about 7% of the variance in the outcome, which indicates weak predictive performance. You care about both prediction accuracy and valid inference on key covariates (e.g., treatment effects, policy variables).
(a) List concrete, practical steps to raise predictive performance without invalidating inference. Include:
(b) Will simply adding another covariate reliably increase R² out-of-sample? Use cross-validation (CV) to demonstrate why or why not, and propose alternatives (GAMs, quantile regression, gradient boosting) that balance predictive performance with effect-estimation goals.
(c) Show how to use nested cross-validation and target-leakage tests to guard against p-hacking while iterating on features/hyperparameters.
(d) Explain when a low R² is acceptable for an unbiased average treatment effect (ATE) but unacceptable for accurate individual predictions.
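As a starting point for part (b), the sketch below compares in-sample R² with 5-fold cross-validated R² when covariates unrelated to the outcome are added. Everything in it is an assumption made for illustration: the simulated data (true R² near 0.07), the 40 pure-noise columns (several are used instead of one so that the effect is visible in a single run), and the helper names `ols_fit`, `predict`, `r2`, and `cv_r2`. It uses only the Python standard library and is not a reference solution.

```python
import random
import statistics


def ols_fit(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(X, beta):
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def r2(y, yhat):
    ybar = statistics.fmean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def cv_r2(X, y, k=5):
    """R^2 of pooled out-of-fold predictions (interleaved folds, i.i.d. rows)."""
    n = len(y)
    oof = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        beta = ols_fit([X[i] for i in train], [y[i] for i in train])
        for i, pred in zip(test, predict([X[i] for i in test], beta)):
            oof[i] = pred
    return r2(y, oof)


random.seed(0)
n, n_noise = 300, 40  # assumed sizes, chosen only for the illustration
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 0.8 * xi + random.gauss(0, 3) for xi in x]  # weak true signal

X_base = [[1.0, xi] for xi in x]  # intercept + one real covariate
X_noise = [row + [random.gauss(0, 1) for _ in range(n_noise)] for row in X_base]

r2_in_base = r2(y, predict(X_base, ols_fit(X_base, y)))
r2_in_noise = r2(y, predict(X_noise, ols_fit(X_noise, y)))
r2_cv_base = cv_r2(X_base, y)
r2_cv_noise = cv_r2(X_noise, y)

print(f"in-sample R^2: base={r2_in_base:.3f}  with noise columns={r2_in_noise:.3f}")
print(f"5-fold CV R^2: base={r2_cv_base:.3f}  with noise columns={r2_cv_noise:.3f}")
```

In-sample R² cannot decrease when columns are added to an OLS model, so it is not evidence that a covariate helps. The cross-validated figure is the one to compare, and in this setup it is expected to fall once the noise columns are included.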
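For part (d), a small Monte Carlo sketch can make the contrast concrete. It assumes a randomized binary treatment with a true effect of 1.0 and outcome noise with standard deviation 10; these numbers, and the function name `one_experiment`, are hypothetical choices for illustration, not values taken from the question. With a binary treatment and no other covariates, the OLS slope equals the difference in group means, which is what the code computes.

```python
import random
import statistics

random.seed(1)
TRUE_ATE, NOISE_SD = 1.0, 10.0   # assumed effect size and noise level
n, n_sims = 2000, 200            # orders per experiment, number of replications


def one_experiment():
    """Simulate one randomized experiment; return (ATE estimate, R^2, RMSE)."""
    t = [random.random() < 0.5 for _ in range(n)]
    y = [5.0 + TRUE_ATE * ti + random.gauss(0, NOISE_SD) for ti in t]
    mean_t = statistics.fmean(yi for yi, ti in zip(y, t) if ti)
    mean_c = statistics.fmean(yi for yi, ti in zip(y, t) if not ti)
    ate_hat = mean_t - mean_c  # same as the OLS slope in y ~ 1 + t
    # Fitted values of y ~ 1 + t are the two group means
    ybar = statistics.fmean(y)
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    ss_res = sum((yi - (mean_t if ti else mean_c)) ** 2 for yi, ti in zip(y, t))
    return ate_hat, 1 - ss_res / ss_tot, (ss_res / n) ** 0.5


results = [one_experiment() for _ in range(n_sims)]
mean_ate = statistics.fmean(r[0] for r in results)
sd_ate = statistics.stdev(r[0] for r in results)
mean_r2 = statistics.fmean(r[1] for r in results)
mean_rmse = statistics.fmean(r[2] for r in results)

print(f"mean ATE estimate over replications: {mean_ate:.3f} (true {TRUE_ATE})")
print(f"spread of ATE estimates (SD):        {sd_ate:.3f}")
print(f"mean R^2:                            {mean_r2:.4f}")
print(f"mean per-order RMSE:                 {mean_rmse:.2f}")
```

Under these assumptions the ATE estimates centre on the true effect and their spread shrinks as `n` grows, while R² stays near zero and the per-order error stays close to the noise standard deviation no matter how many orders are observed. That is the sense in which a low R² can coexist with a trustworthy average effect but not with accurate individual predictions.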