# Choose Between Linear Regression and a Decision Tree Under a Hinge and Interaction DGP
## Context
You have 100,000 i.i.d. observations with features x1 (range 0–100), x2, x3, and target y. The true data-generating process (unknown to you) is piecewise linear with a hinge at x1 = 50 and an interaction between x2 and x3:
- y ≈ 3·x1 + 20·I[x1 > 50] + 2·(x2·x3) + ε
- Heteroskedastic noise: Var(ε | x1) = 0.01·(1 + x1)
## Task
Design an analysis to decide between linear regression and a decision tree. Specify:
- Feature engineering and linearity tests (e.g., a spline or hinge basis for x1, an explicit x2:x3 interaction term), and the residual diagnostics you would use to check for heteroskedasticity.
- A fair comparison protocol (CV splits, identical preprocessing applied inside each fold) and evaluation metrics.
- How you would enforce monotonicity or interaction constraints in a tree-based model to reflect domain knowledge.
- Which model you expect to generalize better here and why, including bias–variance reasoning and how you would quantify it with learning curves.
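For the first point, one possible sketch: fit OLS on engineered features (a step indicator at x1 = 50 plus the x2·x3 product), then run a Breusch–Pagan test on the residuals. The data here is simulated from the stated DGP purely for illustration; the feature set and test choice are one reasonable option, not the only one:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated stand-in for the real data (assumption: x2, x3 ~ N(0, 1)).
rng = np.random.default_rng(1)
n = 10_000
x1 = rng.uniform(0, 100, n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 3*x1 + 20*(x1 > 50) + 2*x2*x3 + rng.normal(scale=np.sqrt(0.01*(1 + x1)), size=n)

# Design matrix: intercept, x1, step indicator I[x1 > 50], interaction x2*x3.
X = sm.add_constant(np.column_stack([x1, x1 > 50, x2 * x3]))
fit = sm.OLS(y, X).fit()

# Breusch-Pagan regresses squared residuals on the regressors;
# a small p-value flags heteroskedasticity (expected here by construction).
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
```

If the engineered features capture the DGP, R² should be near 1 while the Breusch–Pagan p-value is near 0, cleanly separating "wrong mean function" from "non-constant variance".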
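For the comparison protocol, one fair setup: identical CV folds, preprocessing wrapped in a pipeline so it is fit inside each fold, and one shared metric (RMSE here). Hyperparameters and the simulated data are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Simulated stand-in for the real data (assumption: x2, x3 ~ N(0, 1)).
rng = np.random.default_rng(2)
n = 20_000
x1 = rng.uniform(0, 100, n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 3*x1 + 20*(x1 > 50) + 2*x2*x3 + rng.normal(scale=np.sqrt(0.01*(1 + x1)), size=n)
X = np.column_stack([x1, x2, x3])

def engineer(Z):
    # Step indicator for x1 and the x2*x3 interaction, applied inside each fold.
    return np.column_stack([Z, Z[:, 0] > 50, Z[:, 1] * Z[:, 2]])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
lin = make_pipeline(FunctionTransformer(engineer), LinearRegression())
tree = DecisionTreeRegressor(max_depth=8, random_state=0)  # depth is an arbitrary choice

lin_rmse = -cross_val_score(lin, X, y, cv=cv,
                            scoring="neg_root_mean_squared_error").mean()
tree_rmse = -cross_val_score(tree, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error").mean()
```

Reusing the same `cv` object for both estimators guarantees both models see exactly the same train/validation splits.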
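For the constraint point, scikit-learn's `HistGradientBoostingRegressor` exposes `monotonic_cst` (per-feature monotonicity) and `interaction_cst` (which features may split together). The sketch below assumes simulated data and encodes the domain knowledge that y is non-decreasing in x1 and that only x2 and x3 interact:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Simulated stand-in for the real data (assumption: x2, x3 ~ N(0, 1)).
rng = np.random.default_rng(3)
n = 20_000
x1 = rng.uniform(0, 100, n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 3*x1 + 20*(x1 > 50) + 2*x2*x3 + rng.normal(scale=np.sqrt(0.01*(1 + x1)), size=n)
X = np.column_stack([x1, x2, x3])

model = HistGradientBoostingRegressor(
    monotonic_cst=[1, 0, 0],        # predictions must be non-decreasing in x1
    interaction_cst=[{0}, {1, 2}],  # x1 splits alone; x2 and x3 may interact
    random_state=0,
).fit(X, y)

# Check the constraint: sweep x1 with x2 = x3 = 0 held fixed.
grid = np.column_stack([np.linspace(0, 100, 200), np.zeros(200), np.zeros(200)])
pred = model.predict(grid)
```

The monotone sweep is worth keeping as a regression test: it verifies the constraint holds on the fitted model rather than trusting the flag.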
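For the bias–variance point, `sklearn.model_selection.learning_curve` quantifies the expected pattern: the correctly specified linear model (low variance, near-zero bias after feature engineering) should plateau at the noise floor with little data, while an unconstrained tree (piecewise-constant, so biased at any finite depth and high-variance when deep) improves only gradually with sample size. Models, sizes, and the simulated data are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Simulated stand-in for the real data (assumption: x2, x3 ~ N(0, 1)).
rng = np.random.default_rng(4)
n = 20_000
x1 = rng.uniform(0, 100, n)
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 3*x1 + 20*(x1 > 50) + 2*x2*x3 + rng.normal(scale=np.sqrt(0.01*(1 + x1)), size=n)
X = np.column_stack([x1, x2, x3])

def engineer(Z):
    # Step indicator for x1 plus the x2*x3 interaction.
    return np.column_stack([Z, Z[:, 0] > 50, Z[:, 1] * Z[:, 2]])

models = {
    "linear": make_pipeline(FunctionTransformer(engineer), LinearRegression()),
    "tree": DecisionTreeRegressor(random_state=0),  # fully grown, deliberately high-variance
}
val_rmse = {}
for name, est in models.items():
    _, _, val_scores = learning_curve(
        est, X, y, train_sizes=[0.1, 0.3, 1.0], cv=5,
        scoring="neg_root_mean_squared_error", shuffle=True, random_state=0,
    )
    val_rmse[name] = -val_scores.mean(axis=1)
```

A flat, low linear curve alongside a slowly descending tree curve is the concrete evidence that the linear model's validation error is variance-limited by the noise floor while the tree's is still approximation-limited at this sample size.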