This question evaluates proficiency in linear regression theory, including identifiability and the sampling distribution of OLS, together with high-dimensional competencies such as regularization, variable selection, dimensionality reduction, properties of the Moore–Penrose pseudoinverse, and the statistical consequences of naive upsampling.
Consider the linear model y = Xb + ε with X ∈ R^{n×(m+1)} including an intercept column.

a) Derive the OLS estimator b̂ = (XᵀX)^{-1}Xᵀy, stating the rank condition on X required for identifiability and the sampling distribution of b̂ under the classical assumptions.

b) Now suppose m > n, so XᵀX is singular and OLS is not identifiable. Describe at least three viable approaches (e.g., ridge: b̂_ridge = (XᵀX + λI)^{-1}Xᵀy; lasso; elastic net; forward selection; PCA/PLS regression), including how you would choose λ and check generalization (cross-validation details).

c) When does the Moore–Penrose pseudoinverse give a reasonable minimum-norm solution, and what are its drawbacks?

d) Explain why naive upsampling (duplicating rows) does not resolve rank deficiency and can harm inference.
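A minimal numpy sketch for part a): with full column rank, the OLS estimator solves the normal equations XᵀX b̂ = Xᵀy and agrees with a generic least-squares routine. The data-generating coefficients and noise level here are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
# Design matrix with an explicit intercept column, so X is n x (m+1)
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + 0.1 * rng.normal(size=n)

# Closed-form OLS via the normal equations (valid because X has full column rank)
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_hat, b_lstsq))  # → True
```

Under the classical assumptions (ε ~ N(0, σ²I) independent of X), b̂ ~ N(b, σ²(XᵀX)^{-1}), which the rank condition makes well defined.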
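For part b), a sketch of ridge regression with λ chosen by k-fold cross-validation, written in plain numpy so the mechanics are visible; the candidate λ grid, fold count, and sparse true coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 100  # m > n: XᵀX is singular, plain OLS is not identifiable
X = rng.normal(size=(n, m))
beta = np.zeros(m)
beta[:5] = 2.0  # a sparse truth, chosen for illustration
y = X @ beta + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """b̂_ridge = (XᵀX + λI)^{-1} Xᵀy; well-posed for any λ > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated mean squared prediction error at a given λ."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        b = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
    return float(np.mean(errs))

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lams, key=lambda lam: cv_mse(X, y, lam))
```

The same cross-validation loop works unchanged for lasso or elastic net if `ridge_fit` is swapped for the corresponding solver; only ridge has this closed form.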
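For part c), a sketch of the minimum-norm property: in the underdetermined case the pseudoinverse picks, among the infinitely many exact solutions, the one orthogonal to the null space of X. Dimensions here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 25  # underdetermined: infinitely many exact solutions
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

b_pinv = np.linalg.pinv(X) @ y  # Moore–Penrose (minimum-norm) solution

# It interpolates the data exactly ...
print(np.allclose(X @ b_pinv, y))
# ... and every other exact solution is b_pinv + v with v in the null
# space of X, hence has strictly larger Euclidean norm, since b_pinv
# lies in the row space of X.
```

The drawback is that "smallest ℓ2 norm" is a purely algebraic choice: it interpolates the noise, gives no variance control, and is sensitive to near-zero singular values unless a tolerance is used.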
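For part d), a sketch showing that duplicating rows adds no information: the row space, and hence the rank, of X is unchanged, while naive formulas treat the copies as fresh observations. The duplication factor is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 10  # rank-deficient to begin with: rank ≤ n < m
X = rng.normal(size=(n, m))

X_up = np.vstack([X, X, X])  # naive upsampling: duplicate every row 3x

print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_up))  # ranks are equal

# XᵀX merely gets rescaled (X_upᵀX_up = 3·XᵀX), so it stays singular,
# while naive standard-error formulas shrink by √3 without any new data.
print(np.allclose(X_up.T @ X_up, 3 * X.T @ X))  # → True
```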