You are given two pandas DataFrames and asked to fit an OLS model without an intercept (regression through the origin). Dataset A (features): df_X(user_id, clicks, impressions). Dataset B (target): df_y(user_id, conversions). Tiny samples:
df_X
user_id | clicks | impressions
A | 10 | 100
B | 5 | 40
C | 0 | 20
df_y
user_id | conversions
A | 4
B | 2
C | 1
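A minimal sketch of how the two samples could be loaded and inner-joined on user_id to obtain the design matrix and target. It assumes user_id is unique in both frames; the variable name `merged` is illustrative and not part of the problem statement.

```python
import pandas as pd

# Sample rows copied from the problem statement.
df_X = pd.DataFrame(
    {"user_id": ["A", "B", "C"], "clicks": [10, 5, 0], "impressions": [100, 40, 20]}
)
df_y = pd.DataFrame({"user_id": ["A", "B", "C"], "conversions": [4, 2, 1]})

# Inner join keeps only users present in both frames.
# validate="one_to_one" raises MergeError if user_id repeats on either side
# (assumption: one row per user is intended).
merged = df_X.merge(df_y, on="user_id", how="inner", validate="one_to_one")

# Feature matrix X (n x p) and target vector y (n,), with rows aligned by the join.
X = merged[["clicks", "impressions"]].to_numpy(dtype=float)
y = merged["conversions"].to_numpy(dtype=float)
```

Taking X and y from the same merged frame keeps the rows aligned, which is easy to get wrong if the two frames are sorted or filtered separately.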
Tasks:
1) Inner-join on user_id to build X ∈ R^{n×p} with columns [clicks, impressions] and y ∈ R^{n}. Fit the no-intercept model y = Xβ via: (i) the normal equations β̂ = (XᵀX)^{-1}Xᵀy; and (ii) a numerically stable factorization (QR or SVD). Show the intermediate matrices (XᵀX, Xᵀy) and the final β̂.
2) Compute R² for the no-intercept model using the correct (uncentered) definition, with TSS = ∑ y_i^2. Explain why R² can be negative when the centered TSS ∑ (y_i − ȳ)^2 is applied to a no-intercept fit, and compare against the standard R² of a model with an intercept.
3) Discuss when omitting the intercept is appropriate; verify by mean-centering X and y and refitting both with and without an intercept, commenting on how the coefficients change.
4) If XᵀX is singular or ill-conditioned (e.g., clicks collinear with impressions), detect this and refit using ridge regression with λ = 1e−3; give the closed form β̂_ridge = (XᵀX + λI)^{-1}Xᵀy and compute it on the sample.
5) State how you would validate the model (residual diagnostics, cross-validation) and how you would guard against data leakage when constructing df_X and df_y.
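The fitting steps can be sketched with NumPy as follows. This is one possible working under stated assumptions, not a reference solution: X and y are typed in directly from the sample (already joined on user_id), and all variable names are illustrative.

```python
import numpy as np

# Joined sample: columns are [clicks, impressions]; y is conversions.
X = np.array([[10.0, 100.0], [5.0, 40.0], [0.0, 20.0]])
y = np.array([4.0, 2.0, 1.0])

# (i) Normal equations. solve() avoids forming the explicit inverse.
XtX = X.T @ X                        # [[125, 1200], [1200, 12000]]
Xty = X.T @ y                        # [50, 500]
beta_ne = np.linalg.solve(XtX, Xty)  # approximately [0, 1/24]

# (ii) Reduced QR: X = QR, so R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD-based least squares as a cross-check.
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 for the no-intercept fit uses the uncentered TSS = sum(y_i^2).
resid = y - X @ beta_qr
rss = resid @ resid
r2_uncentered = 1.0 - rss / (y @ y)  # about 0.9921 on this sample
# Same RSS against the centered TSS, shown only for comparison.
r2_centered = 1.0 - rss / np.sum((y - y.mean()) ** 2)

# Conditioning check: a very large value would signal near-collinearity.
cond_XtX = np.linalg.cond(XtX)

# Ridge closed form with lambda = 1e-3.
lam = 1e-3
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), Xty)
```

On this sample the least-squares solution places essentially all of the weight on impressions, and XᵀX is not ill-conditioned, so the ridge estimate with λ = 1e−3 differs only negligibly from the OLS one. The ridge step matters when the columns are (nearly) collinear, where np.linalg.solve on XᵀX alone would fail or return unstable values.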