Perform no-intercept linear regression from two datasets

This is a Machine Learning interview question from Two Sigma for Data Scientist roles.

Question

You are given two pandas DataFrames and asked to fit an OLS model without an intercept (regression through the origin). Dataset A (features): df_X(user_id, clicks, impressions). Dataset B (target): df_y(user_id, conversions). Tiny samples:

df_X

user_id | clicks | impressions
A | 10 | 100
B | 5 | 40
C | 0 | 20

df_y

user_id | conversions
A | 4
B | 2
C | 1
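As a starting point, the two sample tables can be rebuilt as pandas DataFrames and inner-joined on user_id to obtain X and y. This is only a minimal sketch: the variable name `joined` and the `validate="one_to_one"` check are illustrative choices, not part of the question.

```python
import pandas as pd

# Sample data copied from the question.
df_X = pd.DataFrame({
    "user_id": ["A", "B", "C"],
    "clicks": [10, 5, 0],
    "impressions": [100, 40, 20],
})
df_y = pd.DataFrame({
    "user_id": ["A", "B", "C"],
    "conversions": [4, 2, 1],
})

# Inner join keeps only users present in both tables.
# validate="one_to_one" raises if user_id is duplicated on either side,
# which would otherwise silently multiply rows.
joined = df_X.merge(df_y, on="user_id", how="inner", validate="one_to_one")

# Design matrix with columns [clicks, impressions] and the target vector.
X = joined[["clicks", "impressions"]].to_numpy(dtype=float)  # shape (n, p)
y = joined["conversions"].to_numpy(dtype=float)              # shape (n,)

print(X.shape, y.shape)
```

Taking X and y from the same joined frame keeps the rows aligned, so there is no need to sort the two tables separately.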

Tasks:

1) Inner-join on user_id to build X ∈ R^{n×p} with columns [clicks, impressions] and y ∈ R^{n}. Fit the no-intercept model y = Xβ via: (i) the normal equations β̂ = (XᵀX)^{-1}Xᵀy; and (ii) a numerically stable factorization (QR or SVD). Show the intermediate matrices (XᵀX, Xᵀy) and the final β̂.
2) Compute R² for the no-intercept model using the correct definition (with TSS = ∑ y_i^2). Explain why this R² can be negative and compare it against the standard R² of a model with an intercept.
3) Discuss when omitting the intercept is appropriate; verify by mean-centering X and y and refitting both with and without an intercept, commenting on how the coefficients change.
4) If XᵀX is singular or ill-conditioned (e.g., clicks collinear with impressions), detect this and refit using ridge regression with λ = 1e−3; give the closed-form β̂_ridge = (XᵀX + λI)^{-1}Xᵀy and compute it on the sample.
5) State how you would validate the model (residual diagnostics, cross-validation) and how you would guard against data leakage when constructing df_X and df_y.
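One possible way to work tasks 1, 2 and 4 on the sample is sketched below with NumPy. It is not the site's reference solution. The arrays are typed in directly from the joined sample (rows A, B, C) so the block runs on its own, and all variable names are illustrative. By hand, XᵀX = [[125, 1200], [1200, 12000]] and Xᵀy = [50, 500], which gives β̂ = [0, 1/24].

```python
import numpy as np

# Joined sample from the question: columns [clicks, impressions], rows A, B, C.
X = np.array([[10.0, 100.0], [5.0, 40.0], [0.0, 20.0]])
y = np.array([4.0, 2.0, 1.0])

# (i) Normal equations: solve (X^T X) beta = X^T y without forming the inverse.
XtX = X.T @ X                      # [[125, 1200], [1200, 12000]]
Xty = X.T @ y                      # [50, 500]
beta_ne = np.linalg.solve(XtX, Xty)

# (ii) QR: X = QR (reduced form), then solve the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# (ii, alternative) SVD-based least squares; also returns rank and singular values.
beta_svd, _, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

# R^2 with the uncentered TSS (sum of y_i^2) and, for comparison, the centered TSS.
resid = y - X @ beta_qr
rss = resid @ resid
r2_uncentered = 1.0 - rss / (y @ y)
r2_centered = 1.0 - rss / np.sum((y - y.mean()) ** 2)

# Conditioning check: rank < p or a very large condition number signals collinearity.
cond_X = np.linalg.cond(X)

# Ridge closed form with lambda = 1e-3 (no intercept, so both coefficients are penalized).
lam = 1e-3
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), Xty)

print("beta (normal eq.):", beta_ne)      # approximately [0, 0.041667]
print("beta (QR):        ", beta_qr)
print("beta (SVD):       ", beta_svd)
print("R2 uncentered:    ", r2_uncentered)  # approximately 0.99206 (125/126)
print("R2 centered:      ", r2_centered)    # approximately 0.96429 (27/28)
print("rank, cond(X):    ", rank, cond_X)
print("beta (ridge):     ", beta_ridge)
```

On this sample X has full column rank, so the ridge estimate with λ = 1e−3 is nearly identical to the OLS estimate. Two points are worth keeping in mind when discussing R²: for an OLS fit through the origin, RSS cannot exceed ∑ y_i², so the uncentered R² stays in [0, 1], and it is the centered-TSS version that can turn negative when the no-intercept fit is worse than predicting the mean. Also, with only three rows, a with-intercept model on two features has three parameters and fits the sample exactly, so its R² of 1 says nothing about generalization.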
