What difficulty level is this interview question?

This is a Medium difficulty Machine Learning question, commonly asked during Take-home Project rounds at Two Sigma.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Two Sigma during technical interviews.

Perform no-intercept linear regression from two datasets

Quick Overview

This question evaluates linear regression modeling, numerical linear algebra (normal equations and stable factorization methods), regularization, R² interpretation, and data-preprocessing/validation skills within the Machine Learning domain for a Data Scientist role.

You are given two pandas DataFrames and asked to fit an OLS model without an intercept (regression through the origin). Dataset A (features): df_X(user_id, clicks, impressions). Dataset B (target): df_y(user_id, conversions). Tiny samples:

df_X (columns: user_id, clicks, impressions)

user_id | clicks | impressions
A | 10 | 100
B | 5 | 40
C | 0 | 20

df_y (columns: user_id, conversions)

user_id | conversions
A | 4
B | 2
C | 1
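The sample tables can be reproduced in pandas and joined as follows. This is a minimal sketch: the variable names `merged`, `X`, and `y` are illustrative choices rather than part of the question, and the `validate="one_to_one"` check is an added assumption that each user_id appears at most once in each table.

```python
import pandas as pd

# Sample data copied from the question.
df_X = pd.DataFrame(
    {"user_id": ["A", "B", "C"], "clicks": [10, 5, 0], "impressions": [100, 40, 20]}
)
df_y = pd.DataFrame({"user_id": ["A", "B", "C"], "conversions": [4, 2, 1]})

# Inner join keeps only users present in both tables.
# validate="one_to_one" raises if user_id is duplicated on either side
# (an assumption about the data, not something the question states).
merged = df_X.merge(df_y, on="user_id", how="inner", validate="one_to_one")

# Design matrix with columns [clicks, impressions] and target vector.
X = merged[["clicks", "impressions"]].to_numpy(dtype=float)
y = merged["conversions"].to_numpy(dtype=float)
```

Comparing `len(merged)` with `len(df_X)` and `len(df_y)` after the join is a cheap way to see how many users were dropped for lack of a match.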

Tasks:

1) Inner-join on user_id to build X ∈ R^{n×p} with columns [clicks, impressions] and y ∈ R^{n}. Fit the no-intercept model y = Xβ via: (i) the normal equations β̂ = (XᵀX)^{-1}Xᵀy; and (ii) a numerically stable factorization (QR or SVD). Show the intermediate matrices (XᵀX, Xᵀy) and the final β̂.

2) Compute R² for the no-intercept model using the correct definition (with TSS = ∑ y_i^2). Explain why this R² can be negative and compare it against the standard (centered) R² of a model with an intercept.

3) Discuss when omitting the intercept is appropriate; verify by mean-centering X and y and refitting both with and without an intercept, commenting on how the coefficients change.

4) If XᵀX is singular or ill-conditioned (e.g., clicks collinear with impressions), detect this and refit using ridge regression with λ = 1e-3; give the closed form β̂_ridge = (XᵀX + λI)^{-1}Xᵀy and compute it on the sample.

5) State how you would validate the model (residual diagnostics, cross-validation) and guard against data leakage when constructing df_X and df_y.
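One possible way to work tasks 1 to 4 on the three sample rows is sketched below with NumPy only. It assumes the join has already produced the arrays `X` and `y` (hard-coded here so the block runs on its own); all variable names are illustrative, and this is not presented as the reference solution.

```python
import numpy as np

# Joined sample rows A, B, C: columns [clicks, impressions] and conversions.
X = np.array([[10.0, 100.0], [5.0, 40.0], [0.0, 20.0]])
y = np.array([4.0, 2.0, 1.0])

# Task 1(i): normal equations, solved as a linear system (no explicit inverse).
XtX = X.T @ X                      # [[125, 1200], [1200, 12000]]
Xty = X.T @ y                      # [50, 500]
beta_ne = np.linalg.solve(XtX, Xty)

# Task 1(ii): reduced QR, X = QR, then solve R beta = Q^T y.
Q, R = np.linalg.qr(X)             # Q is 3x2, R is 2x2 upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)

# Task 2: R^2 with the uncentered TSS (sum of y_i^2) and, for comparison,
# the same residuals scored against the centered TSS.
resid = y - X @ beta_qr
rss = float(resid @ resid)
r2_uncentered = 1.0 - rss / float(y @ y)
r2_centered = 1.0 - rss / float(((y - y.mean()) ** 2).sum())

# Task 3: fit with an intercept, then fit without one on mean-centered data.
# The slopes of the two fits should agree.
X1 = np.column_stack([np.ones(len(y)), X])
beta_int, *_ = np.linalg.lstsq(X1, y, rcond=None)
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_centered, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Task 4: condition number as a collinearity check, then closed-form ridge.
cond_X = np.linalg.cond(X)         # cond(X^T X) is the square of this
lam = 1e-3
beta_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), Xty)

print("beta (normal eq.):", beta_ne)        # approx [0, 0.0417]
print("beta (QR):        ", beta_qr)
print("R2 uncentered:    ", r2_uncentered)  # approx 0.9921
print("R2 centered:      ", r2_centered)    # approx 0.9643
print("beta with intercept [b0, clicks, impressions]:", beta_int)
print("beta centered, no intercept:", beta_centered)
print("cond(X):", cond_X, " beta ridge:", beta_ridge)
```

On this sample the through-origin fit puts essentially all the weight on impressions (β̂ ≈ [0, 1/24]), while the intercept fit gives slopes [0.1, 0.025] with intercept 0.5, so forcing the line through the origin changes the coefficients noticeably. With only three rows and three parameters the intercept model interpolates the data exactly, so that comparison is illustrative rather than evidence about which model is better. One point worth stating in an answer: for an in-sample OLS fit, R² computed with TSS = ∑ y_i^2 cannot fall below zero, because β = 0 already achieves RSS = TSS. Negative values appear when a through-origin fit is scored against the centered TSS, or when R² is evaluated on held-out data.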

