What difficulty level is this interview question?

This is a hard difficulty Machine Learning question, commonly asked during Technical Screen rounds at Two Sigma.

What role is this question designed for?

This question is commonly asked for Data Scientist candidates at Two Sigma during technical interviews.

Derive correlation bounds and omitted-variable bias

This question evaluates understanding of multivariate correlation structure and linear regression properties, focusing on feasible ranges and constructions for equal pairwise correlations and the derivation of omitted-variable bias in ordinary least squares.

Core Statistics Prompt

This is the core statistics round of a multi-stage Data Scientist interview. It bundles two independent statistics questions. Treat each Part on its own; they share no variables.

Constraints & Assumptions

All random variables are real-valued with finite second moments.
In Part A, each variable has unit variance, so the covariance matrix and the correlation matrix coincide.
In Part B, "true model" means the data are actually generated by the full linear model with classical OLS error assumptions ($\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$) unless you argue otherwise.
Standard linear-algebra / probability tools (eigen-decomposition, Cholesky, OLS normal equations) are fair game; no simulation required.

Clarifying Questions to Ask

Part A: Are we asked for the range over which a valid joint distribution exists, or just the range that keeps the correlation matrix valid (positive semidefinite)? (They coincide here, but stating it shows rigor.)
Part A: Should the construction work for all feasible $p$ (including the negative regime), or is a positive-$p$ construction acceptable for partial credit?
Part B: Is "impact on $\tilde\beta_1$" asking for the bias of the estimator (an expectation statement), or for the full sampling distribution?
Part B: Are $X_1, X_2$ full-column-rank (so the relevant Gram matrices are invertible), and are we treating the regressors as fixed/conditioned?

Part A — Equal pairwise correlation

Let $X, Y, Z$ be random variables with unit variance and equal pairwise correlation:

$$\mathrm{Corr}(X,Y)=\mathrm{Corr}(Y,Z)=\mathrm{Corr}(X,Z)=p.$$

What values of $p$ are feasible?
Give a method to construct $(X,Y,Z)$ that achieves any feasible $p$.
Generalize: for $n$ variables with the same pairwise correlation $p$, what is the feasible range of $p$, and how would you construct them?

What This Part Should Cover

Frames feasibility as PSD-ness of the correlation matrix (not an ad-hoc "$|p|\le 1$" guess).
Derives both distinct eigenvalues from the matrix structure and uses $\lambda \ge 0$ to obtain a two-sided bound on $p$, recognizing that the lower bound is $n$-dependent and tightens toward $0$ as $n$ grows (without stating the closed form).
Gives a construction that provably achieves the target $p$, and is honest about where the simple factor model breaks (negative $p$), offering a general fallback.
Notes the asymmetry of the bounds and the intuition for why strong negative correlation among many variables is impossible.
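A derivation for Part A can be sanity-checked numerically. The sketch below is one possible check, not the expected interview answer: it builds the equicorrelation matrix, tests PSD-ness through its eigenvalues, and shows two constructions as loading matrices. All function names (`equicorr_matrix`, `is_feasible`, `factor_loadings`, `mixing_matrix`) are hypothetical helpers introduced here, and the tolerance `1e-12` is an assumed numerical cutoff.

```python
import numpy as np

def equicorr_matrix(n, p):
    """n x n matrix with ones on the diagonal and p everywhere else."""
    return (1.0 - p) * np.eye(n) + p * np.ones((n, n))

def is_feasible(n, p, tol=1e-12):
    """PSD check: the smallest eigenvalue must be non-negative (up to tol)."""
    return np.linalg.eigvalsh(equicorr_matrix(n, p)).min() >= -tol

def factor_loadings(n, p):
    """One-factor construction X_i = sqrt(p) * F + sqrt(1 - p) * e_i.
    Returns the n x (n + 1) loading matrix B, so that B @ B.T equals R.
    It needs sqrt(p), so it only covers 0 <= p <= 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("the one-factor construction requires 0 <= p <= 1")
    common = np.full((n, 1), np.sqrt(p))        # shared factor F
    idio = np.sqrt(1.0 - p) * np.eye(n)         # independent noises e_i
    return np.hstack([common, idio])

def mixing_matrix(n, p, tol=1e-12):
    """General fallback: A with A @ A.T == R from the eigendecomposition.
    Unlike Cholesky, this also works on the boundary where R is singular."""
    w, V = np.linalg.eigh(equicorr_matrix(n, p))
    if w.min() < -tol:
        raise ValueError("p is outside the feasible range for this n")
    return V * np.sqrt(np.clip(w, 0.0, None))   # scales column j by sqrt(w[j])

def sample(n, p, size, seed=0):
    """Draw `size` rows of X = A @ W, with W i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((size, n))
    return W @ mixing_matrix(n, p).T
```

For example, `is_feasible(3, -0.5)` and `is_feasible(3, -0.6)` can be compared against a hand-derived bound, and `np.corrcoef(sample(3, 0.4, 200_000), rowvar=False)` should sit close to the target matrix. Gaussian inputs are only a convenience here; any uncorrelated unit-variance inputs give the same second moments.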

Part B — Omitted variable bias

Consider the true linear regression model

$$\mathbf{y}=X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

but you mistakenly fit the reduced model $\mathbf{y}=X_1\tilde\beta_1+\text{error}$, omitting $X_2$.

What is the impact on the estimated coefficient $\tilde\beta_1$?
Prove the result using matrix notation (OLS).

What This Part Should Cover

Sets up the reduced-model estimator $\tilde\beta_1=(X_1^\top X_1)^{-1}X_1^\top\mathbf{y}$ and substitutes the true $\mathbf{y}$ cleanly.
Isolates the bias term $(X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2$ and states the two conditions under which it is zero ($\beta_2=0$, or $X_1 \perp X_2$ in the sense that $X_1^\top X_2=0$).
Interprets the direction/sign of the bias in terms of the correlation between $X_1,X_2$ and the sign of $\beta_2$.
Connects to the assumption-violation view: omitting a relevant regressor pushes its effect into the error term, making the reduced error correlated with $X_1$ (endogeneity), so OLS is no longer unbiased/consistent.
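Substituting the true $\mathbf{y}$ into the reduced-model estimator gives the exact decomposition $\tilde\beta_1=\beta_1+(X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2+(X_1^\top X_1)^{-1}X_1^\top\varepsilon$, and only the last term has conditional mean zero. The sketch below checks this identity on simulated data. It is an illustration under assumed values, not part of the prompt: the data-generating process (slope `0.8` linking the regressors, coefficients `1.0`, `2.0`, `3.0`) and the helper names `ols` and `omitted_variable_demo` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares (avoids explicitly inverting X'X)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def omitted_variable_demo(n_obs=500, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed data-generating process: x2 is positively correlated with x1.
    x1 = rng.standard_normal(n_obs)
    x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n_obs)
    X1 = np.column_stack([np.ones(n_obs), x1])   # intercept and x1
    X2 = x2[:, None]                             # the omitted regressor
    beta1 = np.array([1.0, 2.0])
    beta2 = np.array([3.0])
    eps = rng.standard_normal(n_obs)
    y = X1 @ beta1 + X2 @ beta2 + eps            # true (full) model

    beta1_tilde = ols(X1, y)      # reduced fit that omits X2
    delta = ols(X1, X2)           # (X1'X1)^{-1} X1'X2: regression of X2 on X1
    bias_term = delta @ beta2     # omitted-variable bias term
    noise_term = ols(X1, eps)     # (X1'X1)^{-1} X1'eps, mean zero given X
    return beta1, beta1_tilde, bias_term, noise_term

beta1, beta1_tilde, bias_term, noise_term = omitted_variable_demo()
# The decomposition is an algebraic identity, so it holds to rounding error.
assert np.allclose(beta1_tilde, beta1 + bias_term + noise_term)
```

With these assumed values the regressors are positively correlated and $\beta_2>0$, so the reduced-model slope on `x1` is pushed upward. Flipping the sign of either the `0.8` link or `beta2` flips the sign of the bias.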

What a Strong Answer Covers

These dimensions span both Parts:

Reduces a "what values are possible / what is the effect" question to a precise linear-algebra condition (PSD-ness; the normal equations) rather than reasoning informally.
States assumptions explicitly (unit variance / PSD validity; $\mathbb{E}[\varepsilon\mid X]=0$, full column rank) and flags where they matter.
Distinguishes finite-sample (in-sample $X_1^\top X_2=0$) from population (true correlation) statements where relevant.

Follow-up Questions

Part A: For $n$ variables, what is the limit as $n\to\infty$ of the feasible lower bound on $p$, and what does that imply about whether infinitely many variables can be mutually negatively correlated?
Part A: If the variables are only required to have some joint distribution (not necessarily Gaussian), does the feasible range for $p$ change? Why or why not?
Part B: Is $\tilde\beta_1$ also inconsistent (biased as the sample size grows to infinity), or only finite-sample biased? Under what condition does the bias persist asymptotically?
Part B: How does omitting $X_2$ affect the estimated standard errors and the residual variance $\hat\sigma^2$, separately from the point-estimate bias?
