PracHub

Derive correlation bounds and omitted-variable bias

Last updated: Jun 21, 2026

Quick Overview

This question evaluates understanding of multivariate correlation structure and linear regression properties, focusing on feasible ranges and constructions for equal pairwise correlations and the derivation of omitted-variable bias in ordinary least squares.

Company: Two Sigma

Role: Data Scientist

Category: Machine Learning

Difficulty: hard

Interview Round: Technical Screen

## Core Statistics Prompt

This is the *core statistics* round of a multi-stage Data Scientist interview. It bundles two independent statistics questions. Treat each Part on its own; they share no variables.

### Constraints & Assumptions

- All random variables are real-valued with finite second moments.
- In Part A, each variable has **unit variance**, so the covariance matrix and the correlation matrix coincide.
- In Part B, "true model" means the data are actually generated by the full linear model with classical OLS error assumptions ($\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$) unless you argue otherwise.
- Standard linear-algebra / probability tools (eigen-decomposition, Cholesky, OLS normal equations) are fair game; no simulation required.

### Clarifying Questions to Ask

- Part A: Are we asked for the range over which a *valid joint distribution* exists, or just the range that keeps the correlation **matrix** valid (positive semidefinite)? (They coincide here, but stating it shows rigor.)
- Part A: Should the construction work for **all** feasible $p$ (including the negative regime), or is a positive-$p$ construction acceptable for partial credit?
- Part B: Is "impact on $\tilde\beta_1$" asking for the bias of the estimator (an expectation statement), or for the full sampling distribution?
- Part B: Are $X_1, X_2$ full-column-rank (so the relevant Gram matrices are invertible), and are we treating the regressors as fixed/conditioned?

### Part A: Equal pairwise correlation

Let $X, Y, Z$ be random variables with unit variance and **equal pairwise correlation**:

$$
\mathrm{Corr}(X,Y)=\mathrm{Corr}(Y,Z)=\mathrm{Corr}(X,Z)=p.
$$

1. What values of $p$ are feasible?
2. Give a method to **construct** $(X,Y,Z)$ that achieves any feasible $p$.
3. Generalize: for **$n$** variables with the same pairwise correlation $p$, what is the feasible range of $p$, and how would you construct them?
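As a numerical companion to Part A, the following Python sketch treats feasibility as a positive-semidefiniteness check on the equicorrelation matrix and shows two ways to draw variables with a target $p$. It assumes NumPy is available. The function names (`equicorrelation`, `is_feasible`, `sample_factor`, `sample_cholesky`) and the tolerance are illustrative choices that do not come from the prompt, and Gaussian draws are used only for convenience.

```python
import numpy as np


def equicorrelation(n: int, p: float) -> np.ndarray:
    """Return R = (1 - p) I + p 11^T: ones on the diagonal, p elsewhere."""
    return (1.0 - p) * np.eye(n) + p * np.ones((n, n))


def is_feasible(n: int, p: float, tol: float = 1e-12) -> bool:
    """True when the smallest eigenvalue of R is non-negative (up to tol)."""
    eigenvalues = np.linalg.eigvalsh(equicorrelation(n, p))
    return bool(eigenvalues.min() >= -tol)


def sample_factor(n: int, p: float, size: int, seed: int = 0) -> np.ndarray:
    """Common-factor recipe, valid only for 0 <= p <= 1.

    X_i = sqrt(p) * F + sqrt(1 - p) * e_i with F, e_i independent N(0, 1),
    so Var(X_i) = p + (1 - p) = 1 and Cov(X_i, X_j) = p for i != j.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("the factor recipe needs 0 <= p <= 1")
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((size, 1))   # shared latent factor F
    noise = rng.standard_normal((size, n))    # idiosyncratic terms e_i
    return np.sqrt(p) * common + np.sqrt(1.0 - p) * noise


def sample_cholesky(n: int, p: float, size: int, seed: int = 0) -> np.ndarray:
    """General recipe: rows z L^T with R = L L^T have covariance R.

    np.linalg.cholesky requires R to be strictly positive definite.
    """
    rng = np.random.default_rng(seed)
    lower = np.linalg.cholesky(equicorrelation(n, p))
    return rng.standard_normal((size, n)) @ lower.T
```

The factor recipe only covers $p \ge 0$ because it needs a real $\sqrt{p}$. The Cholesky recipe also handles negative $p$, but it needs $R$ to be strictly positive definite, so it can fail at the boundary of the feasible range.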
```hint Where to start
A vector of correlations is feasible iff the resulting correlation **matrix** is a valid covariance matrix, i.e. **positive semidefinite (PSD)**. Reduce "which $p$ are feasible" to "for which $p$ is $R$ PSD".
```

```hint Exploit the symmetry
The equicorrelation matrix $R = (1-p)I + p\,\mathbf{1}\mathbf{1}^\top$ has only two distinct eigenvalues. The all-ones vector $\mathbf{1}$ is one eigenvector; everything orthogonal to it is the other eigenspace. Both eigenvalues must be $\ge 0$, which gives you a two-sided bound on $p$.
```

```hint Construction direction
For $p \ge 0$, think about a **shared common-factor** structure where each variable loads on one latent source; that naturally injects positive co-movement and keeps unit variance. For the negative regime, $\sqrt{p}$ is undefined, so you need a more general PSD recipe (e.g. Cholesky decomposition of $R$).
```

#### What This Part Should Cover

- Frames feasibility as PSD-ness of the correlation matrix (not an ad-hoc "$|p|\le 1$" guess).
- Derives both eigenvalues from the matrix structure and uses $\lambda \ge 0$ to obtain a two-sided bound on $p$, recognizing that the lower bound is $n$-dependent and tightens toward $0$ as $n$ grows (without stating the closed form).
- Gives a construction that *provably* achieves the target $p$, and is honest about where the simple factor model breaks (negative $p$), offering a general fallback.
- Notes the asymmetry of the bounds and the intuition for why strong negative correlation among many variables is impossible.

### Part B: Omitted variable bias

Consider the true linear regression model

$$
\mathbf{y}=X_1\beta_1 + X_2\beta_2 + \varepsilon,
$$

but you mistakenly fit the **reduced** model $\mathbf{y}=X_1\tilde\beta_1+\text{error}$, omitting $X_2$.

1. What is the impact on the estimated coefficient $\tilde\beta_1$?
2. Prove the result using matrix notation (OLS).
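The effect of fitting the reduced model can also be checked numerically. The sketch below assumes NumPy and uses a hypothetical helper name, `reduced_fit_decomposition`. It fits the reduced model through the normal equations and splits the estimate into three pieces: the true $\beta_1$, the term driven by the omitted block $X_2$, and a noise term. It illustrates the algebra and is not a substitute for the proof.

```python
import numpy as np


def reduced_fit_decomposition(X1, X2, beta1, beta2, eps):
    """Fit y on X1 only and return (beta1_tilde, omitted_term, noise_term).

    With A = (X1^T X1)^{-1} X1^T and y = X1 beta1 + X2 beta2 + eps:
        beta1_tilde  = A y
        omitted_term = A X2 beta2
        noise_term   = A eps
    so beta1_tilde = beta1 + omitted_term + noise_term holds exactly.
    Assumes X1 has full column rank so X1^T X1 is invertible.
    """
    y = X1 @ beta1 + X2 @ beta2 + eps
    # Solve (X1^T X1) A = X1^T instead of forming an explicit inverse.
    A = np.linalg.solve(X1.T @ X1, X1.T)
    beta1_tilde = A @ y
    omitted_term = A @ X2 @ beta2
    noise_term = A @ eps
    return beta1_tilde, omitted_term, noise_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X1 = rng.standard_normal((500, 2))
    # X2 is built to co-move with the first column of X1.
    X2 = 0.5 * X1[:, :1] + rng.standard_normal((500, 1))
    beta1, beta2 = np.array([1.0, -2.0]), np.array([3.0])
    eps = rng.standard_normal(500)
    estimate, omitted, noise = reduced_fit_decomposition(X1, X2, beta1, beta2, eps)
    print("reduced-model estimate:", estimate)
    print("beta1 + omitted term + noise term:", beta1 + omitted + noise)
```

Since $\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$, the noise term averages out conditionally on the regressors. That leaves the omitted-block term as the systematic shift in $\tilde\beta_1$.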
```hint Where to start
Write the reduced-model OLS estimator $\tilde\beta_1 = (X_1^\top X_1)^{-1}X_1^\top \mathbf{y}$, then **substitute the true data-generating $\mathbf{y}$** into it and take the conditional expectation.
```

#### What This Part Should Cover

- Sets up the reduced-model estimator and substitutes the true $\mathbf{y}$ cleanly.
- Isolates the bias term $(X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2$ and states the two conditions under which it is zero ($\beta_2=0$ **or** $X_1 \perp X_2$).
- Interprets the **direction/sign** of the bias in terms of the correlation between $X_1,X_2$ and the sign of $\beta_2$.
- Connects to the assumption-violation view: omitting a relevant regressor pushes its effect into the error term, making the reduced error correlated with $X_1$ (endogeneity), so OLS is no longer unbiased/consistent.

### What a Strong Answer Covers

These dimensions span both Parts:

- Reduces a "what values are possible / what is the effect" question to a precise linear-algebra condition (PSD-ness; the normal equations) rather than reasoning informally.
- States assumptions explicitly (unit variance / PSD validity; $\mathbb{E}[\varepsilon\mid X]=0$, full column rank) and flags where they matter.
- Distinguishes finite-sample (in-sample $X_1^\top X_2=0$) from population (true correlation) statements where relevant.

### Follow-up Questions

- Part A: For $n$ variables, what is $\lim_{n\to\infty}$ of the feasible lower bound on $p$, and what does that imply about whether infinitely many variables can be mutually negatively correlated?
- Part A: If the variables are only required to be jointly distributed (not Gaussian), does the feasible range for $p$ change? Why or why not?
- Part B: Is $\tilde\beta_1$ also *inconsistent* (biased as $n\to\infty$), or only finite-sample biased? Under what condition does the bias persist asymptotically?
- Part B: How does omitting $X_2$ affect the estimated standard errors and the residual variance $\hat\sigma^2$, separately from the point-estimate bias?
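
For the first Part A follow-up, one way to build intuition before deriving anything is to locate the lower end of the feasible range numerically for several $n$. The sketch below assumes NumPy; `min_eigenvalue` and `lower_bound` are hypothetical helper names. It bisects on $p \in [-1, 0]$, which relies on the feasible set being an interval that contains $p = 0$. That holds because $R$ is affine in $p$ and the PSD cone is convex.

```python
import numpy as np


def min_eigenvalue(n: int, p: float) -> float:
    """Smallest eigenvalue of the n x n equicorrelation matrix with parameter p."""
    R = (1.0 - p) * np.eye(n) + p * np.ones((n, n))
    return float(np.linalg.eigvalsh(R).min())


def lower_bound(n: int, tol: float = 1e-9) -> float:
    """Approximate the smallest feasible p for n variables by bisection.

    Invariant: hi is always feasible (p = 0 gives the identity matrix),
    and the interval [lo, hi] shrinks around the feasibility boundary.
    """
    lo, hi = -1.0, 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_eigenvalue(n, mid) >= 0.0:
            hi = mid   # mid is feasible, so the boundary is at or below mid
        else:
            lo = mid   # mid is infeasible, so the boundary is above mid
    return hi


if __name__ == "__main__":
    for n in (2, 3, 5, 10, 100):
        print(n, round(lower_bound(n), 4))
```

Running the loop for increasing $n$ shows how quickly the room for mutual negative correlation shrinks. This is the behavior the limit question asks you to explain.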

Related Interview Questions

  • Analyze Temperatures and Update Regression - Two Sigma (medium)
  • How would you forecast bike demand? - Two Sigma (hard)
  • Predict Bike Dock Demand - Two Sigma (hard)
  • Predict bike demand and avoid overfitting - Two Sigma (hard)
  • How detect duplicate card records? - Two Sigma (medium)
Two Sigma
Jan 6, 2026, 12:00 AM