Derive correlation bounds and omitted-variable bias
Company: Two Sigma
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Technical Screen
## Core Statistics Prompt
This is the *core statistics* round of a multi-stage Data Scientist interview. It bundles two independent statistics questions. Treat each Part on its own; they share no variables.
### Constraints & Assumptions
- All random variables are real-valued with finite second moments.
- In Part A, each variable has **unit variance**, so the covariance matrix and the correlation matrix coincide.
- In Part B, "true model" means the data are actually generated by the full linear model with classical OLS error assumptions ($\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$) unless you argue otherwise.
- Standard linear-algebra / probability tools (eigen-decomposition, Cholesky, OLS normal equations) are fair game; no simulation required.
### Clarifying Questions to Ask
- Part A: Are we asked for the range over which a *valid joint distribution* exists, or just the range that keeps the correlation **matrix** valid (positive semidefinite)? (They coincide here, but stating it shows rigor.)
- Part A: Should the construction work for **all** feasible $p$ (including the negative regime), or is a positive-$p$ construction acceptable for partial credit?
- Part B: Is "impact on $\tilde\beta_1$" asking for the bias of the estimator (an expectation statement), or for the full sampling distribution?
- Part B: Does $X_1$ have full column rank (so $X_1^\top X_1$ is invertible; likewise $[X_1\ X_2]$ for the full model), and are we treating the regressors as fixed/conditioned on?
### Part A — Equal pairwise correlation
Let $X, Y, Z$ be random variables with unit variance and **equal pairwise correlation**:
$$
\mathrm{Corr}(X,Y)=\mathrm{Corr}(Y,Z)=\mathrm{Corr}(X,Z)=p.
$$
1. What values of $p$ are feasible?
2. Give a method to **construct** $(X,Y,Z)$ that achieves any feasible $p$.
3. Generalize: for **$n$** variables with the same pairwise correlation $p$, what is the feasible range of $p$, and how would you construct them?
```hint Where to start
A vector of correlations is feasible iff the resulting correlation **matrix** is a valid covariance matrix, i.e. **positive semidefinite (PSD)**. Reduce "which $p$ are feasible" to "for which $p$ is $R$ PSD".
```
```hint Exploit the symmetry
The equicorrelation matrix $R = (1-p)I + p\,\mathbf{1}\mathbf{1}^\top$ has only two distinct eigenvalues. The all-ones vector $\mathbf{1}$ is one eigenvector; everything orthogonal to it is the other eigenspace. Both eigenvalues must be $\ge 0$ — that gives you a two-sided bound on $p$.
```
```hint Construction direction
For $p \ge 0$, think about a **shared common-factor** structure where each variable loads on one latent source; that naturally injects positive co-movement and keeps unit variance. For the negative regime, $\sqrt{p}$ is not real, so you need a more general PSD recipe (e.g. a matrix square root of $R$ from its Cholesky or eigen-decomposition).
```
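The PSD reduction in the hints can be checked numerically. The sketch below assumes NumPy; the helper names `equicorrelation` and `is_feasible` are illustrative, not part of the prompt. It builds $R = (1-p)I + p\,\mathbf{1}\mathbf{1}^\top$ and tests whether its smallest eigenvalue is nonnegative, which is one way to sanity-check a bound derived by hand.

```python
import numpy as np


def equicorrelation(n: int, p: float) -> np.ndarray:
    """Return the n x n matrix with ones on the diagonal and p elsewhere."""
    return (1.0 - p) * np.eye(n) + p * np.ones((n, n))


def is_feasible(n: int, p: float, tol: float = 1e-12) -> bool:
    """PSD check: the smallest eigenvalue must be >= 0, up to a small tolerance."""
    smallest = np.linalg.eigvalsh(equicorrelation(n, p)).min()
    return bool(smallest >= -tol)


if __name__ == "__main__":
    # eigvalsh returns eigenvalues in ascending order; note only two distinct values appear.
    print(np.linalg.eigvalsh(equicorrelation(3, 0.2)))
    for p in (-0.6, -0.5, 0.0, 1.0):
        print(p, is_feasible(3, p))
```

The tolerance `tol` is an assumption to absorb floating-point error at the boundary, where $R$ is singular.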
#### What This Part Should Cover
- Frames feasibility as PSD-ness of the correlation matrix (not an ad-hoc "$|p|\le 1$" guess).
- Derives both eigenvalues from the matrix structure and uses $\lambda \ge 0$ to obtain a two-sided bound on $p$, recognizing that the lower bound is $n$-dependent and tightens toward $0$ as $n$ grows (without stating the closed form).
- Gives a construction that *provably* achieves the target $p$, and is honest about where the simple factor model breaks (negative $p$), offering a general fallback.
- Notes the asymmetry of the bounds and the intuition for why strong negative correlation among many variables is impossible.
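One possible way to realize the two constructions is sketched below, assuming NumPy and Gaussian latent draws (the prompt does not require Gaussianity; any independent zero-mean, unit-variance sources would give the same correlations). The function names are hypothetical. `sample_common_factor` uses a single shared factor and only works for $0 \le p \le 1$; `sample_general` uses an eigen-decomposition square root of $R$, which also covers negative $p$ and the singular boundary case.

```python
import numpy as np


def sample_common_factor(n, p, size, rng):
    """X_i = sqrt(p) * F + sqrt(1 - p) * E_i, valid only for 0 <= p <= 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("common-factor construction needs 0 <= p <= 1")
    f = rng.standard_normal((size, 1))   # shared latent factor F
    e = rng.standard_normal((size, n))   # idiosyncratic noise E_i
    return np.sqrt(p) * f + np.sqrt(1.0 - p) * e


def sample_general(n, p, size, rng):
    """X = A Z with A A^T = R, where A comes from the eigen-decomposition of R."""
    r = (1.0 - p) * np.eye(n) + p * np.ones((n, n))
    w, v = np.linalg.eigh(r)
    if w.min() < -1e-12:
        raise ValueError("p is infeasible: R is not PSD")
    a = v * np.sqrt(np.clip(w, 0.0, None))  # A = V diag(sqrt(w))
    z = rng.standard_normal((size, n))      # independent unit-variance sources
    return z @ a.T                          # each row is A @ z_row


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = sample_general(3, -0.4, 100_000, rng)
    print(np.corrcoef(x, rowvar=False).round(3))
```

The eigen-decomposition is used here rather than Cholesky because it still yields a valid square root when $R$ is only positive *semi*definite.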
### Part B — Omitted variable bias
Consider the true linear regression model
$$
\mathbf{y}=X_1\beta_1 + X_2\beta_2 + \varepsilon,
$$
but you mistakenly fit the **reduced** model $\mathbf{y}=X_1\tilde\beta_1+\text{error}$, omitting $X_2$.
1. What is the impact on the estimated coefficient $\tilde\beta_1$?
2. Prove the result using matrix notation (OLS).
```hint Where to start
Write the reduced-model OLS estimator $\tilde\beta_1 = (X_1^\top X_1)^{-1}X_1^\top \mathbf{y}$, then **substitute the true data-generating $\mathbf{y}$** into it and take the conditional expectation.
```
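Although no simulation is required, a small numerical check can make the substitution step concrete. The sketch below assumes NumPy and a hypothetical one-regressor design in which `x2` is correlated with `x1` by construction; the coefficient values are arbitrary. It verifies that the reduced-model estimate splits exactly into $\beta_1$, a term driven by $X_2\beta_2$, and a term driven by $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 500

# Hypothetical design: x2 loads 0.6 on x1, so the two regressors are correlated.
x1 = rng.standard_normal((n_obs, 1))
x2 = 0.6 * x1 + 0.8 * rng.standard_normal((n_obs, 1))
beta1 = np.array([2.0])
beta2 = np.array([3.0])
eps = rng.standard_normal(n_obs)
y = x1 @ beta1 + x2 @ beta2 + eps  # true data-generating model


def ols(x, target):
    """Solve the normal equations (X'X) b = X' target."""
    return np.linalg.solve(x.T @ x, x.T @ target)


beta1_tilde = ols(x1, y)         # reduced model: X2 omitted
bias_term = ols(x1, x2) @ beta2  # (X1'X1)^{-1} X1'X2 beta2
noise_term = ols(x1, eps)        # (X1'X1)^{-1} X1'eps, mean zero given X

print("reduced estimate:", beta1_tilde)
print("beta1 + bias + noise:", beta1 + bias_term + noise_term)
```

The two printed vectors agree up to floating-point error because the decomposition is an algebraic identity, not a statistical approximation.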
#### What This Part Should Cover
- Sets up the reduced-model estimator and substitutes the true $\mathbf{y}$ cleanly.
- Isolates the bias term $(X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2$ and states the two standard conditions under which it vanishes ($\beta_2=0$ **or** $X_1^\top X_2 = 0$, i.e. the omitted columns are orthogonal to the included ones).
- Interprets the **direction/sign** of the bias in terms of the correlation between $X_1,X_2$ and the sign of $\beta_2$.
- Connects to the assumption-violation view: omitting a relevant regressor pushes its effect into the error term, making the reduced error correlated with $X_1$ (endogeneity), so OLS is no longer unbiased/consistent.
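For reference, the substitution described above can be written out in a few lines, assuming $X_1^\top X_1$ is invertible and $\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$:
$$
\tilde\beta_1 = (X_1^\top X_1)^{-1}X_1^\top \mathbf{y}
= (X_1^\top X_1)^{-1}X_1^\top\left(X_1\beta_1 + X_2\beta_2 + \varepsilon\right)
= \beta_1 + (X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2 + (X_1^\top X_1)^{-1}X_1^\top \varepsilon,
$$
$$
\mathbb{E}\left[\tilde\beta_1 \mid X_1, X_2\right] = \beta_1 + (X_1^\top X_1)^{-1}X_1^\top X_2\,\beta_2 .
$$
The last term of the first line drops out in expectation because $\mathbb{E}[\varepsilon \mid X_1, X_2] = 0$. In the scalar case the remaining bias is $\beta_2$ times the slope from regressing $X_2$ on $X_1$, so its sign is the product of the sign of $\beta_2$ and the sign of the in-sample co-movement between $X_1$ and $X_2$.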
### What a Strong Answer Covers
These dimensions span both Parts:
- Reduces a "what values are possible / what is the effect" question to a precise linear-algebra condition (PSD-ness; the normal equations) rather than reasoning informally.
- States assumptions explicitly (unit variance / PSD validity; $\mathbb{E}[\varepsilon\mid X]=0$, full column rank) and flags where they matter.
- Distinguishes finite-sample (in-sample $X_1^\top X_2=0$) from population (true correlation) statements where relevant.
### Follow-up Questions
- Part A: For $n$ variables, what is $\lim_{n\to\infty}$ of the feasible lower bound on $p$, and what does that imply about whether infinitely many variables can be mutually negatively correlated?
- Part A: If the variables are only required to be jointly distributed (not Gaussian), does the feasible range for $p$ change? Why or why not?
- Part B: Is $\tilde\beta_1$ also *inconsistent* (still biased in the limit as the sample size grows), or only finite-sample biased? Under what condition does the bias persist asymptotically?
- Part B: How does omitting $X_2$ affect the estimated standard errors and the residual variance $\hat\sigma^2$, separately from the point-estimate bias?
Quick Answer: This question evaluates understanding of multivariate correlation structure and linear regression properties, focusing on feasible ranges and constructions for equal pairwise correlations and the derivation of omitted-variable bias in ordinary least squares.