Why Is the Sample Mean Approximately Normal for Large Samples?
Company: Two Sigma
Role: Data Scientist
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
In statistics we routinely treat the average of a large sample as if it were normally distributed — for example, when building confidence intervals for a mean. Why is that justified?
Explain precisely **what quantity** becomes approximately normal as the sample size grows, state the theorem that justifies it along with its conditions, give an intuitive argument (or a proof sketch) for why it is true, and describe concrete situations where the normal approximation fails or is poor.
```hint Be precise about what converges
"A large sample is approximately normal" is a common misstatement — the data's distribution never changes with $n$. The object that becomes normal is the **standardized sample mean** (equivalently, the standardized sum). Naming the Central Limit Theorem is the start, not the answer.
```
```hint A route to "why"
Consider the characteristic function (or moment generating function) of a standardized sum of independent variables: what does raising a second-order expansion to the $n$-th power converge to? Alternatively, think about why convolving many independent distributions keeps smoothing the result toward one universal shape.
```
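The characteristic-function route from the hint can be sketched in a few lines. This is a whiteboard-level outline under the i.i.d., finite-variance assumption, not a full proof. The symbols $Y_i$, $Z_n$, and $\varphi$ are notation introduced here for illustration: $Y_i = (X_i - \mu)/\sigma$ is a standardized draw, $Z_n$ is the standardized sample mean, and $\varphi$ is the characteristic function of $Y_i$.

$$
\begin{aligned}
Z_n &= \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i, \qquad \mathbb{E}[Y_i] = 0, \quad \operatorname{Var}(Y_i) = 1, \\
\varphi(t) &= \mathbb{E}\!\left[e^{itY_i}\right] = 1 - \frac{t^2}{2} + o(t^2) \quad \text{as } t \to 0, \\
\varphi_{Z_n}(t) &= \left[\varphi\!\left(\frac{t}{\sqrt{n}}\right)\right]^{n} = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^{n} \longrightarrow e^{-t^2/2} \quad \text{as } n \to \infty.
\end{aligned}
$$

The limit $e^{-t^2/2}$ is the characteristic function of $N(0,1)$, and Lévy's continuity theorem turns pointwise convergence of characteristic functions into convergence in distribution. Only the mean and variance of $Y_i$ survive in the limit, which is one way to see why the details of the original distribution wash out.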
### Constraints & Assumptions
- Assume independent, identically distributed draws unless you explicitly relax that.
- Whiteboard-level rigor is expected: a precise statement plus a convincing sketch, not a measure-theoretic proof.
- You should address both the "why it works" and the "when it breaks" sides.
### Clarifying Questions to Ask
- Do you want the formal statement with conditions, the intuition for why it holds, or both?
- May I assume i.i.d. sampling with finite variance, or should I discuss what happens when those assumptions are relaxed?
- Are you also interested in *how fast* the approximation becomes good (rates of convergence), or just the limiting statement?
### What a Strong Answer Covers
- A precise statement of the Central Limit Theorem: for i.i.d. draws with mean $\mu$ and finite, nonzero variance $\sigma^2$, the standardized sample mean $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ converges **in distribution** to $N(0,1)$. This must be clearly distinguished from the misconception that "the sample becomes normal."
- The distinction between the Law of Large Numbers (where the mean goes) and the CLT (the shape and $1/\sqrt{n}$ scale of the fluctuations around it).
- A credible "why": a characteristic-function proof sketch, or the convolution/aggregation intuition that sums of many small independent effects wash out the details of the individual distribution.
- Conditions and failure modes: infinite variance (heavy tails), strong dependence, a single dominating term, and slow convergence for very skewed distributions — plus what the relaxed versions (non-identical distributions) still require.
- The practical consequences for inference: standard errors, confidence intervals, and when to distrust the approximation at realistic sample sizes.
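
A small simulation can make several of these points concrete. The sketch below is only an illustration under assumed choices: it uses Exponential(1) as a stand-in for a skewed distribution (mean 1, standard deviation 1), and the helper names `standardized_means` and `skewness` are hypothetical. For each $n$ it draws many samples, standardizes each sample mean, and reports the skewness of those standardized means and how often they fall inside $\pm 1.96$.

```python
import numpy as np

def standardized_means(n, reps, rng):
    """Draw `reps` samples of size n from Exponential(1); standardize each sample mean."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    return np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

def skewness(z):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = z - z.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

rng = np.random.default_rng(0)
for n in (5, 50, 500):
    z = standardized_means(n, reps=10_000, rng=rng)
    coverage = np.mean(np.abs(z) <= 1.96)  # an exact N(0,1) would give about 0.95
    print(f"n={n:4d}  skewness={skewness(z):.3f}  P(|Z|<=1.96)={coverage:.3f}")
```

The raw exponential draws stay just as skewed at every $n$; it is the distribution of the standardized mean whose skewness shrinks toward zero. For this distribution the skewness of the standardized mean is $2/\sqrt{n}$, so the simulated values should decay at roughly that rate, which is the slow convergence for skewed data mentioned above.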
### Follow-up Questions
- Give a distribution for which the sample mean is *never* approximately normal, no matter how large $n$ is, and explain why the theorem's hypotheses fail.
- How fast does the approximation improve with $n$, and what feature of the underlying distribution controls that rate?
- The observations in a time series are dependent. Does anything like the CLT still hold, and what has to be true for it to?
- If you suspect the normal approximation is poor for your sample size, what would you do instead to get a confidence interval for the mean?
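
For the first follow-up, one way to probe a candidate answer is by simulation. The sketch below assumes the standard Cauchy distribution as the heavy-tailed example (it has no finite mean or variance) and compares it with the standard normal. The helper names `iqr_of_sample_means`, `draw_normal`, and `draw_cauchy` are hypothetical. It measures the spread of the sample mean with the interquartile range, since a standard deviation is not defined for the Cauchy case.

```python
import numpy as np

def iqr_of_sample_means(draw, n, reps, rng):
    """Interquartile range of `reps` sample means, each computed from n draws."""
    means = draw(rng, (reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

def draw_normal(rng, size):
    return rng.standard_normal(size)

def draw_cauchy(rng, size):
    return rng.standard_cauchy(size)

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    normal_iqr = iqr_of_sample_means(draw_normal, n, 4000, rng)
    cauchy_iqr = iqr_of_sample_means(draw_cauchy, n, 4000, rng)
    print(f"n={n:5d}  normal IQR={normal_iqr:.3f}  Cauchy IQR={cauchy_iqr:.3f}")
```

The normal column should shrink like $1/\sqrt{n}$, while the Cauchy column should stay near 2 at every $n$: the mean of $n$ independent standard Cauchy draws is itself standard Cauchy, so averaging does not concentrate it at all.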