Statistical Inference, Power, And Metric Uncertainty

What's being tested

Meta Data Scientists are expected to reason from noisy user-level data to defensible product conclusions: estimate quantities like average comments per `DAU`, quantify uncertainty, compare model or product variants, and avoid false discoveries. Interviewers are probing whether you know when Central Limit Theorem approximations are valid, how to construct and interpret confidence intervals, how to handle skewed or count-based metrics, and how to design tests without inflating false-positive rates. They also care whether you can communicate assumptions clearly: independence, random sampling, treatment assignment, metric definition, and whether the uncertainty is statistical, measurement-related, or causal. Strong answers connect formulas to product decisions, such as whether a ranking change, comment composer tweak, or chatbot model is ready to ship.

Core knowledge

Expectation is the long-run average of a random variable: $E[X] = \sum_x xP(X=x)$ for discrete outcomes. For user comments, the sample mean $\bar X = \frac{1}{n}\sum_i X_i$ estimates expected comments per user or per `DAU`, depending on the sampling unit.
Sample variance uses $n-1$, not $n$, in the denominator when estimating population variance from data: $s^2 = \frac{1}{n-1}\sum_i (X_i-\bar X)^2$. Use the population formula (dividing by $n$) only when the full population is observed, which is rare in interview settings.
The Central Limit Theorem says $\bar X$ is approximately normal for large $n$ if observations are independent (or only weakly dependent) and the variance is finite: $\bar X \approx N\left(\mu, \frac{\sigma^2}{n}\right)$. It can work well for user-level metrics with large `DAU`, but heavy tails and clustering slow convergence.
A 95% confidence interval for a mean is commonly $\bar X \pm 1.96\frac{s}{\sqrt n}$ when $n$ is large. For small samples, use a t-interval: $\bar X \pm t_{0.975,n-1}\frac{s}{\sqrt n}$, especially if variance is estimated and normality is plausible.
Count data like comments per user are often skewed, zero-inflated, and overdispersed relative to a Poisson model, where $E[X]=\operatorname{Var}(X)$. A negative binomial or nonparametric approach is often more realistic if a few highly active users dominate variance.
Bootstrap inference resamples users with replacement and recomputes the statistic, producing an empirical uncertainty distribution. For most interview-scale answers, 1,000–10,000 bootstrap replicates are enough. Resample at the user level rather than the event level so that the analysis unit and within-user correlation are preserved.
Independence is an assumption, not a given. User outcomes can be correlated through social graph effects, shared content, geography, or time shocks. If treatment is assigned by cluster, session, page, or conversation, standard errors must reflect that assignment unit.
Power is the probability of detecting a true effect: $1-\beta$. For a two-sided, two-sample mean comparison with equal group sizes, the approximate required per-arm sample size is $n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2}$, where $\sigma^2$ is the per-user outcome variance (assumed equal across arms) and $\delta$ is the minimum detectable effect in the metric's own units.
Hypothesis testing separates effect size from uncertainty. A tiny lift in comments can be statistically significant with millions of users but product-irrelevant. Always report both the estimate and interval, for example “+0.3% comments per `DAU`, 95% CI [+0.1%, +0.5%].”
Sequential testing requires a pre-planned correction if you repeatedly peek at results. Pocock boundaries spend alpha relatively evenly; O’Brien–Fleming boundaries are stricter early and closer to conventional thresholds later. Naively stopping when $p<0.05$ inflates false positives.
Always-valid inference methods such as mixture SPRT, e-values, or confidence sequences allow continuous monitoring while controlling error rates under specified assumptions. In a DS interview, you do not need to derive them fully, but you should know why they avoid p-hacking better than ad hoc peeking.
Joint probability questions often test whether you distinguish independence from correlation. If “honest” and “relevant” chatbot answers are independent, $P(H \cap R)=P(H)P(R)$; without independence, use $P(H \cap R)=P(H)P(R\mid H)$ and ask how labels were collected.
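
The per-arm sample size formula from the power item above can be turned into a quick calculator. This is a minimal sketch under the same assumptions as the formula (two-sided z-test, equal arms, common variance). The function name `per_arm_sample_size` and the example inputs (a standard deviation of 3.0 comments per user and a minimum detectable effect of 0.05 comments) are hypothetical, chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided, two-sample z-test.

    sigma: per-user standard deviation (assumed equal in both arms)
    delta: minimum detectable effect, in the metric's own units
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = std_normal.inv_cdf(power)           # z_{1 - beta}
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


# Hypothetical inputs, for illustration only.
n_per_arm = per_arm_sample_size(sigma=3.0, delta=0.05)
n_half_mde = per_arm_sample_size(sigma=3.0, delta=0.025)
print(f"per-arm n at MDE 0.05:  {n_per_arm}")
print(f"per-arm n at MDE 0.025: {n_half_mde}")
```

Because $\delta$ enters squared, halving the minimum detectable effect roughly quadruples the required sample, which is often the most useful point to make out loud.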

Worked example

For the question “Analyze Central Limit Theorem in User Comment Distribution,” a strong candidate first clarifies the sampling unit: “Are we sampling users from a day’s `DAU`, sessions, or comments? Is the target average comments per active user, total comments, or expected comments for a randomly chosen user?” They would then declare assumptions: observations are user-level, sampled randomly from the relevant population, and each user contributes one count of comments for the day.

The answer should then be organized around four pillars. First, define the estimator: $\bar X$ estimates expected comments per active user, while $N\bar X$ estimates total comments for a population of $N$ users only if the sample of $n$ users represents that population. Second, discuss variability using $s/\sqrt n$, emphasizing that skewed counts can still have an approximately normal mean when $n$ is large. Third, build a confidence interval with either a normal or t critical value depending on sample size. Fourth, interpret the interval correctly and in product terms: the procedure produces intervals that cover the true mean in about 95% of repeated samples; it does not mean “there is a 95% probability this specific interval contains the truth.”

One tradeoff to flag is whether to rely on the CLT or use a bootstrap. If the distribution has many zeros and a few extreme commenters, the bootstrap may better reflect uncertainty for medians, percentiles, or trimmed means, while the CLT is still usually reasonable for the mean at large scale. A strong close would be: “If I had more time, I’d inspect the histogram, top-user contribution, day-of-week effects, and whether the target is user-level average or platform-level total.”
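
To make the CLT-versus-bootstrap comparison concrete, here is a minimal sketch that computes both intervals for the mean on the same sample. The data are synthetic zero-inflated counts generated only for illustration, the helper names (`sample_mean`, `normal_ci`, `bootstrap_ci`) are hypothetical, and the bootstrap shown is the simple percentile variant, which is one of several reasonable choices.

```python
import random
from math import sqrt


def sample_mean(xs):
    return sum(xs) / len(xs)


def normal_ci(xs, z=1.96):
    """Large-sample CI for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(xs)
    m = sample_mean(xs)
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)  # n-1 denominator
    half = z * sqrt(s2 / n)
    return m - half, m + half


def bootstrap_ci(xs, stat=sample_mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI; each element of xs is one user."""
    rng = random.Random(seed)
    n = len(xs)
    # Resample users with replacement and recompute the statistic.
    reps = sorted(stat(rng.choices(xs, k=n)) for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = int(n_boot * (1 - alpha / 2)) - 1
    return reps[lo_idx], reps[hi_idx]


# Synthetic data: about 70% of users post nothing, the rest are
# right-skewed. Illustrative only, not real product data.
data_rng = random.Random(42)
comments = [0 if data_rng.random() < 0.7 else 1 + int(data_rng.expovariate(0.25))
            for _ in range(2000)]

clt_lo, clt_hi = normal_ci(comments)
boot_lo, boot_hi = bootstrap_ci(comments)
print(f"mean = {sample_mean(comments):.3f}")
print(f"CLT 95% CI:       [{clt_lo:.3f}, {clt_hi:.3f}]")
print(f"bootstrap 95% CI: [{boot_lo:.3f}, {boot_hi:.3f}]")
```

At this sample size the two intervals for the mean should be close. Passing a different `stat` (for example a median or trimmed mean) is where the bootstrap becomes the more natural tool, since no simple closed-form standard error applies.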

A second angle

For the question “Apply sequential testing without p-hacking,” the same uncertainty concepts apply, but the main risk shifts from estimating one interval to controlling error under repeated decisions. A candidate should immediately ask how often results will be checked, whether the stopping rule is pre-registered, and whether the metric is primary or one of many guardrails. Instead of applying the fixed-horizon $p<0.05$ threshold at every look, they should propose an alpha-spending plan such as Pocock or O’Brien–Fleming, or an always-valid method if continuous monitoring is operationally necessary. The transferable idea is that uncertainty statements are only valid under their design assumptions; changing the stopping rule after seeing data changes the meaning of the p-value. The product framing is also different: early stopping may save users from a harmful launch, but it usually costs power or requires stricter evidence.
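
The cost of naive peeking can be shown with a small A/A simulation, where there is no true effect and every rejection is a false positive. This is a sketch under simplifying assumptions: outcomes are standard normal with known variance 1, there are five equally spaced looks, and the constant 2.413 is the standard tabulated Pocock critical value for five looks at two-sided $\alpha = 0.05$. The function name `simulate_peeking` and all sizes are hypothetical.

```python
import random
from math import sqrt


def simulate_peeking(n_sims=2000, looks=5, batch=50, seed=0):
    """False-positive rates in an A/A test under three stopping rules."""
    rng = random.Random(seed)
    naive = fixed = pocock = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        z = 0.0
        hit_naive = hit_pocock = False
        for _look in range(looks):
            # Each look adds `batch` new users per arm (no true effect).
            sum_a += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_b += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            z = (sum_a / n - sum_b / n) / sqrt(2 / n)  # known variance = 1
            hit_naive = hit_naive or abs(z) > 1.96     # reject at any peek
            hit_pocock = hit_pocock or abs(z) > 2.413  # Pocock, 5 looks
        naive += hit_naive
        pocock += hit_pocock
        fixed += abs(z) > 1.96  # test only once, at the final look
    return naive / n_sims, fixed / n_sims, pocock / n_sims


naive_rate, fixed_rate, pocock_rate = simulate_peeking()
print(f"stop at first p<0.05 over 5 looks: {naive_rate:.3f}")
print(f"single fixed-horizon test:         {fixed_rate:.3f}")
print(f"Pocock boundary over 5 looks:      {pocock_rate:.3f}")
```

The naive rule should land well above the nominal 5%, while the fixed-horizon and Pocock rules should both stay near it. The price of the Pocock boundary is that the final look needs stronger evidence than 1.96.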

Common pitfalls

Pitfall: Treating the CLT as “the data are normal.”

The CLT is about the sampling distribution of the mean, not the raw distribution of comments. A count distribution can be extremely skewed while the mean is approximately normal; a better answer says, “The user-level counts are not normal, but $\bar X$ may be approximately normal if $n$ is large and dependence is limited.”
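
A quick way to check this claim is to simulate interval coverage for a skewed count distribution at different sample sizes. The sketch below uses a made-up discrete distribution (`VALUES` and `WEIGHTS` are hypothetical, with 2% of users posting 50 comments) and measures how often the nominal 95% interval contains the true mean.

```python
import random
from math import sqrt

# Hypothetical, heavily skewed comment-count distribution.
VALUES = [0, 1, 2, 5, 50]
WEIGHTS = [0.60, 0.20, 0.10, 0.08, 0.02]
TRUE_MEAN = sum(v * w for v, w in zip(VALUES, WEIGHTS))  # about 1.8


def coverage(n, n_sims=1000, z=1.96, seed=0):
    """Share of simulated samples whose normal CI contains TRUE_MEAN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = rng.choices(VALUES, weights=WEIGHTS, k=n)
        m = sum(xs) / n
        s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
        half = z * sqrt(s2 / n)
        hits += (m - half <= TRUE_MEAN <= m + half)
    return hits / n_sims


cov_small = coverage(n=20)
cov_large = coverage(n=1000)
print(f"coverage at n=20:   {cov_small:.3f}")
print(f"coverage at n=1000: {cov_large:.3f}")
```

With only 20 users, most samples contain no heavy commenter and the interval misses the true mean far more often than 5% of the time. With 1,000 users, coverage should be much closer to nominal even though the raw counts are nowhere near normal.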

Pitfall: Giving a formula without defining the unit of analysis.

Saying $\bar X \pm 1.96s/\sqrt n$ is incomplete if $n$ might mean comments, users, sessions, or conversations. Meta interviewers expect you to anchor metrics to entities like user-day, `DAU`, treatment arm, or labeled chatbot response; otherwise your standard error may be artificially small.
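
The effect of choosing the wrong unit can be illustrated with synthetic data in which sessions from the same user are correlated. This sketch compares a naive standard error that treats every session as independent with a user-level bootstrap standard error. The data-generating process (a per-user propensity drawn from a beta distribution, 1 to 10 sessions per user) is an assumption made purely for illustration.

```python
import random
from math import sqrt

rng = random.Random(7)

# Synthetic data: each user has a personal comment propensity p, so
# sessions from the same user are correlated. Illustrative only.
users = []
for _ in range(300):
    p = rng.betavariate(0.5, 2.0)
    n_sessions = rng.randint(1, 10)
    # Comments in a session: number of successes in 10 chances at rate p.
    sessions = [sum(rng.random() < p for _ in range(10))
                for _ in range(n_sessions)]
    users.append(sessions)

all_sessions = [c for sessions in users for c in sessions]
n = len(all_sessions)
mean_per_session = sum(all_sessions) / n

# (a) Naive SE: pretends every session is an independent draw.
s2 = sum((c - mean_per_session) ** 2 for c in all_sessions) / (n - 1)
naive_se = sqrt(s2 / n)

# (b) User-level bootstrap SE: resample whole users, recompute the ratio
#     of total comments to total sessions.
per_user = [(sum(s), len(s)) for s in users]
boot = []
for _ in range(500):
    sample = rng.choices(per_user, k=len(per_user))
    boot.append(sum(t for t, _ in sample) / sum(k for _, k in sample))
boot_mean = sum(boot) / len(boot)
cluster_se = sqrt(sum((b - boot_mean) ** 2 for b in boot) / (len(boot) - 1))

print(f"mean comments per session: {mean_per_session:.3f}")
print(f"naive session-level SE:    {naive_se:.4f}")
print(f"user-level bootstrap SE:   {cluster_se:.4f}")
```

When within-user correlation is strong, as it is by construction here, the user-level standard error should come out noticeably larger than the naive one, which is exactly the sense in which the naive figure is artificially small.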

Pitfall: Confusing statistical significance with launch readiness.

A $p$-value below 0.05 does not mean the effect is large, causal under all conditions, or safe for all segments. Stronger communication pairs the estimate with confidence intervals, practical significance, guardrail metrics, and whether the test design controlled for peeking or multiple comparisons.

Connections

Interviewers may pivot from here to A/B testing, causal inference, multiple hypothesis correction, metric design, or model evaluation for ranking and chatbot systems. Be ready to discuss variance reduction methods like CUPED, heterogeneous treatment effects across cohorts, and how offline evaluation metrics connect to online user outcomes.
