Statistical Inference, Hypothesis Tests, and Power

What's being tested

Interviewers are probing whether you can turn noisy product data into a defensible decision: ship, iterate, or stop. At Meta, Data Scientists routinely evaluate experiments on feeds, ads, notifications, recommendations, integrity systems, and creator tools where small metric shifts can affect billions of sessions. The skill is not reciting p-values; it is choosing the right hypothesis test, checking assumptions, quantifying uncertainty, and explaining the business risk of false positives and false negatives. Strong candidates show they understand statistical validity under real product constraints: multiple metrics, heterogeneous users, novelty effects, interference, sequential monitoring, and tradeoffs between speed and reliability.

Core knowledge

A hypothesis test starts with a null and an alternative: $H_0:\theta_T-\theta_C=0$ versus $H_A:\theta_T-\theta_C\neq 0$, or a one-sided (directional) alternative. The p-value is $P(\text{data as or more extreme}\mid H_0)$, not the probability that the null is true.
For large A/B tests comparing means, the standard workhorse is a two-sample z-test or Welch t-test:
$\hat\Delta=\bar X_T-\bar X_C,\quad SE(\hat\Delta)=\sqrt{\frac{s_T^2}{n_T}+\frac{s_C^2}{n_C}}.$
With millions of users, the normal approximation is usually accurate; with small samples or heavily skewed data, use bootstrap or randomization inference.
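As a minimal sketch of the formula above, the snippet below computes $\hat\Delta$, its unpooled (Welch-style) standard error, a two-sided p-value, and a confidence interval from summary statistics. It uses the normal approximation rather than a t-distribution, which assumes large samples. The function name `two_sample_z` and all input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_t, var_t, n_t, mean_c, var_c, n_c, alpha=0.05):
    """Large-sample test of a difference in means with unpooled variances."""
    delta = mean_t - mean_c
    se = sqrt(var_t / n_t + var_c / n_c)          # SE from the formula above
    z = delta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta - z_crit * se, delta + z_crit * se)
    return delta, se, z, p_value, ci

# Hypothetical per-arm summaries: mean, sample variance, and user count.
delta, se, z, p_value, ci = two_sample_z(10.3, 25.0, 50_000, 10.2, 24.0, 50_000)
print(f"lift={delta:.3f}, se={se:.4f}, z={z:.2f}, p={p_value:.4f}, ci={ci}")
```

Reporting the interval alongside the p-value keeps the effect size, not just the significance call, in front of the decision maker.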
For binary metrics such as conversion, click, retention, or completion, use a two-proportion z-test:
$SE(\hat p_T-\hat p_C)=\sqrt{\frac{\hat p_T(1-\hat p_T)}{n_T}+\frac{\hat p_C(1-\hat p_C)}{n_C}}.$
For very rare events, exact tests, Poisson models, or logistic regression may be safer than naive normal approximations.
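A hedged sketch of the two-proportion test follows. It reports the unpooled standard error from the formula above, which is the natural choice for a confidence interval, and uses the pooled standard error under $H_0: p_T=p_C$ for the test statistic, which is the conventional textbook choice and an addition to the text above. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x_t, n_t, x_c, n_c):
    """Two-sided z-test for a difference in proportions."""
    p_t, p_c = x_t / n_t, x_c / n_c
    diff = p_t - p_c
    # Unpooled SE (matches the formula above); use it for confidence intervals.
    se_unpooled = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    # Pooled SE under H0: p_T = p_C; conventional for the test statistic.
    p_pool = (x_t + x_c) / (n_t + n_c)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se_unpooled, z, p_value

# Hypothetical counts: 5,200 vs 5,000 conversions out of 100,000 users per arm.
diff, se_unpooled, z, p_value = two_proportion_z(5_200, 100_000, 5_000, 100_000)
print(f"diff={diff:.4f}, se={se_unpooled:.5f}, z={z:.2f}, p={p_value:.3f}")
```

At these sample sizes the pooled and unpooled standard errors are nearly identical; the distinction matters more when arms are unbalanced or rates are very low.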
Power is $P(\text{reject }H_0\mid H_A\text{ true})$, usually targeted at 80% or 90%. For a balanced two-arm, two-sided test on a mean metric with common variance $\sigma^2$, the approximate per-arm sample size is
$n\approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\delta^2},$
where $\delta$ is the minimum detectable effect.
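The sample-size formula translates directly into a few lines of code. This is a sketch assuming a two-sided test, equal allocation, and a known common standard deviation; the function name and the inputs (`sigma=5.0`, `delta=0.1`) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n for a balanced two-sided test on a mean."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Hypothetical metric: standard deviation 5.0, absolute MDE of 0.1.
n = per_arm_sample_size(sigma=5.0, delta=0.1)
n_half_mde = per_arm_sample_size(sigma=5.0, delta=0.05)
print(n, n_half_mde)  # halving the MDE roughly quadruples the sample
```

Because $n$ scales with $1/\delta^2$, negotiating the MDE with the product team is often the highest-leverage part of the power calculation.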
Minimum detectable effect is a business-statistical bridge. A 0.1% relative lift may be statistically detectable at Meta scale but irrelevant if it does not offset engineering, ranking, or user-experience costs. Always distinguish absolute lift, relative lift, and practical significance.
The unit of randomization should match the unit of analysis. If users are randomized but sessions are analyzed as independent, standard errors will be too small because sessions from the same user are correlated. Use user-level aggregation, cluster-robust standard errors, or hierarchical models when repeated observations exist.
Ratio metrics such as CTR, revenue per impression, or messages per active user are tricky because numerator and denominator both vary. Common approaches include user-level ratios, delta method variance, Fieller intervals, bootstrap, or regression with exposure controls. Avoid treating impressions as independent if assignment happened at user level.
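As an illustration of the delta-method approach, the sketch below estimates a ratio of sums (for example, total clicks over total impressions) and its standard error from user-level totals, so that the user stays the independent unit. The function name and the toy data are hypothetical; real data would have far more users.

```python
from math import sqrt

def ratio_with_delta_se(numerators, denominators):
    """Ratio of means and its delta-method SE, one (y, x) pair per user."""
    n = len(numerators)
    mean_y = sum(numerators) / n
    mean_x = sum(denominators) / n
    ratio = mean_y / mean_x
    var_y = sum((y - mean_y) ** 2 for y in numerators) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in denominators) / (n - 1)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(denominators, numerators)
    ) / (n - 1)
    # First-order Taylor expansion of Y-bar / X-bar around the means.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio**2 * var_x) / (n * mean_x**2)
    return ratio, sqrt(var_ratio)

# Hypothetical per-user totals (one entry per randomized user).
clicks = [1, 0, 3, 2, 0, 5]
impressions = [10, 5, 20, 15, 8, 30]
ctr, se = ratio_with_delta_se(clicks, impressions)
print(f"CTR={ctr:.3f}, delta-method SE={se:.4f}")
```

Computing this per arm and combining the two variances gives a standard error for the treatment-control difference in the ratio.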
Multiple testing inflates false positives. If testing 20 independent metrics at $\alpha=0.05$, the chance of at least one false positive when every null is true is about $1-0.95^{20}\approx64\%$. Use pre-registered primary metrics, Bonferroni/Holm for family-wise control, or Benjamini-Hochberg for false discovery rate.
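To make the two correction families concrete, here is a small pure-Python sketch of the Holm step-down and Benjamini-Hochberg step-up procedures. The function names and the example p-values are hypothetical.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values for five metrics.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
print(holm(p_values))                # [True, True, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

On the same inputs Holm rejects fewer hypotheses than Benjamini-Hochberg, which reflects the stricter guarantee: no false positives at all versus a bounded share of false positives among the rejections.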
Sequential peeking breaks nominal p-values. Looking every day and stopping once $p<0.05$ increases Type I error. Use alpha-spending, O’Brien-Fleming boundaries, group sequential designs, always-valid p-values, or Bayesian monitoring if the product team needs continuous launch decisions.
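A quick simulation shows the inflation. The sketch below assumes a true null, unit-variance observations, and equally spaced looks; it compares the false-positive rate when stopping at the first look with $|z|>z_{1-\alpha/2}$ against the rate when testing only at the final look. All parameter values and the function name are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_peeking(n_sims=2000, n_looks=10, n_per_look=100, alpha=0.05, seed=0):
    """Return (error rate with naive peeking, error rate at final look only)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejected_any_look = 0
    rejected_final_look = 0
    for _ in range(n_sims):
        running_sum, n_seen, crossed, z = 0.0, 0, False, 0.0
        for _look in range(n_looks):
            # Sum of n_per_look unit-variance observations with zero true effect.
            running_sum += rng.gauss(0.0, sqrt(n_per_look))
            n_seen += n_per_look
            z = running_sum / sqrt(n_seen)
            if abs(z) > z_crit:
                crossed = True  # a naive peeker would have stopped here
        rejected_any_look += crossed
        rejected_final_look += abs(z) > z_crit
    return rejected_any_look / n_sims, rejected_final_look / n_sims

peeking_rate, fixed_rate = simulate_peeking()
print(f"Type I error with peeking: {peeking_rate:.3f}, fixed horizon: {fixed_rate:.3f}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is typically several times higher; sequential boundaries exist to restore the nominal level.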
Variance reduction is often more valuable than simply waiting longer. CUPED uses pre-experiment covariates to reduce variance:
$Y^{*}=Y-\theta(X-\bar X),\quad \theta=\frac{\operatorname{Cov}(Y,X)}{\operatorname{Var}(X)}.$
It works best when pre-period behavior strongly predicts post-period behavior, common for engagement and revenue metrics.
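The adjustment can be sketched on simulated data. The example assumes a single pre-period covariate and estimates $\theta$ on all users at once; in a real experiment $\theta$ would typically be estimated on pooled treatment and control data and applied identically to both arms. The data-generating numbers and function names are hypothetical.

```python
import random

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and the estimated theta."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    theta = cov_xy / sample_variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)], theta

# Simulated users: post-period metric depends strongly on pre-period metric.
rng = random.Random(42)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.8 * p + rng.gauss(2, 2) for p in pre]

adjusted, theta = cuped_adjust(post, pre)
print(f"theta={theta:.2f}")
print(f"variance before={sample_variance(post):.2f}, after={sample_variance(adjusted):.2f}")
```

The adjusted metric keeps the same mean but has variance reduced by a factor of roughly $1-\rho^2$, where $\rho$ is the correlation between pre-period and post-period values.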
Assumption checks matter before interpreting significance. Validate randomization balance, sample ratio mismatch, logging completeness, exposure consistency, and metric definitions. A significant lift caused by instrumentation bugs, ramp bias, or misbucketed users is not evidence of product impact.
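A sample ratio mismatch check is a chi-square goodness-of-fit test on the assignment counts. The sketch below handles the two-arm case, where the statistic has one degree of freedom and its p-value can be computed from the normal distribution. The function name, counts, and any alerting threshold a team applies to the p-value are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_treatment, n_control, expected_treatment_share=0.5):
    """Chi-square test (1 df) that observed arm sizes match the planned split."""
    total = n_treatment + n_control
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # With 1 df, chi2 is a squared standard normal, so
    # P(chi2_1 > c) = 2 * (1 - Phi(sqrt(c))).
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value

# Hypothetical counts for a planned 50/50 split.
chi2_bad, p_bad = srm_check(50_500, 49_500)   # 1% imbalance
chi2_ok, p_ok = srm_check(50_050, 49_950)     # 0.1% imbalance
print(f"imbalanced: chi2={chi2_bad:.1f}, p={p_bad:.4f}")
print(f"balanced:   chi2={chi2_ok:.1f}, p={p_ok:.3f}")
```

A very small p-value here points to a bucketing or logging problem, and the treatment effect estimate should not be trusted until the cause is found.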
Social products have interference, so one user's outcome can depend on another user's assignment. If treatment changes what one user sees or sends to another, the Stable Unit Treatment Value Assumption (SUTVA) may fail. Consider cluster randomization, ego-network holdouts, geo experiments, or graph-aware analysis for messaging, sharing, groups, and feed interactions.

Worked example

“How would you design and analyze an A/B test for a new Facebook feature?”

A strong candidate would first clarify the product goal: is the feature intended to increase engagement, improve content quality, reduce negative feedback, or drive monetization, and what launch decision depends on the result? They would declare the experimental unit, most likely the user, and ask whether there are network effects that could contaminate control users. The answer should be organized around five pillars: define primary and guardrail metrics, choose the randomization and exposure strategy, calculate sample size and test duration, specify the statistical test, and describe the decision rules that will govern the launch call.

For metrics, they might propose a primary metric such as daily active users, sessions per user, meaningful interactions, or feature adoption, plus guardrails like hide/report rate, latency, notification opt-outs, and long-term retention. For testing, they would aggregate outcomes at the user level and compare treatment versus control using Welch’s t-test or regression with covariates; for binary adoption they would use a proportion test or logistic regression. For power, they would estimate baseline variance from historical data and compute the sample needed to detect a business-relevant MDE at 80–90% power.

A key tradeoff to flag is speed versus reliability: shorter tests support fast iteration but may miss weekly seasonality, novelty effects, or delayed harms. They should also mention sample ratio mismatch checks, pre-experiment balance checks, and avoiding daily p-value peeking unless using sequential methods. A good close would be: “If I had more time, I would examine heterogeneous treatment effects by market, device, and prior engagement, and I would validate whether the short-term metric predicts the long-term outcome we actually care about.”

A second angle

“How would you estimate the sample size needed to detect a 1% change in click-through rate?”

The same statistical machinery applies, but the framing is narrower and more quantitative. The candidate should ask whether “1%” means a 1 percentage-point absolute change or a 1% relative change; for a baseline CTR of 10%, the treatment CTR would be 11% in the first case and 10.1% in the second. Because required sample size scales with $1/\delta^2$, the tenfold smaller effect needs roughly a hundred times more data. They should identify CTR as a proportion or ratio metric and decide whether impressions or users are the correct independent unit. If randomization is user-level, using impression-level $n$ will overstate power because impressions within a user are correlated. The answer should end with a formula-driven estimate and a discussion of whether the resulting sample size is feasible within expected traffic and experiment duration.
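The absolute-versus-relative distinction can be checked numerically. This sketch assumes a two-sided test at $\alpha=0.05$ with 80% power, equal allocation, and independent binary observations; if users are the randomization unit and each user has many impressions, the variance should instead come from user-level data. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def per_arm_n_for_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-arm n to detect p_treatment vs p_control."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta**2)

# Baseline CTR of 10%, as in the example above.
n_absolute = per_arm_n_for_proportions(0.10, 0.11)    # +1 percentage point
n_relative = per_arm_n_for_proportions(0.10, 0.101)   # +1% relative
print(f"absolute: {n_absolute:,} per arm")   # on the order of 15 thousand
print(f"relative: {n_relative:,} per arm")   # on the order of 1.4 million
```

Stating both numbers in the interview makes the feasibility discussion concrete: the first is a small slice of traffic, while the second may dictate the test duration.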

Common pitfalls

Analytical mistake: treating statistical significance as product significance.
A tempting answer is, “The p-value is below 0.05, so we should ship.” A better answer separates detection from decision: quantify effect size, confidence interval, guardrail impact, practical value, and whether the result survives assumption checks and multiple-testing correction.

Communication mistake: jumping into formulas before defining the decision.
Candidates often start with a t-test without asking what metric matters, what the randomization unit is, or what harm must be avoided. Interviewers want to see a product-statistics bridge: “First I’d define the launch criterion and primary metric, then choose the test that matches the data-generating process.”

Depth mistake: ignoring dependence and interference.
In consumer social products, observations are rarely cleanly independent: one user can generate impressions, messages, comments, or notifications for another. A stronger answer calls out clustered data, repeated measures, and network spillovers, then proposes user-level aggregation, cluster-robust errors, or cluster/geo-level randomization when needed.

Connections

Expect pivots into experimentation design, causal inference, metric design, and regression modeling. If the interviewer pushes on validity, be ready for discussions of selection bias, difference-in-differences, instrumental variables, heterogeneous treatment effects, and network interference. If they push on execution, expect questions about logging, sample ratio mismatch, sequential testing, and experiment platform design.