Hypothesis Testing, Power, and Confidence Intervals

What's being tested

Interviewers are probing whether you can make statistically defensible product and model decisions when metrics are noisy, effects are small, and the business cost of being wrong is asymmetric. For a Meta Data Scientist, this shows up in Feed, Reels, Ads, integrity, and notification experiments where a 0.1% lift may matter, but false positives can degrade user trust or advertiser value. They are not testing memorized definitions; they are testing whether you can choose the right test, interpret uncertainty, reason about power, and explain what to do when results are inconclusive. A strong answer connects formulas to decisions: launch, iterate, collect more data, segment, or redesign the experiment.

Core knowledge

Null and alternative hypotheses should be stated in terms of the decision metric: $H_0: p_A = p_B$ versus $H_1: p_A \ne p_B$ for two-sided tests, or $H_1: p_A > p_B$ when superiority is pre-registered. Avoid switching direction after seeing the data.
Two-proportion z-tests are the standard tool for binary outcomes like click-through rate, conversion rate, or model accuracy. For arms with successes $x_A, x_B$ and sizes $n_A, n_B$, use $\hat p_A = x_A/n_A$, $\hat p_B = x_B/n_B$, and test $\hat p_A - \hat p_B$.
Pooled standard error is used under the null hypothesis for significance testing: $\hat p = (x_A + x_B)/(n_A + n_B)$ and $SE_{pooled} = \sqrt{\hat p(1 - \hat p)(1/n_A + 1/n_B)}.$ The z-statistic is $z = (\hat p_A - \hat p_B)/SE_{pooled}$.
Unpooled standard error is typically used for confidence intervals: $SE_{unpooled} = \sqrt{\hat p_A(1 - \hat p_A)/n_A + \hat p_B(1 - \hat p_B)/n_B}.$ A 95% confidence interval is $(\hat p_A - \hat p_B) \pm 1.96 \, SE_{unpooled}$.
Normal approximation conditions matter. The z-test is usually fine when each arm has at least about 10 expected successes and 10 expected failures. For tiny samples, rare events, or sparse safety metrics, use Fisher’s exact test, exact binomial intervals, or Bayesian beta-binomial modeling.
P-values are probabilities of observing data at least this extreme assuming $H_0$ is true, not probabilities that $H_0$ is true. A result with $p = 0.08$ is not “no effect”; it means the data do not provide enough evidence to reject $H_0$ at a pre-specified $\alpha = 0.05$.
Statistical power is $1 - \beta$, the probability of detecting an effect of a chosen size if it is real. For a two-arm proportion test, the approximate per-arm sample size is $n \approx \frac{2\bar p(1 - \bar p)(z_{1 - \alpha/2} + z_{1 - \beta})^2}{\delta^2},$ where $\bar p$ is the average of the two arms' anticipated proportions and $\delta$ is the minimum detectable effect expressed as an absolute difference in proportions.
Minimum detectable effect should be business-driven, not reverse-engineered from sample size. For Ads conversion lift or Reels retention, ask what absolute or relative change would justify launch after considering user harm, advertiser value, engineering cost, and guardrail metrics.
Confidence intervals communicate effect size and uncertainty better than p-values alone. A non-significant test with CI $[-0.1\%, +2.0\%]$ is very different from one with CI $[-5.0\%, +5.0\%]$; the first may still support a cautious launch.
Multiple testing correction is needed when testing many metrics, segments, variants, or model comparisons. Bonferroni correction controls family-wise error by testing each of $m$ hypotheses at level $\alpha/m$, while Benjamini–Hochberg controls false discovery rate and is often more powerful for exploratory metric or cohort scans.
Sequential monitoring inflates Type I error if you repeatedly peek and stop when significant. Use pre-planned looks with alpha spending, group sequential designs, or always-valid methods; otherwise, a nominal 5% test can have much higher false-positive probability.
Practical significance differs from statistical significance. With tens of millions of impressions, a trivial lift can be highly significant but irrelevant; with small creator or advertiser cohorts, a large effect may be directionally important but underpowered.
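
The test statistic, confidence interval, and sample-size formulas above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: the function names `two_proportion_ztest` and `per_arm_sample_size` are hypothetical, the example counts are made up, and the normal approximation is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_proportion_ztest(x_a, n_a, x_b, n_b, alpha=0.05):
    """Two-sided z-test (pooled SE) plus a CI for p_A - p_B (unpooled SE)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b

    # Pooled SE: used for the test because H0 assumes one shared proportion.
    p_pool = (x_a + x_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))

    # Unpooled SE: used for the interval because it does not assume H0.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


def per_arm_sample_size(p_bar, delta, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided test; delta is an absolute difference."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2


# Made-up example: 11.0% vs 10.0% conversion with 10,000 users per arm.
result = two_proportion_ztest(1100, 10_000, 1000, 10_000)
print(result)
print(round(per_arm_sample_size(p_bar=0.10, delta=0.01)))
```

Reporting `diff` and `ci` alongside `p_value` keeps the answer anchored on effect size and uncertainty rather than on the significance threshold alone.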

Tip: In Meta-style interviews, say the test, the assumptions, the effect estimate, the uncertainty interval, and the decision implication. That sequence sounds much stronger than jumping straight to “p less than 0.05 means launch.”

Worked example

For the question “Diagnose a non-significant experiment outcome,” a strong candidate starts by clarifying the primary metric, unit of randomization, planned duration, target minimum detectable effect, and whether the test was two-sided or one-sided. They would state: “I would not conclude the feature has no impact; I would distinguish between evidence of no effect and lack of evidence due to noise or design issues.” The answer can be organized into four pillars: validate experiment setup, quantify uncertainty, assess power, and decide next steps under product risk. Under setup, check sample ratio mismatch, pre-period balance, exposure logging consistency, and whether users actually received the treatment; as a DS, you query these as diagnostic signals rather than redesigning pipelines. Under uncertainty, report the point estimate and confidence interval, not just $p > 0.05$, because a wide interval may include meaningful upside or downside. Under power, compare achieved sample size and variance to the pre-experiment power calculation, and ask whether the observed effect is below the planned minimum detectable effect. A tradeoff to flag is asymmetric loss: for a notification feature, a false positive may annoy users, while for an internal ranking model with strong offline evidence, a false negative may delay value. The close should be decision-oriented: “If the CI excludes meaningful harm but includes meaningful upside, I might extend the test or run a targeted follow-up; if it is tightly centered near zero, I would deprioritize.” If given more time, also examine heterogeneous effects by high-intent users, new users, geography, or device, while controlling for multiple comparisons.
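
Two of the diagnostics named above, the sample ratio mismatch check and the power comparison, can be sketched as follows. This is an illustrative sketch under stated assumptions: `srm_p_value` and `achieved_mde` are hypothetical helper names, the SRM check is a chi-square goodness-of-fit test with one degree of freedom for a two-arm experiment, and `achieved_mde` simply inverts the per-arm sample-size approximation for proportions.

```python
from math import erfc, sqrt
from statistics import NormalDist


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 dof) for the observed arm sizes."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(chi2_1 > s) equals erfc(sqrt(s / 2)).
    return erfc(sqrt(chi2 / 2))


def achieved_mde(p_bar, n_per_arm, alpha=0.05, power=0.80):
    """Absolute effect detectable at the given power with the realized n per arm."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)


# Made-up numbers: a 50/50 design that ended with 50,000 vs 51,000 users.
print(srm_p_value(50_000, 51_000))
# Baseline rate of 10% with about 14,000 users per arm.
print(achieved_mde(p_bar=0.10, n_per_arm=14_000))
```

A very small SRM p-value suggests the assignment or logging is broken, in which case the effect estimate should not be trusted. If the achieved MDE is larger than the planned one, a non-significant result says little about whether the planned effect exists.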

A second angle

For the question “Determine Superiority of Model A Using Hypothesis Testing,” the same ideas apply, but the framing is more direct: compare two models’ binary success proportions, such as correct classification rate, click prediction wins, or acceptance rate. The candidate should define whether the comparison is paired or independent; if the same examples are scored by both models, McNemar’s test may be better than a two-proportion z-test because outcomes are correlated. If the models are evaluated on separate randomized traffic buckets, a two-proportion z-test with pooled standard error is appropriate for the hypothesis test. The key constraint is avoiding overclaiming “Model A is better” from statistical significance alone; also ask whether the effect is practically meaningful and whether guardrail metrics such as latency-sensitive engagement, negative feedback, or fairness slices moved adversely. If many models or thresholds were tried offline, the nominal p-value is optimistic unless selection and multiple testing are addressed.
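
For the paired case, a minimal sketch of the exact (binomial) form of McNemar’s test is shown below. The function name `mcnemar_exact` and the example counts are hypothetical. The test uses only the discordant pairs: `b` counts examples where Model A is correct and Model B is wrong, and `c` counts the reverse. Under the null hypothesis, `b` follows a Binomial(`b + c`, 0.5) distribution.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Lower-tail probability P(X <= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test and cap the result at 1.
    return min(1.0, 2 * tail)


# Made-up example: A wins on 15 discordant examples, B wins on 5.
print(mcnemar_exact(15, 5))
```

Examples that both models get right or both get wrong carry no information about which model is better, which is why they do not enter the calculation.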

Common pitfalls

Pitfall: Treating a non-significant result as proof of no effect.

The tempting answer is “p is greater than 0.05, so the treatment does not work.” A better answer is “we failed to reject the null; I need the confidence interval and power to know whether meaningful effects remain plausible.”

Pitfall: Reporting formulas without tying them to the launch decision.

Interviewers do not want only $z = (\hat p_A - \hat p_B)/SE$. They want to hear what the result means for DAU, retention, conversions, revenue, or user harm, and whether the risk tolerance supports launching, iterating, or collecting more data.

Pitfall: Ignoring dependence, repeated looks, or multiple comparisons.

A wrong-but-common move is to run many segment tests, pick the significant one, and present it as confirmatory. A stronger answer separates pre-registered primary analysis from exploratory follow-ups, then applies Bonferroni, Benjamini–Hochberg, or a follow-up validation experiment.
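
To make the two corrections concrete, here is a small sketch that applies Bonferroni and the Benjamini–Hochberg step-up procedure to a list of segment-level p-values. The function names and the p-values are hypothetical, and the Benjamini–Hochberg guarantee assumed here is the usual one for independent (or positively dependent) tests.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= k * alpha / m,
    then reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_k = rank
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Made-up p-values from five exploratory segment cuts.
segment_p = [0.001, 0.012, 0.025, 0.045, 0.20]
print(bonferroni_reject(segment_p))
print(benjamini_hochberg_reject(segment_p))
```

On these made-up inputs Benjamini–Hochberg flags more segments than Bonferroni, which illustrates why it is often preferred for exploratory scans. Any flagged segment is still a hypothesis to validate in a follow-up experiment, not a confirmed finding.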

Connections

Interviewers may pivot from here into experiment design, especially randomization unit, interference, holdouts, and guardrail metrics. They may also ask about causal inference when randomization is imperfect, Bayesian A/B testing for posterior probabilities and decision loss, or metric design for ratio metrics, long-tailed outcomes, and heterogeneous treatment effects.
