Hypothesis Tests, Confidence Intervals, and P-Values

What's being tested

You need to show that you can reason from a product claim to a statistically valid decision: define the estimand, choose the right test, interpret uncertainty, and avoid overclaiming. Meta-style questions often focus less on plugging numbers into a formula and more on whether you recognize issues like randomization unit, metric distribution, multiple testing, and practical vs statistical significance.

Core knowledge

A p-value is \(P(\text{data at least as extreme as observed} \mid H_0)\), not the probability that the null hypothesis is true.
A 95% confidence interval means the procedure covers the true parameter in 95% of repeated samples; it does not mean there is a 95% posterior probability that the parameter lies in this particular interval.
For A/B tests, the randomization unit must match interference risk: user-level for feed changes, session-level only if carryover between sessions is negligible.
Use z/t-tests for large-sample mean differences; use proportion tests for binary outcomes like click-through or conversion.
Heavy-tailed metrics like revenue or time spent often need winsorization, bootstrap CIs, delta method, or nonparametric checks.
Statistical significance does not imply product significance; compare effect size to launch threshold, cost, and guardrail metrics.
Watch for peeking, multiple comparisons, sample ratio mismatch, novelty effects, and heterogeneous treatment effects.
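One of the validity checks named above, sample ratio mismatch, can be sketched as a chi-square goodness-of-fit test on the arm counts. This is a minimal illustration using only the standard library; the function name `srm_check`, the counts, and the intended 50/50 split are assumptions made for the example, not values from any real experiment.

```python
import math

def srm_check(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 df) for sample ratio mismatch.

    Hypothetical helper: compares observed arm sizes with the split the
    experiment was configured to produce.
    """
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_treatment - expected_t) ** 2 / expected_t
            + (n_control - expected_c) ** 2 / expected_c)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Illustrative counts only: a 0.3% imbalance on roughly a million users.
chi2, p = srm_check(n_control=500_000, n_treatment=503_000)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A very small p-value here suggests the assignment or logging pipeline is not delivering the intended split, in which case the treatment effect estimate should not be trusted until the cause is found. Teams often use a stricter threshold for this check than for the experiment itself (0.001 is a common choice), but treat that as a convention rather than a rule.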

Worked example

“Is a +0.3% lift in CTR statistically significant?”

A strong answer starts by clarifying whether +0.3% means absolute percentage points or relative lift, and what the unit of analysis is: impressions, sessions, or users. Then frame the null and alternative, e.g. \(H_0: p_T - p_C = 0\) against \(H_1: p_T - p_C \neq 0\), and use a two-proportion test if CTR is binary at the impression level, with cluster-robust (user-level) variance if users contribute many impressions. Before interpreting the p-value, check experiment validity: randomization, sample ratio mismatch, logging bugs, and whether the test ran for a full business cycle. Finally, pair the p-value with a confidence interval and ask whether the interval excludes effects too small to matter, and whether the intervals on guardrail metrics rule out harmful movements.
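The test and interval described above can be sketched as follows. This is a minimal two-proportion z-test using only the standard library; the function name `two_proportion_ztest` and the click and impression counts are hypothetical, chosen so that the lift is +0.3 percentage points absolute (5.0% to 5.3%). It treats impressions as independent, so if users contribute many impressions the standard error will be understated and a user-level or clustered variance estimate is needed instead.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x_c, n_c, x_t, n_t, alpha=0.05):
    """Two-sided z-test for p_T - p_C, plus a Wald confidence interval.

    Assumes independent binary observations in each arm.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c

    # Pooled standard error: under H0 both arms share one click rate.
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed lift.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, z, p_value, ci

# Hypothetical counts: 100,000 impressions per arm.
diff, z, p_value, (lo, hi) = two_proportion_ztest(5_000, 100_000, 5_300, 100_000)
print(f"lift = {diff:.4f}, z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI for absolute lift: [{lo:.4f}, {hi:.4f}]")
```

Reading the interval against a launch threshold is the step that connects the statistics to the decision: an interval that excludes zero but also includes lifts too small to justify the launch cost is significant without being decisive.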

A common pitfall

A tempting but weak answer is to say “p < 0.05, so we should launch.” That skips the estimand, assumes independent observations, and ignores whether the lift is meaningful for the product. At Meta scale, tiny effects can be highly statistically significant while being operationally irrelevant or offset by guardrail regressions like increased hides, unfollows, latency, or long-term retention loss. Conversely, a non-significant result is not proof of no effect; it may indicate low power, high variance, or an effect smaller than the experiment was designed to detect.
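To make the point about low power concrete, here is a rough sketch of the approximate power of a two-sided two-proportion test under a normal approximation. The function name `approx_power`, the 5% baseline CTR, the +0.3 percentage point lift, and the sample sizes are all assumptions for illustration, and the calculation again treats observations as independent.

```python
import math
from statistics import NormalDist

def approx_power(p_control, abs_lift, n_per_arm, alpha=0.05):
    """Approximate power to detect an absolute lift in a proportion.

    Normal approximation; ignores the negligible chance of rejecting
    in the wrong direction.
    """
    p_treat = p_control + abs_lift
    se = math.sqrt(p_control * (1 - p_control) / n_per_arm
                   + p_treat * (1 - p_treat) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - abs(abs_lift) / se)

# Same true effect, two hypothetical experiment sizes.
for n in (10_000, 100_000):
    print(f"n per arm = {n:>7,}: power ~ {approx_power(0.05, 0.003, n):.2f}")
```

Under these assumptions the smaller experiment would miss a real +0.3 point lift most of the time, so a non-significant result there says little about whether the effect exists.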
