Hypothesis Testing, Power, And Confidence Intervals
Asked of: Data Scientist

What's being tested
Interviewers are probing whether you can make statistically defensible product and model decisions when metrics are noisy, effects are small, and the business cost of being wrong is asymmetric. For a Meta Data Scientist, this shows up in Feed, Reels, Ads, integrity, and notification experiments where a 0.1% lift may matter, but false positives can degrade user trust or advertiser value. They are not testing memorized definitions; they are testing whether you can choose the right test, interpret uncertainty, reason about power, and explain what to do when results are inconclusive. A strong answer connects formulas to decisions: launch, iterate, collect more data, segment, or redesign the experiment.
Core knowledge
- Null and alternative hypotheses should be stated in terms of the decision metric: $H_0: p_A = p_B$ versus $H_1: p_A \neq p_B$ for two-sided tests, or $H_1: p_A > p_B$ when superiority is pre-registered. Avoid switching direction after seeing the data.
- Two-proportion z-tests are the standard tool for binary outcomes like click-through rate, conversion rate, or model accuracy. For arms with successes $x_A, x_B$ and sizes $n_A, n_B$, use $\hat{p}_A = x_A / n_A$, $\hat{p}_B = x_B / n_B$, and test $H_0: p_A = p_B$.
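- Pooled standard error is used under the null hypothesis for significance testing. With $x_A, x_B$ successes out of $n_A, n_B$ units, $\hat{p} = \frac{x_A + x_B}{n_A + n_B}$ and $SE_{\text{pooled}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$. The z-statistic is $z = \frac{\hat{p}_A - \hat{p}_B}{SE_{\text{pooled}}}$.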
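- Unpooled standard error is typically used for confidence intervals. With observed rates $\hat{p}_A, \hat{p}_B$ and sizes $n_A, n_B$, $SE_{\text{unpooled}} = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}$. A 95% confidence interval is $(\hat{p}_A - \hat{p}_B) \pm 1.96 \cdot SE_{\text{unpooled}}$.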
- Normal approximation conditions matter. The z-test is usually fine when each arm has at least about 10 expected successes and 10 expected failures. For tiny samples, rare events, or sparse safety metrics, use Fisher’s exact test, exact binomial intervals, or Bayesian beta-binomial modeling.
- P-values are probabilities of observing data at least this extreme assuming $H_0$ is true, not probabilities that $H_0$ is true. A result with $p > 0.05$ is not “no effect”; it means the data are insufficient to reject $H_0$ at the chosen threshold.
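- Statistical power is $1 - \beta$, the probability of detecting an effect of a chosen size if it is real. For a two-arm proportion test, approximate per-arm sample size is $n \approx \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \, [p_A(1-p_A) + p_B(1-p_B)]}{\delta^2}$, where $p_A$ and $p_B$ are the anticipated arm rates and $\delta = |p_A - p_B|$ is the minimum detectable effect.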
- Minimum detectable effect should be business-driven, not reverse-engineered from sample size. For Ads conversion lift or Reels retention, ask what absolute or relative change would justify launch after considering user harm, advertiser value, engineering cost, and guardrail metrics.
- Confidence intervals communicate effect size and uncertainty better than p-values alone. A non-significant test whose CI sits mostly above zero and excludes meaningful harm is very different from one whose CI is wide and spans both meaningful harm and meaningful benefit; the first may still support a cautious launch.
- Multiple testing correction is needed when testing many metrics, segments, variants, or model comparisons. Bonferroni correction controls family-wise error using a per-test threshold of $\alpha / m$ for $m$ tests, while Benjamini–Hochberg controls false discovery rate and is often more powerful for exploratory metric or cohort scans.
- Sequential monitoring inflates Type I error if you repeatedly peek and stop when significant. Use pre-planned looks with alpha spending, group sequential designs, or always-valid methods; otherwise, a nominal 5% test can have much higher false-positive probability.
- Practical significance differs from statistical significance. With tens of millions of impressions, a trivial lift can be highly significant but irrelevant; with small creator or advertiser cohorts, a large effect may be directionally important but underpowered.
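The power bullet can be made concrete with a minimal sketch of the usual normal-approximation formula, n ≈ (z_{1-α/2} + z_{1-β})² · [p_A(1-p_A) + p_B(1-p_B)] / δ². The function name, its parameters, and the defaults of α = 0.05 and 80% power are illustrative assumptions, not values prescribed by this guide.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided two-proportion z-test.

    baseline: assumed control rate (hypothetical input).
    mde_abs:  absolute minimum detectable effect, e.g. 0.01 for +1 point.
    """
    p_a = baseline
    p_b = baseline + mde_abs
    # Standard normal quantiles for the significance level and the power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the per-arm Bernoulli variances under the alternative.
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)


if __name__ == "__main__":
    # Detecting a 10% -> 11% change needs on the order of 15k users per arm.
    print(per_arm_sample_size(baseline=0.10, mde_abs=0.01))
```

Because n scales with 1/δ², halving the minimum detectable effect roughly quadruples the required sample, which is why the MDE should be fixed from business value before the test starts.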
Tip: In Meta-style interviews, say the test, the assumptions, the effect estimate, the uncertainty interval, and the decision implication. That sequence sounds much stronger than jumping straight to “p less than 0.05 means launch.”
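The sequence in the tip (test, effect estimate, interval) can be sketched as follows for two independent arms. This is an illustrative helper, not a reference implementation: the function name and the returned keys are assumptions, the difference is taken as arm A minus arm B, and the normal approximation is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(x_a, n_a, x_b, n_b, conf_level=0.95):
    """Two-sided z-test (pooled SE) plus a CI (unpooled SE) for p_A - p_B.

    Assumes independent arms and a non-degenerate pooled rate (not 0 or 1).
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b

    # Pooled SE: used for the test because H0 says both arms share one rate.
    pooled = (x_a + x_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled SE: used for the interval because it does not assume H0.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    print(two_proportion_summary(x_a=1100, n_a=10_000, x_b=1000, n_b=10_000))
```

Reporting `diff` and `ci` alongside `p_value` keeps the answer anchored on effect size and uncertainty rather than on the significance threshold alone.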
Worked example
For “Diagnose a non-significant experiment outcome,” a strong candidate starts by clarifying the primary metric, unit of randomization, planned duration, target minimum detectable effect, and whether the test was two-sided or one-sided. They would state: “I would not conclude the feature has no impact; I would distinguish between evidence of no effect and lack of evidence due to noise or design issues.” The answer can be organized into four pillars: validate experiment setup, quantify uncertainty, assess power, and decide next steps under product risk. Under setup, check sample ratio mismatch, pre-period balance, exposure logging consistency, and whether users actually received the treatment; as a DS, you query these as diagnostic signals rather than redesigning pipelines. Under uncertainty, report the point estimate and confidence interval, not just whether $p < 0.05$, because a wide interval may include meaningful upside or downside. Under power, compare achieved sample size and variance to the pre-experiment power calculation, and ask whether the observed effect is below the planned minimum detectable effect. A tradeoff to flag is asymmetric loss: for a notification feature, a false positive may annoy users, while for an internal ranking model with strong offline evidence, a false negative may delay value. The close should be decision-oriented: “If the CI excludes meaningful harm but includes meaningful upside, I might extend the test or run a targeted follow-up; if it is tightly centered near zero, I would deprioritize.” If given more time, also examine heterogeneous effects by high-intent users, new users, geography, or device, while controlling for multiple comparisons.
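Two of the steps above lend themselves to a short sketch: a sample ratio mismatch check and a rule that maps the confidence interval to a next step. Both helpers are hypothetical. The SRM check is a chi-square goodness-of-fit test with one degree of freedom, and the triage labels and the symmetric harm/benefit threshold `mde` are assumptions layered on the decision rule quoted in the worked example.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


def triage(ci_low, ci_high, mde):
    """Map a CI on the treatment lift to a next step.

    mde > 0 is the smallest lift (or harm) considered meaningful.
    """
    if ci_low > 0:
        return "significant positive: check guardrails and practical size"
    if ci_high < 0:
        return "significant negative: do not launch as is"
    if ci_low > -mde and ci_high < mde:
        return "tightly near zero: deprioritize"
    if ci_low > -mde:
        return "harm excluded, upside plausible: extend or run a follow-up"
    return "inconclusive: likely underpowered, revisit the design"


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(srm_p_value(50_000, 51_500))
    print(triage(ci_low=-0.002, ci_high=0.012, mde=0.005))
```

A very small SRM p-value suggests the assignment or logging is broken, in which case the effect estimate should not be interpreted until the cause is found.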
A second angle
For “Determine Superiority of Model A Using Hypothesis Testing,” the same ideas apply, but the framing is more direct: compare two models’ binary success proportions, such as correct classification rate, click prediction wins, or acceptance rate. The candidate should define whether the comparison is paired or independent; if the same examples are scored by both models, McNemar’s test may be better than a two-proportion z-test because outcomes are correlated. If the models are evaluated on separate randomized traffic buckets, a two-proportion z-test with pooled standard error is appropriate for the hypothesis test. The key constraint is avoiding overclaiming “Model A is better” from statistical significance alone; also ask whether the effect is practically meaningful and whether guardrail metrics such as latency-sensitive engagement, negative feedback, or fairness slices moved adversely. If many models or thresholds were tried offline, the nominal p-value is optimistic unless selection and multiple testing are addressed.
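For the paired case, a minimal sketch of the exact (binomial) form of McNemar’s test is shown below. Only the discordant pairs matter: `b` counts examples where Model A is right and Model B is wrong, and `c` counts the reverse. The function name and argument names are illustrative assumptions.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: examples Model A got right and Model B got wrong.
    c: examples Model A got wrong and Model B got right.
    Under H0, each discordant pair favors either model with probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # No discordant pairs: no evidence either way.
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to the smaller discordant count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up counts: A wins 15 discordant examples, B wins 5.
    print(mcnemar_exact(15, 5))
```

Examples on which both models agree drop out of the test entirely, which is why a paired design can detect a difference that an independent two-proportion comparison on the same data would miss.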
Common pitfalls
Pitfall: Treating a non-significant result as proof of no effect.
The tempting answer is “p is greater than 0.05, so the treatment does not work.” A better answer is “we failed to reject the null; I need the confidence interval and power to know whether meaningful effects remain plausible.”
Pitfall: Reporting formulas without tying them to the launch decision.
Interviewers do not want only a z-statistic or p-value. They want to hear what the result means for DAU, retention, conversions, revenue, or user harm, and whether the risk tolerance supports launching, iterating, or collecting more data.
Pitfall: Ignoring dependence, repeated looks, or multiple comparisons.
A wrong-but-common move is to run many segment tests, pick the significant one, and present it as confirmatory. A stronger answer separates pre-registered primary analysis from exploratory follow-ups, then applies Bonferroni, Benjamini–Hochberg, or a follow-up validation experiment.
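The two corrections named above can be sketched in a few lines. This is an illustrative implementation under the usual assumptions (Benjamini–Hochberg in its standard step-up form, which controls FDR for independent or positively dependent tests); the function names and the example p-values are made up.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up BH procedure (controls false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    segment_p_values = [0.001, 0.012, 0.025, 0.038, 0.20]  # made-up values
    print(bonferroni_reject(segment_p_values))
    print(benjamini_hochberg_reject(segment_p_values))
```

On these example values Bonferroni rejects only the smallest p-value while Benjamini–Hochberg rejects the first four, which illustrates why BH is often preferred for exploratory segment scans and Bonferroni for a small set of confirmatory metrics.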
Connections
Interviewers may pivot from here into experiment design, especially randomization unit, interference, holdouts, and guardrail metrics. They may also ask about causal inference when randomization is imperfect, Bayesian A/B testing for posterior probabilities and decision loss, or metric design for ratio metrics, long-tailed outcomes, and heterogeneous treatment effects.
Further reading
- Trustworthy Online Controlled Experiments, Kohavi, Tang, and Xu: Practical reference for A/B testing, metrics, power, pitfalls, and decision-making in large-scale product experimentation.
- Benjamini and Hochberg (1995), Controlling the False Discovery Rate: Seminal paper behind FDR control, useful when discussing many metrics or segments.
- Causal Inference for Statistics, Social, and Biomedical Sciences, Imbens and Rubin: Deeper foundation for potential outcomes, randomization, and what experiments identify.
Practice questions
- Estimate variance for ratio metrics (Meta · Data Scientist · Onsite · hard)
- Diagnose a non-significant experiment outcome (Meta · Data Scientist · Onsite · medium)
- Compute p-values, power, and adjust errors (Meta · Data Scientist · Onsite · hard)
- Test two models' proportions for significance (Meta · Data Scientist · Onsite · medium)
- Compute sample size and test duration correctly (Meta · Data Scientist · Technical Screen · hard)
- Quantify launch decision with tests and guardrails (Meta · Data Scientist · Technical Screen · medium)
- Construct a 95% Confidence Interval for Comment Counts (Meta · Data Scientist · Onsite · medium)
- Evaluate Marketing Campaign's Click-Through Rate Effectiveness (Meta · Data Scientist · Onsite · medium)
Related concepts
- Statistical Inference, Power, And Confidence Intervals (Statistics & Math)
- Statistical Inference, Hypothesis Tests, And Power
- Hypothesis Tests, Confidence Intervals, And P-Values
- Statistical Inference, Hypothesis Testing, And Power
- Power Analysis And Statistical Inference (Statistics & Math)
- Central Limit Theorem, Confidence Intervals, And Power (Statistics & Math)