Statistical Inference, Power, And Metric Uncertainty
Asked of: Data Scientist
What's being tested
Meta Data Scientists are expected to reason from noisy user-level data to defensible product conclusions: estimate quantities like average comments per `DAU`, quantify uncertainty, compare model or product variants, and avoid false discoveries. Interviewers are probing whether you know when Central Limit Theorem approximations are valid, how to construct and interpret confidence intervals, how to handle skewed or count-based metrics, and how to design tests without inflating false-positive rates. They also care whether you can communicate assumptions clearly: independence, random sampling, treatment assignment, metric definition, and whether the uncertainty is statistical, measurement-related, or causal. Strong answers connect formulas to product decisions, such as whether a ranking change, comment composer tweak, or chatbot model is ready to ship.
Core knowledge
- Expectation is the long-run average of a random variable: $E[X] = \sum_x x \, P(X = x)$ for discrete outcomes. For user comments, the sample mean $\bar{X}$ estimates expected comments per user or per `DAU`, depending on the sampling unit.
- Sample variance uses $n - 1$, not $n$, when estimating population variance from data: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$. Use the population standard deviation $\sigma$ only when the full population distribution is known, which is rare in interview settings.
- Central Limit Theorem says $\bar{X}$ is approximately normal for large $n$ if observations are independent enough and variance is finite: $\bar{X} \approx N(\mu, \sigma^2 / n)$. It can work well for user-level metrics with large `DAU`, but heavy tails and clustering slow convergence.
- A 95% confidence interval for a mean is commonly $\bar{X} \pm 1.96 \, s / \sqrt{n}$ when $n$ is large. For small samples, use a t-interval: $\bar{X} \pm t_{n-1,\,0.975} \, s / \sqrt{n}$, especially if variance is estimated and normality is plausible.
- Count data like comments per user are often skewed, zero-inflated, and overdispersed relative to a Poisson model, where $\mathrm{Var}(X) = E[X] = \lambda$. A negative binomial or nonparametric approach is often more realistic if a few highly active users dominate variance.
- Bootstrap inference resamples users with replacement and recomputes the statistic, producing an empirical uncertainty distribution. For most interview-scale answers, 1,000–10,000 bootstrap replicates are enough. Even for very large samples, resample at the user level rather than the event level to preserve the analysis unit.
- Independence is an assumption, not a given. User outcomes can be correlated through social graph effects, shared content, geography, or time shocks. If treatment is assigned by cluster, session, page, or conversation, standard errors must reflect that assignment unit.
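- Power is the probability of detecting a true effect: $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$. For a two-sample mean comparison with equal group sizes, the approximate required per-arm sample size is $n \approx \frac{2 (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{\delta^2}$, where $\delta$ is the minimum detectable effect.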
- Hypothesis testing separates effect size from uncertainty. A tiny lift in comments can be statistically significant with millions of users but product-irrelevant. Always report both the estimate and interval, for example “+0.3% comments per `DAU`, 95% CI [+0.1%, +0.5%].”
- Sequential testing requires a pre-planned correction if you repeatedly peek at results. Pocock boundaries spend alpha relatively evenly; O’Brien–Fleming boundaries are stricter early and closer to conventional thresholds later. Naively stopping as soon as $p < 0.05$ inflates false positives.
- Always-valid inference methods such as mixture SPRT, e-values, or confidence sequences allow continuous monitoring while controlling error rates under specified assumptions. In a DS interview, you do not need to derive them fully, but you should know why they avoid p-hacking better than ad hoc peeking.
- Joint probability questions often test whether you distinguish independence from correlation. Writing $H$ for “honest” and $R$ for “relevant” chatbot answers, if the two are independent, $P(H \cap R) = P(H) \, P(R)$; without independence, use $P(H \cap R) = P(H) \, P(R \mid H)$ and ask how labels were collected.
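The per-arm sample-size formula from the power bullet above can be sketched in a few lines. This is a minimal illustration, not a production power calculator: it assumes a two-sided z-test, equal arm sizes, and a common known standard deviation. The function name `per_arm_sample_size` and the input values (`sigma=3.0`, `delta=0.05`) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def per_arm_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-sample z-test on means.

    Implements n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2,
    assuming equal arm sizes and a common standard deviation `sigma`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)


# Hypothetical inputs: comments per user has SD 3.0, and the minimum
# detectable effect is an absolute lift of 0.05 comments per user.
n_per_arm = per_arm_sample_size(sigma=3.0, delta=0.05)
n_half_mde = per_arm_sample_size(sigma=3.0, delta=0.025)
print(n_per_arm, n_half_mde)
```

Because $\delta$ enters squared, halving the minimum detectable effect roughly quadruples the required sample, which is usually the point worth saying out loud when an interviewer asks about test duration.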
Worked example
For the question “Analyze Central Limit Theorem in User Comment Distribution,” a strong candidate first clarifies the sampling unit: “Are we sampling users from a day’s `DAU`, sessions, or comments? Is the target average comments per active user, total comments, or expected comments for a randomly chosen user?” They would then state their assumptions: observations are user-level, sampled randomly from the relevant population, and each user contributes one count of comments for the day.
The answer should then be organized around four pillars. First, define the estimator: $\bar{X}$ estimates expected comments per active user, while $N \bar{X}$ estimates total comments for a population of size $N$ only if the sample represents that population. Second, discuss variability using $SE(\bar{X}) = s / \sqrt{n}$, emphasizing that skewed counts can still have an approximately normal mean when $n$ is large. Third, build a confidence interval with either a normal or t critical value depending on sample size. Fourth, interpret the interval in product language: repeated samples would produce intervals covering the true mean about 95% of the time, not “there is a 95% probability this specific interval contains the truth.”
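The first three pillars can be sketched as follows. The data here are simulated, not real: `simulate_comment_count` is a hypothetical generator for zero-inflated, right-skewed counts, and the population size `N` is an arbitrary illustrative value.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)


def simulate_comment_count():
    """Hypothetical per-user daily comment count: mostly zero, long right tail."""
    if random.random() < 0.7:
        return 0
    return int(random.expovariate(1 / 4.0)) + 1


# One observation per sampled user (the unit of analysis is the user-day).
counts = [simulate_comment_count() for _ in range(5000)]

n = len(counts)
x_bar = mean(counts)                 # pillar 1: estimator of mean comments per user
se = stdev(counts) / math.sqrt(n)    # pillar 2: stdev uses the n - 1 denominator
z = NormalDist().inv_cdf(0.975)      # pillar 3: normal critical value, about 1.96
ci_low, ci_high = x_bar - z * se, x_bar + z * se

# Scaling to a total is only meaningful if the sample represents the population.
N = 1_000_000  # hypothetical population size
total_low, total_high = N * ci_low, N * ci_high

print(f"mean = {x_bar:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"total comments, 95% CI = [{total_low:,.0f}, {total_high:,.0f}]")
```

With $n$ in the thousands, the normal and t critical values are practically identical, so the choice between them only matters for small samples.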
One tradeoff to flag is whether to rely on the CLT or use a bootstrap. If the distribution has many zeros and a few extreme commenters, the bootstrap may better reflect uncertainty for medians, percentiles, or trimmed means, while the CLT is still usually reasonable for the mean at large scale. A strong close would be: “If I had more time, I’d inspect the histogram, top-user contribution, day-of-week effects, and whether the target is user-level average or platform-level total.”
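A user-level percentile bootstrap for a trimmed mean might look like the sketch below. The counts are made up for illustration, and `trimmed_mean` and `bootstrap_ci` are hypothetical helper names. The percentile method is one of several bootstrap interval constructions, chosen here only for simplicity.

```python
import random
from statistics import mean

random.seed(1)


def trimmed_mean(values, trim=0.01):
    """Mean after dropping the top `trim` fraction of values."""
    ordered = sorted(values)
    keep = len(ordered) - int(len(ordered) * trim)
    return mean(ordered[:keep])


def bootstrap_ci(values, statistic, n_boot=2000, level=0.95):
    """Percentile bootstrap CI, resampling whole users with replacement."""
    n = len(values)
    stats = sorted(
        statistic(random.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-user comment counts: many zeros, a few extreme commenters.
counts = [0] * 700 + [1] * 150 + [2] * 80 + [5] * 50 + [40] * 15 + [300] * 5

point = trimmed_mean(counts)
low, high = bootstrap_ci(counts, trimmed_mean)
print(f"trimmed mean = {point:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

Each element of `counts` is one user, so `random.choices` resamples users rather than individual comments, which keeps the analysis unit intact.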
A second angle
For the question “Apply sequential testing without p-hacking,” the same uncertainty concepts apply, but the main risk shifts from estimating one interval to controlling error under repeated decisions. A candidate should immediately ask how often results will be checked, whether the stopping rule is pre-registered, and whether the metric is primary or one of many guardrails. Instead of a fixed-horizon test, they should propose an alpha-spending plan such as Pocock or O’Brien–Fleming, or an always-valid method if continuous monitoring is operationally necessary. The transferable idea is that uncertainty statements are only valid under their design assumptions; changing the stopping rule after seeing data changes the meaning of the p-value. The product framing is also different: early stopping may save users from a harmful launch, but it usually costs power or requires stricter evidence.
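To make the cost of uncorrected peeking concrete, the following simulation runs A/A tests (no true effect) and compares two decision rules: reject if any of five interim looks crosses the fixed 5% threshold, versus reject only at the final look. It is a simplified sketch that assumes normally distributed outcomes with known unit variance. The function name `run_aa_test` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist

random.seed(2)

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96


def run_aa_test(n_per_arm=200, looks=5):
    """Simulate one A/A test with equally spaced looks.

    Returns (rejected_at_any_look, rejected_at_final_look).
    """
    step = n_per_arm // looks
    diff_sum = 0.0        # running sum of (treatment - control) differences
    rejected_any = False
    z = 0.0
    for i in range(1, n_per_arm + 1):
        diff_sum += random.gauss(0, 1) - random.gauss(0, 1)
        if i % step == 0:
            # Var(mean difference) = 2 / i when each arm has variance 1.
            z = (diff_sum / i) / math.sqrt(2 / i)
            rejected_any = rejected_any or abs(z) > Z_CRIT
    return rejected_any, abs(z) > Z_CRIT


n_sims = 2000
results = [run_aa_test() for _ in range(n_sims)]
peeking_fpr = sum(any_look for any_look, _ in results) / n_sims
fixed_fpr = sum(final for _, final in results) / n_sims
print(f"false-positive rate with peeking: {peeking_fpr:.3f}")
print(f"false-positive rate, fixed horizon: {fixed_fpr:.3f}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate is noticeably higher (theory gives roughly 14% for five equally spaced looks). Alpha-spending boundaries work by raising the threshold at each look so that the overall rate returns to 5%.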
Common pitfalls
Pitfall: Treating the CLT as “the data are normal.”
The CLT is about the sampling distribution of the mean, not the raw distribution of comments. A count distribution can be extremely skewed while the mean is approximately normal; a better answer says, “The user-level counts are not normal, but $\bar{X}$ may be approximately normal if $n$ is large and dependence is limited.”
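A quick simulation can show the distinction. The sketch below compares the skewness of raw simulated counts against the skewness of means of 400 users each. The generator `draw_count` and the sample sizes are hypothetical choices for illustration.

```python
import math
import random

random.seed(3)


def avg(values):
    values = list(values)
    return sum(values) / len(values)


def skewness(values):
    """Standardized third moment; 0 for a symmetric distribution."""
    m = avg(values)
    sd = math.sqrt(avg((v - m) ** 2 for v in values))
    return avg(((v - m) / sd) ** 3 for v in values)


def draw_count():
    """Hypothetical per-user comment count: mostly zero, long right tail."""
    if random.random() < 0.7:
        return 0
    return int(random.expovariate(1 / 4.0)) + 1


raw = [draw_count() for _ in range(20000)]
sample_means = [avg(draw_count() for _ in range(400)) for _ in range(2000)]

raw_skew = skewness(raw)
mean_skew = skewness(sample_means)
print(f"skewness of raw counts:   {raw_skew:.2f}")
print(f"skewness of sample means: {mean_skew:.2f}")
```

The raw counts stay heavily right-skewed no matter how many are drawn, while the sample means are close to symmetric. With heavier tails than this toy generator, a larger $n$ would be needed before the approximation becomes comfortable.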
Pitfall: Giving a formula without defining the unit of analysis.
Saying $SE = s / \sqrt{n}$ is incomplete if $n$ might mean comments, users, sessions, or conversations. Meta interviewers expect you to anchor metrics to entities like user-day, `DAU`, treatment arm, or labeled chatbot response; otherwise your standard error may be artificially small.
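The sketch below illustrates how the choice of $n$ changes the standard error when sessions from the same user are correlated. The data are simulated with a per-user baseline plus session noise. The Gaussian metric and all sizes are hypothetical, and every user has the same number of sessions so that the two point estimates coincide.

```python
import math
import random
from statistics import mean, stdev

random.seed(4)

# Hypothetical data: 500 users, 8 sessions each. Each user has a personal
# baseline, so sessions from the same user are positively correlated.
n_users, sessions_per_user = 500, 8
sessions_by_user = []
for _ in range(n_users):
    user_baseline = random.gauss(2.0, 1.0)
    sessions_by_user.append(
        [user_baseline + random.gauss(0, 1.0) for _ in range(sessions_per_user)]
    )

all_sessions = [x for user in sessions_by_user for x in user]
user_means = [mean(user) for user in sessions_by_user]

# Naive: treats all 4,000 sessions as independent observations.
naive_se = stdev(all_sessions) / math.sqrt(len(all_sessions))
# User-level: one aggregated observation per user, so n = 500.
user_se = stdev(user_means) / math.sqrt(n_users)

print(f"naive session-level SE: {naive_se:.4f}")
print(f"user-level SE:          {user_se:.4f}")
```

Both approaches give the same point estimate here, but the session-level standard error is substantially smaller because it counts correlated sessions as independent evidence. Aggregating to the unit that was sampled or randomized is the simplest fix.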
Pitfall: Confusing statistical significance with launch readiness.
A $p$-value below 0.05 does not mean the effect is large, causal under all conditions, or safe for all segments. Stronger communication pairs the estimate with confidence intervals, practical significance, guardrail metrics, and whether the test design controlled for peeking or multiple comparisons.
Connections
Interviewers may pivot from here to A/B testing, causal inference, multiple hypothesis correction, metric design, or model evaluation for ranking and chatbot systems. Be ready to discuss variance reduction methods like CUPED, heterogeneous treatment effects across cohorts, and how offline evaluation metrics connect to online user outcomes.
Further reading
- Trustworthy Online Controlled Experiments, by Kohavi, Tang, and Xu: Practical treatment of experimentation, metrics, power, and common online testing failure modes.
- All of Statistics, by Larry Wasserman: Concise reference for estimation, confidence intervals, hypothesis testing, bootstrap, and asymptotic inference.
- Sequential Analysis, by Abraham Wald: Classic foundation for sequential probability ratio testing and the logic behind valid early stopping.
Practice questions
- Compute sample size and test duration (Meta · Data Scientist · Onsite · Medium)
- Model comment count distribution and validate assumptions (Meta · Data Scientist · Onsite · Medium)
- Apply sequential testing without p-hacking (Meta · Data Scientist · Onsite · Hard)
- Estimate variance for ratio metrics (Meta · Data Scientist · Onsite · Hard)
- Diagnose a non-significant experiment outcome (Meta · Data Scientist · Onsite · Medium)
- Compute sample size and test duration correctly (Meta · Data Scientist · Technical Screen · Hard)
- Analyze DAU comments distribution and resampling (Meta · Data Scientist · Onsite · Medium)
- Model session times and comments with exponential/Poisson (Meta · Data Scientist · Onsite · Medium)
- Derive expected meetings given nonempty room (Meta · Data Scientist · Onsite · Medium)
- Estimate CTR lift with binomial tests and errors (Meta · Data Scientist · Onsite · Hard)
- Model comment counts and detect anomalies (Meta · Data Scientist · Onsite · Hard)
- Evaluate Marketing Campaign's Click-Through Rate Effectiveness (Meta · Data Scientist · Onsite · Medium)