Statistical Inference, Power, And Confidence Intervals

What's being tested

Interviewers are probing whether you can turn messy product or ads measurement problems into valid statistical inference: define the estimand, choose the right test or model, quantify uncertainty, and explain what the result means for a Pinterest decision. For a Data Scientist, this matters because experiments on ranking, recommendations, ads formats, creator surfaces, and notifications often have noisy metrics, heterogeneous users, multiple segments, and business pressure to ship quickly. You are expected to know when a simple two-sample t-test is sufficient, when low power or clustering makes that test unreliable, and when bias matters more than variance. Strong answers combine formulas with judgment: “Here is the test, here are the assumptions, here is how I would validate them, and here is how I would communicate the confidence interval and decision risk.”

Core knowledge

Null hypothesis testing starts with an estimand, not a test. For an A/B test, define $\Delta = E[Y \mid T=1] - E[Y \mid T=0]$, where $Y$ could be save_rate, CTR, revenue_per_user, or 7d_retention; then choose the inference method based on assignment unit, outcome type, and dependence.
Two-sample t-tests compare means using $t=\frac{\bar X_T-\bar X_C}{\sqrt{s_T^2/n_T+s_C^2/n_C}}.$ Prefer Welch’s t-test when variances or sample sizes differ; the equal-variance pooled t-test is rarely necessary in product analytics and can be brittle.
Confidence intervals communicate effect size and uncertainty better than p-values alone. A 95% CI for a difference in means is approximately $\hat\Delta \pm 1.96 \cdot SE(\hat\Delta)$; if the whole interval lies above the smallest effect that matters in practice, that is stronger evidence than merely saying p < 0.05.
Power analysis estimates whether the test can detect a meaningful effect. For equal-sized groups, approximate per-arm sample size is $n \approx \frac{2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2}{\delta^2},$ where $\delta$ is the minimum detectable effect. Halving MDE requires roughly 4x sample size.
Type I error is a false positive; Type II error is a false negative. Standard choices are $\alpha=0.05$ and power $1-\beta=0.8$, but Pinterest decisions may require stricter thresholds for high-risk surfaces like ranking changes that affect home feed engagement or advertiser spend.
Ratio metrics such as CTR = clicks / impressions and conversion_rate = conversions / visitors are not always simple Bernoulli means. If the denominator varies heavily by user, analyze at the randomized unit level, use the delta method, bootstrap, or regression with robust standard errors.
Variance reduction methods like CUPED use pre-period covariates to reduce standard error: $Y^* = Y - \theta(X-\bar X)$, with $\theta = \operatorname{Cov}(Y,X)/\operatorname{Var}(X)$. It is valid when the covariate is measured before treatment, so assignment cannot affect it, and it helps most when the covariate is strongly correlated with the outcome.
Sequential testing is dangerous if you repeatedly peek and stop when p < 0.05; this inflates false positives. Use pre-planned looks, alpha spending, group sequential designs, or always-valid methods rather than ad hoc daily decision-making.
Sample ratio mismatch occurs when observed treatment/control allocation deviates from expected randomization. For a 50/50 split, test counts using a chi-square goodness-of-fit test; SRM can signal logging bugs, eligibility differences, or assignment leakage, and should be investigated before trusting treatment effects.
Clustered data violates independent-observation assumptions when units influence each other or share shocks. For ads, households, campaigns, creators, or boards, use cluster-robust standard errors, randomize at the cluster level, or aggregate to the assignment unit before inference.
Survey inference requires weighting when sample composition differs from the target population. If a Pinterest user survey over-represents one gender, age group, or region, use post-stratification weights, report design effects, and distinguish sampling bias from true subgroup differences.
Causal lift studies must separate randomized from observational evidence. A Conversion Lift Study can estimate incremental conversions if randomization is clean; observational attribution needs assumptions like ignorability (no unmeasured confounding) and overlap when using matching or regression adjustment, or parallel trends when using difference-in-differences.
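
The Welch statistic and the approximate 95% CI from the list above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production analysis: the function name `welch_summary` and the toy inputs are made up for the example, and the p-value relies on a large-sample normal approximation instead of the Welch-Satterthwaite t distribution.

```python
import math
from statistics import NormalDist, mean, variance


def welch_summary(treatment, control, alpha=0.05):
    """Difference in means with an unpooled (Welch) standard error.

    Assumes one observation per randomized unit and samples large enough
    for the normal approximation to be reasonable.
    """
    n_t, n_c = len(treatment), len(control)
    delta_hat = mean(treatment) - mean(control)
    # Unpooled standard error: sqrt(s_T^2 / n_T + s_C^2 / n_C)
    se = math.sqrt(variance(treatment) / n_t + variance(control) / n_c)
    t_stat = delta_hat / se
    # Critical value is about 1.96 when alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (delta_hat - z_crit * se, delta_hat + z_crit * se)
    # Two-sided p-value under the normal approximation
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {"delta": delta_hat, "se": se, "t": t_stat, "ci": ci, "p": p_value}


# Toy data, purely illustrative
result = welch_summary([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
```

In an interview, the point of a sketch like this is to show that the interval and the test come from the same standard error, so any problem with that standard error (clustering, ratio metrics, heavy tails) affects both.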

Worked example

For “Design rigorous A/B test and causal analysis”, start by framing the first 30 seconds around the product decision: “What change are we testing, what is the primary metric, what is the randomization unit, and what launch decision will this support?” Then declare assumptions: users are independently assigned, treatment exposure is logged reliably, and the primary analysis will be intention-to-treat unless noncompliance is central. A strong answer can be organized into four pillars: experiment design, metric choice, power and duration, and validity checks.

For design, specify a user-level randomized controlled trial if the feature affects individual experience, but consider cluster randomization if there is interference through shared boards, creators, ads auctions, or social interactions. For metrics, choose one primary metric such as weekly_active_savers or save_rate, guardrails like hide_rate, session_length, or advertiser ROAS, and predefine segment cuts rather than fishing after results. For power, estimate baseline variance from historical data, choose an MDE that is product-relevant, and compute sample size before launch using the standard normal approximation or simulation for non-normal metrics.
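
The sample-size step can be made concrete with the standard normal approximation for two equal-sized arms, $n \approx 2(z_{1-\alpha/2}+z_{1-\beta})^2\sigma^2/\delta^2$. The sketch below is one way to compute it with the Python standard library; `sample_size_per_arm` is a hypothetical helper name, and the inputs (`sigma`, `mde`) are assumed to come from historical data and a product-driven MDE.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Approximate per-arm n for a two-sided, two-sample comparison of means.

    sigma: outcome standard deviation at the randomization-unit level
    mde:   minimum detectable effect, in the same units as the outcome
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return math.ceil(n)


n_base = sample_size_per_arm(sigma=1.0, mde=0.10)
n_half = sample_size_per_arm(sigma=1.0, mde=0.05)
```

With these illustrative inputs, `n_base` is 1570 and `n_half` is 6280, which shows the roughly 4x cost of halving the MDE. For heavy-tailed or ratio metrics, simulation on historical data is usually a safer check than this closed form.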

The key tradeoff to flag is speed versus validity: shorter tests reduce opportunity cost but may underpower long-term outcomes, miss weekday seasonality, or overreact to novelty effects. You should explicitly say you would run SRM checks, inspect pre-treatment balance, use CUPED if a strong pre-period covariate exists, and avoid unplanned peeking unless sequential testing was designed up front. Close with: “If I had more time, I would add heterogeneous treatment effect analysis for new versus tenured users and validate whether the online metric movement predicts longer-term retention or marketplace value.”
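
Two of the validity checks named above, the SRM check and the CUPED adjustment, are simple enough to sketch directly. The code below is a minimal standard-library illustration under stated assumptions: the helper names are hypothetical, the `0.001` SRM alert threshold is an assumed convention and not a fixed rule, and `theta` is estimated on whatever data is passed in (in practice it would typically be estimated on treatment and control pooled together).

```python
import math
from statistics import mean


def srm_check(n_treatment, n_control, expected_treatment_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) for allocation counts."""
    total = n_treatment + n_control
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_treatment - exp_t) ** 2 / exp_t + (n_control - exp_c) ** 2 / exp_c
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


def cuped_adjust(y, x):
    """Return (Y*, theta) with Y* = Y - theta * (X - mean(X)).

    x must be a pre-treatment covariate, e.g. the same metric in a pre-period.
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    theta = cov_yx / var_x  # the (n - 1) factors cancel in the ratio
    adjusted = [yi - theta * (xi - x_bar) for yi, xi in zip(y, x)]
    return adjusted, theta


# Illustrative counts: a 50,000 vs 51,500 split under an expected 50/50 allocation
chi2, p_value, srm_flag = srm_check(50_000, 51_500)
```

The adjusted outcome keeps the same mean as the original, so the treatment effect estimate stays comparable; only the variance shrinks when the covariate is informative.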

A second angle

For “Analyze survey with gender imbalance”, the same inference ideas apply, but the main threat shifts from randomization variance to representativeness and bias. Instead of asking whether treatment and control are balanced by design, ask whether the survey sample matches the Pinterest population you want to generalize to across gender, age, geography, device, or engagement level. A naive mean from the survey may be precise but biased if one group is overrepresented and has systematically different responses. The answer should discuss post-stratification or inverse-probability weighting, weighted confidence intervals, and the increased variance from unequal weights. The close should separate descriptive claims about respondents from inferential claims about the broader user base.
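
The weighting argument can be illustrated with a small post-stratification sketch. It assumes the population shares per group are known from an external source, uses Kish's approximation for the design effect from unequal weights, and runs on made-up data; `poststratified_mean` and the group labels are hypothetical.

```python
def poststratified_mean(responses, groups, population_share):
    """Weighted mean with weights = population share / sample share per group.

    Returns (estimate, design_effect, effective_sample_size), where the
    design effect is Kish's approximation n * sum(w^2) / (sum w)^2.
    """
    n = len(responses)
    counts = {}
    for g in groups:
        counts[g] = counts.get(g, 0) + 1
    # Each respondent is up- or down-weighted to match the target population
    weights = [population_share[g] / (counts[g] / n) for g in groups]
    w_sum = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, responses)) / w_sum
    design_effect = n * sum(w * w for w in weights) / w_sum ** 2
    return estimate, design_effect, n / design_effect


# Made-up survey: group "a" is 80% of respondents but 50% of the population
responses = [1] * 8 + [0] * 2
groups = ["a"] * 8 + ["b"] * 2
estimate, deff, n_eff = poststratified_mean(responses, groups, {"a": 0.5, "b": 0.5})
```

In this toy case the naive mean is 0.8 while the weighted estimate is 0.5, and the effective sample size drops from 10 to 6.4. That is the tradeoff to name in the interview: weighting reduces bias from the imbalance but widens the confidence interval.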

Common pitfalls

Pitfall: Choosing a test by metric name instead of data-generating process.

A tempting answer is “use a t-test for means and a z-test for proportions” without asking about randomization unit, denominator variation, clustering, or sample size. A better answer says: “If the unit-level metric is approximately independent and sample size is large, Welch’s t-test is fine; if this is a ratio, clustered, or heavy-tailed metric, I would use unit-level aggregation, robust standard errors, bootstrap, or a transformation.”
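
For the ratio-metric case, one way to respect the randomization unit is a delta-method standard error computed over per-user totals. The sketch below assumes users are the independent units and that each list entry is one user's numerator (for example clicks) and denominator (for example impressions); the function name is hypothetical and the data is illustrative.

```python
import math
from statistics import mean


def ratio_with_delta_se(numerators, denominators):
    """Ratio of sums with a delta-method (linearization) standard error.

    Each position i holds one randomized unit's numerator and denominator.
    """
    n = len(numerators)
    mean_num, mean_den = mean(numerators), mean(denominators)
    ratio = mean_num / mean_den
    # Linearized per-unit contribution: (num_i - ratio * den_i) / mean(den)
    residuals = [(a - ratio * b) / mean_den for a, b in zip(numerators, denominators)]
    # These residuals sum to zero by construction, so no centering is needed
    var_res = sum(r * r for r in residuals) / (n - 1)
    return ratio, math.sqrt(var_res / n)


# Illustrative per-user clicks and impressions
ctr, ctr_se = ratio_with_delta_se([0, 1, 1, 4], [10, 10, 10, 10])
```

When every user has the same denominator, this reduces to the ordinary standard error of a mean; the two diverge as denominators vary across users, which is exactly when a naive impression-level z-test tends to understate uncertainty.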

Pitfall: Treating statistical significance as the launch decision.

Saying “p-value is below 0.05, so ship” misses practical significance, guardrails, power, novelty effects, and multiple testing. Interviewers want to hear whether the CI excludes effects that are too small to matter, whether business or user risk is asymmetric, and whether the result is stable across pre-specified segments.
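
The difference between statistical and practical significance can be shown with a small, purely illustrative decision helper. The labels and the rule itself are assumptions made for this sketch, not a standard procedure; it assumes a metric where larger is better and a practical threshold agreed before the test.

```python
def interpret_ci(ci_low, ci_high, practical_threshold):
    """Classify a confidence interval for a lift where larger is better.

    practical_threshold is the smallest effect worth shipping for,
    chosen before looking at results (an assumption of this sketch).
    """
    if ci_low >= practical_threshold:
        return "clears practical threshold"
    if ci_high <= 0:
        return "no benefit or harmful"
    if ci_low > 0:
        return "positive but possibly too small to matter"
    return "inconclusive"


# Statistically significant (interval excludes zero) yet not clearly worth shipping
label = interpret_ci(0.001, 0.030, practical_threshold=0.010)
```

A result in the third category would have p < 0.05, which is why "significant, so ship" is an incomplete answer; guardrails and segment stability still need to be checked separately.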

Pitfall: Over-explaining formulas while under-explaining assumptions.

It is useful to know the t-statistic, but a DS interview is not only a math exam. Spend as much time on what could invalidate the estimate as on the formula itself: SRM, interference, logging gaps, noncompliance, selection bias, survey nonresponse, repeated peeking, and post-hoc segment mining.

Connections

This topic often pivots into causal inference, especially difference-in-differences, instrumental variables, matching, and regression adjustment when randomization is unavailable. It also connects to metric design, recommender-system evaluation, ads incrementality, and Bayesian experimentation, where the same uncertainty concepts appear with different decision rules.
