Experiment Diagnostics, Power And Robust Inference

What's being tested

Interviewers are probing whether you can treat experiment results as statistical evidence, not just dashboard deltas. For a TikTok Data Scientist, this means diagnosing whether an observed change in CTR, conversion_rate, retention, watch_time, or posting behavior is causal, sufficiently powered, correctly randomized, and robust to known failure modes. You are expected to know when a simple two-sample test is valid, when clustered users or repeated looks break assumptions, and how to communicate uncertainty to product and engineering partners. The strongest answers combine inference, metric intuition, and practical experiment diagnostics.

Core knowledge

Difference-in-proportions testing is the default for binary outcomes such as conversion, activation, or retention. For treatment rate $\hat p_T$ and control rate $\hat p_C$, use
$SE=\sqrt{\frac{\hat p_T(1-\hat p_T)}{n_T}+\frac{\hat p_C(1-\hat p_C)}{n_C}}$
and test $(\hat p_T-\hat p_C)/SE$ under the large-sample normal approximation.
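The calculation above can be sketched in a few lines of Python using only the standard library. The function name `two_proportion_ztest` and the example counts are illustrative assumptions; the sketch uses the unpooled standard error given above and a two-sided alternative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x_t, n_t, x_c, n_c):
    """Two-sided z-test for a difference in proportions (unpooled SE)."""
    p_t = x_t / n_t
    p_c = x_c / n_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, se, z, p_value


# Hypothetical counts: 1,100 of 10,000 treated users convert vs 1,000 of 10,000 controls.
diff, se, z, p_value = two_proportion_ztest(1100, 10_000, 1000, 10_000)
print(f"diff={diff:.4f} se={se:.5f} z={z:.2f} p={p_value:.4f}")
```

The sketch assumes one independent row per randomized user; it is not valid as written for event-level or clustered data.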
One-sample proportion tests compare an observed rate $\hat p$ against a benchmark $p_0$, such as testing whether campaign conversion exceeds 60%. Use $SE=\sqrt{p_0(1-p_0)/n}$ for the null test, while confidence intervals often use $SE=\sqrt{\hat p(1-\hat p)/n}$ or Wilson intervals for better coverage.
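A short sketch of both pieces, assuming a one-sided alternative ("rate exceeds $p_0$") for the test and the standard Wilson score formula for the interval. The function names and the 66-of-100 example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_ztest_greater(successes, n, p0):
    """One-sided z-test of H0: p = p0 vs H1: p > p0, using the null SE."""
    p_hat = successes / n
    se_null = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    return z, 1 - NormalDist().cdf(z)


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical campaign: 66 conversions out of 100 users, benchmark 60%.
print(one_sample_ztest_greater(66, 100, 0.60))
print(wilson_interval(66, 100))
```

Unlike the Wald interval, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed rate is 0 or 1.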
Power analysis asks whether the experiment could detect a practically meaningful effect. For a two-arm test with two-sided level $\alpha$ and power $1-\beta$, the approximate required sample size per group scales as
$n \approx \frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{\Delta^2}$
where $\sigma^2$ is the per-unit outcome variance and $\Delta$ is the minimum detectable effect, so halving the minimum detectable effect requires roughly 4x more sample.
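The formula translates directly into code. This is a sketch under the same normal approximation; `sample_size_per_group` is a hypothetical helper, and for a binary metric it assumes $\sigma^2 \approx p(1-p)$ at the baseline rate.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma2, delta, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-arm test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta**2)


# Hypothetical baseline conversion of 10%, so sigma^2 is about 0.1 * 0.9.
baseline = 0.10
sigma2 = baseline * (1 - baseline)
n_1pt = sample_size_per_group(sigma2, delta=0.01)    # detect +1 percentage point
n_half = sample_size_per_group(sigma2, delta=0.005)  # detect +0.5 percentage points
print(n_1pt, n_half, n_half / n_1pt)
```

The ratio printed at the end illustrates the inverse-square relationship between sample size and minimum detectable effect.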
Clustered randomization matters when assignment happens at creator, region, school, household, device group, or market level instead of user level. Observations inside a cluster are correlated, reducing independent information. Design effect is commonly approximated as $DE=1+(m-1)\rho$, where $m$ is average cluster size and $\rho$ is intra-cluster correlation.
Effective sample size under clustering is roughly $n_{\text{eff}}=n/DE$. If 1M events come from highly correlated clusters, the inferential sample size may be closer to thousands of clusters than millions of rows. The unit of analysis should align with the unit of randomization.
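As a quick arithmetic check, the two formulas can be wrapped in small helpers. The cluster size and intra-cluster correlation below are made-up values chosen only to show how fast the effective sample size shrinks.

```python
def design_effect(avg_cluster_size, icc):
    """DE = 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n_rows, avg_cluster_size, icc):
    """n_eff = n / DE."""
    return n_rows / design_effect(avg_cluster_size, icc)


# Hypothetical: 1M rows, 50 rows per cluster on average, ICC of 0.1.
de = design_effect(50, 0.1)
n_eff = effective_sample_size(1_000_000, 50, 0.1)
print(f"DE={de:.1f}, n_eff={n_eff:,.0f}")
```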
Cluster-robust standard errors estimate uncertainty while allowing arbitrary correlation within clusters. In practice, you can aggregate to cluster-level outcomes or use sandwich estimators, but you need enough clusters; with fewer than roughly 30–50 clusters, use caution, small-sample corrections, or randomization inference.
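One way to sketch the randomization-inference option is a Monte Carlo permutation test on cluster-level outcomes. The code assumes outcomes are already aggregated to one value per cluster, that clusters were completely randomized with a fixed number treated, and that `cluster_permutation_test` and the example data are hypothetical.

```python
import random
from statistics import mean


def cluster_permutation_test(cluster_means, assignments, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference of cluster-level means.

    cluster_means: one aggregated outcome per cluster.
    assignments:   1 for treated clusters, 0 for control clusters.
    """
    rng = random.Random(seed)

    def diff(labels):
        treated = [y for y, a in zip(cluster_means, labels) if a == 1]
        control = [y for y, a in zip(cluster_means, labels) if a == 0]
        return mean(treated) - mean(control)

    observed = diff(assignments)
    labels = list(assignments)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # re-randomize which clusters count as treated
        if abs(diff(labels)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical conversion rates for 12 regions, 6 control then 6 treated.
rates = [0.10, 0.11, 0.09, 0.12, 0.10, 0.11, 0.20, 0.21, 0.19, 0.22, 0.20, 0.21]
arms = [0] * 6 + [1] * 6
print(cluster_permutation_test(rates, arms))
```

Because the test reshuffles cluster labels rather than rows, its reference distribution reflects the number of clusters, not the number of events.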
Sequential testing handles repeated peeking at experiment results. If you check daily and stop the first time $p<0.05$, the overall false positive rate rises above the nominal 5%, and it keeps growing with the number of looks. Use alpha-spending rules such as O’Brien-Fleming for conservative early looks or Pocock for more even thresholds across interim analyses.
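To make the contrast concrete, here is a sketch of the Lan-DeMets spending functions that approximate the two rules, evaluated at a hypothetical schedule of four equally spaced looks. It reports cumulative alpha spent at each information fraction $t$; converting that into exact per-look critical values requires accounting for the correlation between looks, which this sketch does not attempt.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obf_alpha_spent(t, alpha=0.05):
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{1-alpha/2} / sqrt(t))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_alpha_spent(t, alpha=0.05):
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    return alpha * log(1 + (e - 1) * t)


# Hypothetical schedule: looks at 25%, 50%, 75%, and 100% of planned sample.
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t={t:.2f} OBF={obf_alpha_spent(t):.5f} Pocock={pocock_alpha_spent(t):.5f}")
```

Both functions spend the full alpha by $t=1$; the O'Brien-Fleming-type rule spends almost nothing early, which is why early stopping under it requires very strong evidence.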
Sample ratio mismatch is a core randomization diagnostic. If the planned allocation is 50/50 but observed traffic is 53/47, test counts with a chi-square goodness-of-fit test before interpreting effects. SRM can indicate eligibility bugs, logging gaps, treatment leakage, or platform-specific exposure differences.
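A minimal two-arm SRM check can be written with the standard library alone, because the chi-square survival function with one degree of freedom reduces to `erfc(sqrt(chi2 / 2))`. The function name `srm_check` and the counts are illustrative assumptions.

```python
from math import erfc, sqrt


def srm_check(n_t, n_c, expected_share_t=0.5):
    """Chi-square goodness-of-fit test for a two-arm allocation (1 df)."""
    total = n_t + n_c
    exp_t = total * expected_share_t
    exp_c = total - exp_t
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df
    return chi2, p_value


# Hypothetical 53/47 split on 100,000 users when 50/50 was planned.
print(srm_check(53_000, 47_000))
# A 50.1/49.9 split on the same volume is consistent with chance.
print(srm_check(50_100, 49_900))
```

Teams often use a strict threshold for SRM alerts (for example, a p-value well below 0.05), since the check is run on many experiments and a true mismatch usually produces an extremely small p-value.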
Right-censoring appears in cohort behavior when some users have had less time to produce an outcome, such as 7-day posting after signup. Avoid comparing incomplete windows directly. Use fixed mature cohorts, survival analysis, or clearly defined observation windows such as posts_within_7d.
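The fixed-mature-cohort approach can be sketched as a filter applied before computing the metric. The record layout (`signup_date`, `post_dates`) and the helper names are hypothetical, and the sketch assumes data is complete up to the `as_of` date.

```python
from datetime import date, timedelta


def is_mature(signup_date, as_of, window_days=7):
    """True once the user's full observation window has elapsed by as_of."""
    return signup_date + timedelta(days=window_days) <= as_of


def posted_within_window(signup_date, post_dates, window_days=7):
    """True if any post falls in [signup_date, signup_date + window_days)."""
    deadline = signup_date + timedelta(days=window_days)
    return any(signup_date <= d < deadline for d in post_dates)


def mature_posting_rate(users, as_of, window_days=7):
    """Share of mature users with at least one post inside their window."""
    mature = [u for u in users if is_mature(u["signup_date"], as_of, window_days)]
    if not mature:
        return None
    hits = sum(
        posted_within_window(u["signup_date"], u["post_dates"], window_days)
        for u in mature
    )
    return hits / len(mature)


users = [
    {"signup_date": date(2024, 1, 1), "post_dates": [date(2024, 1, 3)]},
    {"signup_date": date(2024, 1, 1), "post_dates": [date(2024, 1, 10)]},  # posted too late
    {"signup_date": date(2024, 1, 8), "post_dates": []},  # window still open
]
print(mature_posting_rate(users, as_of=date(2024, 1, 10)))
```

The third user is excluded rather than counted as a non-poster, which is the distinction between a censored observation and a true zero.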
Heavy-tailed metrics like watch_time, revenue_per_user, comments, shares, or posts per creator often violate normal assumptions at the user level. Robust choices include winsorization, log transforms, bootstrap confidence intervals, quantile metrics, or pre-registered capped means.
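As an illustration of two of these options together, the sketch below computes a capped mean and a percentile bootstrap confidence interval. The cap value, the synthetic `watch` data, and the helper names are assumptions; in practice the cap would be fixed before looking at results.

```python
import random


def capped_mean(values, cap):
    """Mean after capping each value at a pre-registered upper bound."""
    return sum(min(v, cap) for v in values) / len(values)


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic of one sample."""
    rng = random.Random(seed)
    n = len(values)
    draws = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = draws[int(alpha / 2 * n_boot)]
    hi = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic heavy-tailed sample: most users near 1 minute, a few extreme values.
watch = [1.0] * 95 + [500.0] * 5

raw_ci = bootstrap_ci(watch, lambda v: sum(v) / len(v))
capped_ci = bootstrap_ci(watch, lambda v: capped_mean(v, cap=10.0))
print("raw mean CI:   ", raw_ci)
print("capped mean CI:", capped_ci)
```

Capping trades a small, known bias for a large reduction in variance; the resampling unit here should be the randomization unit (users, not events).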
Metric denominator discipline prevents misleading effects. conversion_rate can mean conversions per impression, per exposed user, per eligible user, or per session. Always identify numerator, denominator, eligibility, attribution window, and whether repeated exposures are counted.
Heterogeneity and segmentation are diagnostic tools, not fishing licenses. Segment by pre-specified dimensions such as country, app version, traffic source, device, creator size, or new versus existing users. Use interaction tests or hierarchical thinking instead of declaring wins from noisy subgroup deltas.

Worked example

For “Diagnose Traffic Allocation in A/B Test Results”, a strong candidate would start by clarifying the intended randomization unit, planned split, exposure definition, eligibility rules, ramp schedule, and whether the metric is based on assigned users or actually exposed users. In the first 30 seconds, say: “Before estimating treatment impact, I’d verify randomization integrity and data consistency, because allocation bias can invalidate causal interpretation.”

The answer can be organized into four pillars: first, check sample ratio mismatch by comparing observed assignment counts to expected allocation using a chi-square test; second, inspect whether mismatch is concentrated by country, device_os, app_version, traffic source, or experiment ramp date; third, compare pre-treatment covariates and historical metrics across arms to detect imbalance; fourth, evaluate whether exposure, logging, or eligibility definitions create post-randomization filtering.

A good candidate would distinguish between assignment-based analysis and exposure-based analysis. Assignment-based intent-to-treat preserves randomization but may dilute effects if many assigned users were never exposed. Exposure-based analysis can answer a product adoption question but may be biased if exposure itself is affected by treatment or user behavior.

One explicit tradeoff to flag: pausing analysis to debug SRM protects causal validity, but if the experiment is on a critical launch path, you may provide clearly labeled descriptive readouts while withholding causal claims. The close should be pragmatic: “If I had more time, I’d audit pre-period balance, run the same diagnostics on guardrail metrics like crash_rate and session_starts, and compare results across mature cohorts before recommending launch or rollback.”

A second angle

For “Compute cluster-aware significance and sequential corrections”, the same diagnostic mindset applies, but the main threat is not just bad allocation; it is invalid uncertainty. If the experiment randomizes by market or creator cluster but the analyst tests millions of user rows as independent, the $p$-value will be artificially small. A strong answer identifies the randomization unit, estimates or reasons about the intra-cluster correlation, applies a design effect or cluster-robust standard error, and then adjusts thresholds for interim looks using an alpha-spending rule. The framing shifts from “is the treatment assignment trustworthy?” to “is the evidence calibrated after dependency and repeated monitoring?”
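A back-of-the-envelope version of the first correction inflates the naive standard error by $\sqrt{DE}$, which is equivalent to dividing the naive z-statistic by $\sqrt{DE}$. The sketch below assumes roughly equal cluster sizes and a known intra-cluster correlation; the function name and inputs are hypothetical, and a cluster-robust estimator would be preferable when row-level data is available.

```python
from math import sqrt
from statistics import NormalDist


def design_effect_adjusted_test(naive_z, avg_cluster_size, icc):
    """Deflate a row-level z-statistic by the square root of the design effect."""
    de = 1 + (avg_cluster_size - 1) * icc
    z_adj = naive_z / sqrt(de)
    p_adj = 2 * (1 - NormalDist().cdf(abs(z_adj)))
    return de, z_adj, p_adj


# Hypothetical: naive z of 4.0 from row-level analysis, 50 rows per cluster, ICC 0.1.
naive_p = 2 * (1 - NormalDist().cdf(4.0))
de, z_adj, p_adj = design_effect_adjusted_test(4.0, 50, 0.1)
print(f"naive p={naive_p:.6f}  DE={de:.1f}  adjusted z={z_adj:.2f}  adjusted p={p_adj:.3f}")
```

In this made-up case a result that looks overwhelming at the row level is no longer significant at the 5% level once the dependency is acknowledged, and that is before any interim-look adjustment.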

Common pitfalls

Pitfall: Treating row count as sample size when rows are correlated.

A tempting answer is: “We have 10 million events, so the test is highly powered.” That can be wrong if the experiment was randomized by 200 regions or creators, or if each user contributes many events. A better answer anchors inference to the randomization unit and discusses cluster-robust or user-level aggregation.

Pitfall: Reporting a significant lift before checking diagnostics.

Saying “treatment increased conversion_rate by 2%, $p<0.05$, ship it” is incomplete. Interviewers expect you to check SRM, pre-period balance, logging consistency, novelty effects, metric denominators, and guardrails before making a causal recommendation. The analytical maturity is in knowing when not to trust a clean-looking result.

Pitfall: Overcomplicating the answer without product interpretation.

A purely mathematical answer full of formulas but no decision framing can miss the Data Scientist role. For TikTok-style experiments, connect the inference back to launch decisions: expected user impact, risk to retention or watch_time, robustness across major segments, and whether the observed effect is practically meaningful, not merely statistically significant.

Connections

Interviewers may pivot from here into causal inference, especially intent-to-treat versus treatment-on-the-treated, CUPED variance reduction, difference-in-differences, or interference/network effects. They may also ask about metric design, cohort retention analysis, anomaly diagnosis after a release, or robust evaluation of recommender and ranking changes.
