Statistical Test for Comparing Mean Active Minutes
You have two weeks of experiment data for a new algorithm. The primary metric is user active minutes. Each user is assigned to control or treatment for the full duration.
Analyze the per-user total or average active minutes over the two-week window.
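If the raw data arrive as repeated daily rows, they can be collapsed to one value per user before any test is run. The sketch below is one minimal way to do that; the row layout `(user_id, arm, day, active_minutes)` and the name `aggregate_per_user` are assumptions made for illustration, not part of the original question.

```python
from collections import defaultdict

# Hypothetical daily log rows: (user_id, arm, day, active_minutes).
daily_rows = [
    ("u1", "control", 1, 12.0),
    ("u1", "control", 2, 8.0),
    ("u2", "treatment", 1, 30.0),
    ("u2", "treatment", 3, 5.5),
    ("u3", "control", 2, 0.0),
]


def aggregate_per_user(rows):
    """Collapse daily rows to one total per user, grouped by arm."""
    totals = defaultdict(float)
    arm_of = {}
    for user_id, arm, _day, minutes in rows:
        # Each user is expected to stay in one arm for the full duration.
        if arm_of.setdefault(user_id, arm) != arm:
            raise ValueError(f"user {user_id} appears in more than one arm")
        totals[user_id] += minutes

    by_arm = defaultdict(list)
    for user_id, total in totals.items():
        by_arm[arm_of[user_id]].append(total)
    return dict(by_arm)


per_user = aggregate_per_user(daily_rows)
print(per_user)
```

Users who were assigned but never logged activity would be missing from a log like this. In practice they would likely need to be joined in from the assignment table with a total of zero so that the analysis population matches the randomized population.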
Constraints & Assumptions
- The unit of analysis should match the randomization unit.
- Active minutes are nonnegative and likely right-skewed.
- Explain assumptions and validation checks.
- Discuss multiple comparisons for five secondary metrics.
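For the five secondary metrics, one possible adjustment is the Holm-Bonferroni step-down procedure, which controls the family-wise error rate. The sketch below assumes each metric has already produced a p-value; the function name `holm_bonferroni` and the p-values are hypothetical and purely illustrative.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared to alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values are kept
    return reject


# Hypothetical p-values for five secondary metrics (illustrative only).
secondary_p = [0.004, 0.030, 0.012, 0.200, 0.047]
decisions = holm_bonferroni(secondary_p)
print(decisions)
```

With these made-up inputs only the two smallest p-values survive the adjustment, even though four of the five are below an unadjusted 0.05.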
Clarifying Questions to Ask
- Are users randomized independently?
- Do we have one row per user or repeated daily rows?
- Are group sizes and variances equal?
- Is the business interested in means, medians, or distributional effects?
What a Strong Answer Covers
- Recommended test: Welch's two-sample t-test on per-user aggregated active minutes to compare means.
- Alternatives: a permutation test for the mean difference, bootstrap confidence intervals, Mann-Whitney U as a robustness check for distributional shift (it tests stochastic ordering and matches a median comparison only under a pure location shift), and cluster-robust standard errors or mixed-effects models for repeated daily data.
- Assumptions: stable randomization, independent users, no major interference between users, a unit of analysis that matches the randomization unit, a sample large enough for the CLT to make the difference in means approximately normal despite the skew, and no severe data-quality issues.
- Validation: sample ratio mismatch (SRM) test, covariate balance, histograms and tail inspection, variance comparison, time series by arm, and instrumentation checks.
- p-value and 95% CI: computed from the mean difference, its standard error, and the Welch-Satterthwaite degrees of freedom, followed by a business interpretation.
- Type I and Type II errors explained in the product context.
- Multiple comparisons: adjust with Holm-Bonferroni or Benjamini-Hochberg, or rely on a pre-specified metric hierarchy.
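The Welch quantities named above (mean difference, standard error, Welch-Satterthwaite degrees of freedom) can be computed directly from the per-user values. The sketch below uses only the Python standard library and, as a labeled assumption, replaces the t reference distribution with a standard normal for the p-value and interval, which is reasonable only when both arms are large. The function name `welch_summary` and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_summary(control, treatment, alpha=0.05):
    """Mean difference, Welch standard error, Welch-Satterthwaite df, p-value, CI."""
    n_c, n_t = len(control), len(treatment)
    diff = mean(treatment) - mean(control)

    # Variance of each arm's sample mean: sample variance divided by n.
    v_c = variance(control) / n_c
    v_t = variance(treatment) / n_t
    se = math.sqrt(v_c + v_t)
    t_stat = diff / se

    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v_c + v_t) ** 2 / (v_c ** 2 / (n_c - 1) + v_t ** 2 / (n_t - 1))

    # ASSUMPTION: large samples, so a standard normal stands in for the
    # t distribution with `df` degrees of freedom.
    z = NormalDist()
    p_value = 2 * (1 - z.cdf(abs(t_stat)))
    crit = z.inv_cdf(1 - alpha / 2)
    ci = (diff - crit * se, diff + crit * se)

    return {"diff": diff, "se": se, "t": t_stat, "df": df,
            "p_value": p_value, "ci": ci}


# Tiny made-up per-user totals, used only to exercise the arithmetic.
control = [10, 12, 14, 16, 18]
treatment = [13, 15, 17, 19, 21]
result = welch_summary(control, treatment)
print(result)
```

With arms this small the normal approximation is not appropriate; an exact t-based p-value is available from `scipy.stats.ttest_ind(treatment, control, equal_var=False)` if SciPy is installed. For interpretation, the confidence interval on the mean difference in minutes per user is usually more useful to the business than the p-value alone.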
Follow-up Questions
- Why use Welch's t-test instead of Student's t-test?
- What if one user has extreme minutes?
- How would you analyze daily repeated measurements?
- How would you communicate a statistically significant but tiny effect?