Scenario
A search feature marks a user session as a success only if both the relevancy and accuracy binary flags equal 1.
Two ranking models were A/B tested independently with equal traffic:
- Model A: 100 users, 90 successes
- Model B: 100 users, 85 successes
Task
Using only these data, determine whether you can conclude that Model A is better than Model B at a 5% significance level.
Provide:
- The appropriate statistical test and hypotheses.
- The pooled variance, standard error, z-statistic, and p-value.
- A 95% confidence interval for the difference in success rates.
- An interpretation (statistical and practical significance).
- A brief comment on statistical power and the sample size needed to detect a 5 percentage-point lift.
Hints
- Use a two-proportion z-test (pooled standard error for the hypothesis test).
- For the 95% CI, use the unpooled standard error.
- Comment on power: whether n=100 per arm is enough to detect a 5 pp difference.
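
The hints above can be sketched in code. The following is a minimal illustration using only the Python standard library, not a definitive solution. The function names (`two_proportion_ztest`, `diff_confidence_interval`, `approx_power`, `n_per_arm`) are hypothetical helpers introduced here. The power and sample-size formulas use the usual normal approximation for a two-sided test, and the 80% power target is an assumption, since the problem does not state one.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test of H0: p_A = p_B."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    pooled_var = p_pool * (1 - p_pool)  # per-observation variance under H0
    se_pooled = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se_pooled
    return {
        "diff": p_a - p_b,
        "pooled_var": pooled_var,
        "se_pooled": se_pooled,
        "z": z,
        "p_two_sided": 2 * (1 - STD_NORMAL.cdf(abs(z))),
        "p_one_sided": 1 - STD_NORMAL.cdf(z),  # H1: p_A > p_B
    }


def diff_confidence_interval(x_a, n_a, x_b, n_b, level=0.95):
    """Wald interval for p_A - p_B using the unpooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    diff = p_a - p_b
    return diff - z_crit * se, diff + z_crit * se


def approx_power(p_a, p_b, n, alpha=0.05):
    """Approximate two-sided power with n users per arm (far tail ignored)."""
    p_bar = (p_a + p_b) / 2
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf((abs(p_a - p_b) - z_crit * se_null) / se_alt)


def n_per_arm(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect p_a vs p_b (two-sided)."""
    p_bar = (p_a + p_b) / 2
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    )
    return math.ceil(numerator ** 2 / (p_a - p_b) ** 2)


result = two_proportion_ztest(90, 100, 85, 100)
ci_low, ci_high = diff_confidence_interval(90, 100, 85, 100)
power_at_100 = approx_power(0.90, 0.85, n=100)
required_n = n_per_arm(0.90, 0.85)

print(result)
print(ci_low, ci_high)
print(power_at_100, required_n)
```

With the data in the scenario, this gives a z-statistic of roughly 1.07 and a two-sided p-value of roughly 0.29, so the null hypothesis of equal success rates is not rejected at the 5% level. The 95% interval for the difference runs from about -4 to +14 percentage points and therefore includes zero. Under the stated approximation, power at n=100 per arm is below 20%, and detecting a 5 pp lift with 80% power would take several hundred users per arm (close to 700), so the test as run is underpowered rather than evidence that the models are equivalent.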