Evaluating Good-Response Rates for Chatbot Outputs
Context
You are evaluating chatbot/LLM responses. Treat each response as a Bernoulli trial (good vs not good). Unless otherwise noted, assume independence across responses.
Questions
- If the probability that any given response is good is x, and the first three responses were good, what is the probability that the fourth response will be good?
- Model A shows 70% good responses and Model B shows 80% good responses, based on n1 and n2 evaluated responses, respectively. Using a two-proportion z-test at a 5% significance level:
  - Test whether the difference is statistically significant.
  - Report the test statistic and p-value.
  - If n1 and n2 are not provided, show the general formula and then illustrate with an example (e.g., n1 = n2 = 100).
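One way to sketch the two-proportion z-test, assuming a pooled standard error and a two-sided alternative (the question fixes neither choice): with pooled proportion p = (p1*n1 + p2*n2) / (n1 + n2), the statistic is z = (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2)). The function name `two_proportion_z_test` is hypothetical, and the sample sizes n1 = n2 = 100 are the illustrative values suggested in the question, not observed data.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-proportion z-test (hypothetical helper).

    Returns (z, two-sided p-value). Assumes independent Bernoulli trials
    and samples large enough for the normal approximation.
    """
    # Pooled proportion under H0: both models share one good-response rate.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference in proportions under H0.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative example from the question: n1 = n2 = 100 (assumed sample sizes).
z, p = two_proportion_z_test(0.70, 100, 0.80, 100)
print(f"z = {z:.3f}, p = {p:.4f}, significant at 5%: {p < 0.05}")
```

With these assumed sample sizes, z is about -1.63 and the two-sided p-value is about 0.10, so the 10-point difference would not be significant at the 5% level. The same observed rates can become significant with larger n1 and n2, which is why the sample sizes matter.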