Chatbot Evaluation: Honesty and Relevance
Scenario
You are evaluating a customer-service chatbot. Define two events for any given answer:
- H: the answer is honest, with P(H) = 0.7
- R: the answer is relevant, with P(R) = 0.8
Assume H and R are independent.
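Under the independence assumption, each joint probability is the product of the corresponding marginals, for example P(H ∩ R) = P(H) · P(R). The following is a minimal Python sketch that enumerates all four honest/relevant combinations; the names `joint_table`, `P_H`, and `P_R` are illustrative and not part of the original scenario.

```python
from itertools import product

P_H = 0.7  # P(answer is honest), from the scenario
P_R = 0.8  # P(answer is relevant), from the scenario


def joint_table(p_h, p_r):
    """Return P(H=h, R=r) for all four outcomes, assuming H and R are independent."""
    table = {}
    for h, r in product([True, False], repeat=2):
        ph = p_h if h else 1 - p_h  # marginal for this honesty outcome
        pr = p_r if r else 1 - p_r  # marginal for this relevance outcome
        table[(h, r)] = ph * pr     # independence: joint = product of marginals
    return table


table = joint_table(P_H, P_R)
for (h, r), p in table.items():
    print(f"honest={h!s:5} relevant={r!s:5} P={p:.2f}")
```

The four entries sum to 1, and multiplying any entry by the number of logged answers gives the expected count for that combination.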
Questions
- What is the probability that an answer is both honest and relevant, P(H ∩ R)?
- Given logs of 1,000 answers, how many would you expect to be neither honest nor relevant?
- Describe how to run a hypothesis test to compare two LLMs’ relevance rates at significance level α = 0.05.
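One common way to compare two relevance rates is a two-sided, pooled two-proportion z-test with null hypothesis H0: p1 = p2. The sketch below uses only the Python standard library. The function name `two_proportion_z_test` and the counts (820 and 780 relevant answers out of 1,000 each) are hypothetical values chosen for illustration, not data from the scenario. It also assumes independent samples that are large enough for the normal approximation to hold.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 == p2.

    x1, x2: number of relevant answers from each model
    n1, n2: number of answers evaluated for each model
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate estimated under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


ALPHA = 0.05  # significance level from the question

# Hypothetical counts, for illustration only
z, p_value = two_proportion_z_test(x1=820, n1=1000, x2=780, n2=1000)
reject = p_value < ALPHA
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

If the p-value falls below α, the difference in relevance rates is statistically significant at that level; otherwise the data do not provide enough evidence to distinguish the two models. For small samples or rates very close to 0 or 1, an exact test such as Fisher's may be more appropriate than this normal approximation.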