LLM Evaluation Frameworks For Product AI
Asked of: Data Scientist

What's being tested
Ability to design practical, measurable evaluation frameworks for LLM-based product features: choose signals (human/automated), tradeoffs (cost vs fidelity), monitoring, and mitigation loops that align with user value and safety.
Core knowledge
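- Offline vs online evaluation: offline (benchmarks, synthetic tests) vs online (randomized A/B experiments, canary rollouts).
- Human evaluation methods: Likert ratings, pairwise preference, crowd vs expert annotators, inter-annotator agreement (Cohen’s kappa for two raters; Fleiss’ kappa or Krippendorff’s alpha for more).
- Automated metrics: BLEU/ROUGE (surface n-gram overlap), BERTScore and MoverScore (embedding-based similarity), and model-based QA evaluation (e.g., QAEval); know the strengths and limits of each.
- Factuality checks: claim verifiers built on FEVER-style data (FEVER itself is a benchmark dataset, not a detector), FactCC consistency classification, QA-based fact-checking, and retrieval-grounding recall for RAG.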
- Safety and bias metrics: toxicity detection (Perspective API), demographic parity, false positive/negative rates, AUROC.
- Observability: telemetry (prompt + response logs), hallucination rate, latency, token cost, confidence calibration (ECE).
- Feedback loops: reward models (preference learning), RLHF, continuous labeling pipelines, and guardrail escalation rules.
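Two of the quantities named above, inter-annotator agreement (Cohen's kappa) and confidence calibration (ECE), are simple enough to compute by hand. The following is a minimal sketch in plain Python, assuming two annotators labeling the same items and a model that emits a confidence in [0, 1] per answer. The function names, the equal-width binning, and the sample labels are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty lists")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical labels: "h" = hallucinated, "ok" = supported.
annotator_1 = ["h", "h", "ok", "ok"]
annotator_2 = ["h", "ok", "ok", "ok"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5

# Hypothetical model confidences and whether each answer was judged correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Kappa corrects raw agreement for chance, which is why 75% raw agreement above yields only 0.5. ECE is sensitive to the number of bins and the sample size, so it is worth reporting both alongside the value.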
Worked example
Sample interview prompt: "Design an evaluation framework to measure hallucination in a production chat assistant." Start by defining hallucination operationally, for example separating factual claims that no trusted source supports from claims that a trusted source contradicts. Propose mixed signals: automated QA-based checks (generate questions from the output and verify the answers against trusted sources), retrieval-grounding rate for RAG, and a sampled human-annotation pipeline with clear labels and agreement targets. Then prioritize metrics (hallucination rate, precision of fact-checked assertions, user-facing severity), lay out a rollout plan (offline test → canary → A/B), and name mitigations (confidence thresholds, source citations, fallback to retrieval).
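The sampled human-annotation signal can be turned into reportable numbers roughly as follows. This is a sketch under stated assumptions: the `AnnotatedResponse` record, its field names, and the severity labels are hypothetical, and the Wilson score interval is one reasonable choice for putting uncertainty on a rate estimated from a small sample.

```python
import math
from dataclasses import dataclass


@dataclass
class AnnotatedResponse:
    """Hypothetical record produced by the human-annotation pipeline."""
    response_id: str
    claims_total: int        # factual claims the annotator identified
    claims_unsupported: int  # claims with no support in trusted sources
    severity: str            # assumed labels: "none", "minor", "major"


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def hallucination_report(sample):
    """Summarize a random sample of annotated production responses."""
    if not sample:
        raise ValueError("sample must be non-empty")
    n = len(sample)
    flagged = sum(1 for r in sample if r.claims_unsupported > 0)
    total_claims = sum(r.claims_total for r in sample)
    unsupported = sum(r.claims_unsupported for r in sample)
    return {
        # Share of responses containing at least one unsupported claim.
        "response_hallucination_rate": flagged / n,
        "rate_ci95": wilson_interval(flagged, n),
        # Share of individual claims that annotators could support.
        "claim_support_rate": (
            (total_claims - unsupported) / total_claims if total_claims else None
        ),
        # Share of responses whose worst issue was labeled "major".
        "major_severity_rate": sum(r.severity == "major" for r in sample) / n,
    }


sample = [
    AnnotatedResponse("r1", 5, 0, "none"),
    AnnotatedResponse("r2", 4, 1, "minor"),
    AnnotatedResponse("r3", 3, 0, "none"),
    AnnotatedResponse("r4", 8, 0, "none"),
]
print(hallucination_report(sample))
```

Reporting the interval alongside the point estimate makes it clear when a sample is too small to distinguish a canary from the baseline.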
A common pitfall
Relying solely on traditional n-gram metrics or a single automated scorer is tempting because it's cheap, but such metrics correlate poorly with user satisfaction and factuality for open-ended outputs. Equally dangerous is optimizing a single aggregate metric (e.g., lower hallucination rate) without considering user-experience tradeoffs such as verbosity, latency, or helpfulness, which can regress when you over-filter or push the model toward more conservative responses.
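One way to guard against single-metric optimization is a release gate that requires the primary metric to improve while capping how much each guardrail metric may worsen. The sketch below assumes hypothetical metric names, values, and tolerances; real thresholds would come from the product's own requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical metric definition for a release gate."""
    name: str
    higher_is_better: bool
    max_regression: float = 0.0  # allowed relative worsening, e.g. 0.05 = 5%


def relative_worsening(base, cand, higher_is_better):
    """Positive when the candidate is worse than the baseline."""
    if base <= 0:
        raise ValueError("baseline must be positive for a relative comparison")
    change = (cand - base) / base
    return -change if higher_is_better else change


def release_decision(baseline, candidate, primary, guardrails):
    """Return (ship, violations): ship only if primary improves and guardrails hold."""
    violations = []
    if relative_worsening(
        baseline[primary.name], candidate[primary.name], primary.higher_is_better
    ) >= 0:
        violations.append(f"{primary.name}: did not improve")
    for spec in guardrails:
        worsening = relative_worsening(
            baseline[spec.name], candidate[spec.name], spec.higher_is_better
        )
        if worsening > spec.max_regression:
            violations.append(f"{spec.name}: worsened by {worsening:.0%}")
    return (not violations), violations


# Illustrative numbers only: hallucinations drop, but refusals triple.
baseline = {"hallucination_rate": 0.08, "helpfulness": 4.2,
            "p95_latency_ms": 1200, "refusal_rate": 0.03}
candidate = {"hallucination_rate": 0.05, "helpfulness": 4.1,
             "p95_latency_ms": 1250, "refusal_rate": 0.09}

primary = MetricSpec("hallucination_rate", higher_is_better=False)
guardrails = [
    MetricSpec("helpfulness", higher_is_better=True, max_regression=0.05),
    MetricSpec("p95_latency_ms", higher_is_better=False, max_regression=0.10),
    MetricSpec("refusal_rate", higher_is_better=False, max_regression=0.50),
]
print(release_decision(baseline, candidate, primary, guardrails))
```

In this illustration the candidate is blocked even though its hallucination rate fell, because the over-conservative behavior shows up in the refusal-rate guardrail.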
Further reading
- OpenAI Evals (GitHub) — pragmatic examples of automated and human eval harnesses.
- BERTScore (Zhang et al., 2020) — semantic similarity metric background and limitations.