LLM Evaluation Frameworks For Product AI — Tech Interview Concept

What's being tested
Ability to design practical, measurable evaluation frameworks for LLM-based product features: choosing signals (human vs automated), weighing tradeoffs (cost vs fidelity), and setting up monitoring and mitigation loops that align with user value and safety.

Core knowledge

Offline vs online evaluation: offline benchmarks and synthetic tests before launch; randomized A/B experiments and canary rollouts on live traffic.
Human evaluation methods: rating (Likert), pairwise preference, crowd vs expert, inter-annotator agreement (Cohen’s kappa).
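Automated metrics: BLEU/ROUGE (surface n-gram overlap), BERTScore and MoverScore (embedding-based similarity), and QA-based evaluation such as QAEval, with the strengths and limits of each.
Factuality checks: verifiers trained on FEVER (a fact-verification benchmark), FactCC (a factual-consistency classifier), QA-based fact-checking, and retrieval-grounding recall for RAG.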
Safety and bias metrics: toxicity detection (Perspective API), demographic parity, false positive/negative rates, AUROC.
Observability: telemetry (prompt + response logs), hallucination rate, latency, token cost, confidence calibration (ECE).
Feedback loops: reward models (preference learning), RLHF, continuous labeling pipelines, and guardrail escalation rules.
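
Two of the statistics named in the list above, Cohen's kappa for inter-annotator agreement and expected calibration error (ECE), are simple enough to compute by hand. The following is a minimal sketch in plain Python, not a reference implementation: the function names are illustrative, kappa is the two-annotator form, and the ECE variant assumes equal-width bins with the lower edge inclusive.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean gap between confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty lists")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

In an interview answer, kappa is the number to quote when setting an agreement target for the annotation pipeline, and ECE is the number to track if confidence scores are used to gate responses.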

Worked example
Sample interview prompt: "Design an evaluation framework to measure hallucination in a production chat assistant." Frame it by first defining hallucination operationally, for example by separating factual claims that lack support in a trusted source from claims that contradict one. Propose mixed signals: automated QA-based checks (generate questions from the output, then verify the answers against trusted sources), retrieval-grounding rate for RAG, and a sampled human-annotation pipeline with clear labels and agreement targets. Then prioritize metrics (hallucination rate, precision of fact-checked assertions, user-facing severity), lay out a rollout plan (offline test → canary → A/B), and name mitigations (confidence thresholds, source citations, fallback to retrieval).
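
The sampled human-annotation signal and the canary step can be sketched as follows. This is an illustration under stated assumptions, not a prescribed design: `AnnotatedResponse`, `summarize`, and the 5% `max_rate` threshold are hypothetical, and the gate uses the upper bound of a 95% Wilson score interval so that a small sample cannot pass on a lucky point estimate.

```python
import math
from dataclasses import dataclass


@dataclass
class AnnotatedResponse:
    """One sampled response after human annotation (hypothetical schema)."""
    response_id: str
    n_claims: int        # factual claims found in the response
    n_unsupported: int   # claims annotators marked as unsupported


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(annotations, max_rate=0.05):
    """Response-level hallucination rate, claim support rate, and a gate."""
    n = len(annotations)
    # A response counts as hallucinated if any claim is unsupported.
    flagged = sum(1 for a in annotations if a.n_unsupported > 0)
    total_claims = sum(a.n_claims for a in annotations)
    unsupported = sum(a.n_unsupported for a in annotations)
    low, high = wilson_interval(flagged, n)
    return {
        "response_hallucination_rate": flagged / n,
        "rate_ci95": (low, high),
        "claim_support_rate": (
            1 - unsupported / total_claims if total_claims else None
        ),
        # Assumed rule: pass only if the interval's upper bound is in budget.
        "gate_passed": high <= max_rate,
    }
```

Reporting the interval alongside the rate also makes the annotation budget discussable: a tighter gate needs more sampled responses, which is the cost vs fidelity tradeoff in concrete terms.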

A common pitfall
Relying solely on traditional n-gram metrics or a single automated scorer is tempting because it's cheap, but these scores correlate poorly with user satisfaction and factuality on open-ended outputs. Equally dangerous is optimizing a single aggregate metric (e.g., a lower hallucination rate) without tracking user-experience tradeoffs such as verbosity, latency, and helpfulness, which can regress when you over-filter or push the model toward overly conservative responses.
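
One way to guard against the second failure mode is to pair the primary metric with explicit guardrail metrics and block a launch when any guardrail regresses past a tolerance. The sketch below assumes a lower-is-better primary metric and relative tolerances. The `Guardrail` type, the metric names, and the thresholds in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    higher_is_better: bool
    max_regression: float  # tolerated relative worsening, e.g. 0.02 = 2%


def relative_worsening(baseline, candidate, higher_is_better):
    """Relative change, signed so that a positive value means 'worse'."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    change = (candidate - baseline) / abs(baseline)
    return -change if higher_is_better else change


def launch_decision(baseline, candidate, primary_metric, guardrails):
    """Ship only if the primary metric improves and no guardrail regresses."""
    # Assumption: the primary metric is lower-is-better.
    primary_improved = candidate[primary_metric] < baseline[primary_metric]
    violations = [
        g.metric
        for g in guardrails
        if relative_worsening(
            baseline[g.metric], candidate[g.metric], g.higher_is_better
        ) > g.max_regression
    ]
    return {
        "ship": primary_improved and not violations,
        "primary_improved": primary_improved,
        "violations": violations,
    }
```

For example, a candidate that halves `hallucination_rate` but drops a `helpfulness` rating from 4.2 to 3.9 would be blocked under a 2% helpfulness tolerance, which is exactly the over-filtering regression described above.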

Further reading

OpenAI Evals (GitHub) — pragmatic examples of automated and human eval harnesses.
BERTScore (Zhang et al., 2020) — semantic similarity metric background and limitations.

Related concepts