LLM Evaluation Frameworks

What's being tested

Interviewers are probing whether you can turn an ambiguous generative-AI product goal into a reliable measurement system, not whether you can list generic metrics like “accuracy” or “BLEU.” For a Meta Data Scientist, the core skill is connecting offline model quality, human preference, safety, latency/cost, and online user/business outcomes into one decision framework. They want to see if you understand that LLM evaluation is multi-objective, noisy, slice-dependent, and vulnerable to Goodhart’s Law. A strong answer balances statistical rigor with product reality: what to measure before launch, what to A/B test after launch, and what guardrails would block shipment even if engagement improves.

Core knowledge

LLM evaluation should be structured around task success, user value, safety, and system constraints. For a Meta product, that could mean answer helpfulness, factuality, refusal correctness, toxicity, policy compliance, latency, inference cost, and downstream metrics like retention, creation, sharing, or support deflection.
Separate offline evaluation, human evaluation, and online experimentation. Offline evals catch regressions quickly; human evals measure nuanced quality; A/B tests estimate real user impact. A model can win offline benchmarks and still hurt product outcomes due to bad UX, slow latency, or mismatched user intent.
Use golden datasets for repeatable regression testing. These should include high-volume intents, rare but critical edge cases, adversarial prompts, policy-sensitive examples, multilingual examples, and demographic/geographic slices. Refresh them periodically because both models and user behavior drift; otherwise teams end up tuning to a stale eval set that no longer reflects production traffic.
For open-ended generation, prefer rubric-based human ratings or pairwise comparisons over single absolute scores. Pairwise judgments are often more reliable: ask “which response is better?” and aggregate with win rate, Elo, or a Bradley-Terry model, which assigns each model $i$ a latent strength $\theta_i$ such that the probability that $i$ beats $j$ is:
$P(i \succ j)=\frac{e^{\theta_i}}{e^{\theta_i}+e^{\theta_j}}$
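As an illustration of the Bradley-Terry aggregation above, here is a minimal sketch that fits the $\theta_i$ values from a list of pairwise outcomes using the standard minorize-maximize update. The function name `fit_bradley_terry` and the `(winner, loser)` input format are assumptions made for this example, and ties are assumed to have been dropped beforehand.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry log-strengths (theta) from (winner, loser) pairs.

    Uses the minorize-maximize (MM) update. Assumes every model has at
    least one win; otherwise its strength estimate collapses to zero.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    models = sorted({m for pair in comparisons for m in pair})
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least one win for a finite estimate")

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # Sum over opponents of n_ij / (p_i + p_j), the MM denominator.
            denom = sum(
                pair_counts.get(frozenset((i, j)), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        shift = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if shift < tol:
            break

    # Report theta = log(strength), centered so the values average to zero.
    theta = {m: math.log(s) for m, s in strength.items()}
    mean_theta = sum(theta.values()) / len(theta)
    return {m: t - mean_theta for m, t in theta.items()}


# Hypothetical data: model_a wins 3 of 4 head-to-head comparisons.
comparisons = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
theta = fit_bradley_terry(comparisons)
```

With only two models, the fitted gap `theta["model_a"] - theta["model_b"]` equals the log-odds of the observed win rate, which is a quick sanity check on the fit.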
Track inter-rater reliability when using human labels. Cohen’s kappa works for two raters; Krippendorff’s alpha handles multiple raters and missing ratings. Low agreement usually means the rubric is ambiguous, the task is subjective, or the annotators need calibration; it does not necessarily mean the model is bad.
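For the two-rater case, Cohen's kappa can be computed directly from its definition, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is the agreement expected by chance. The sketch below assumes categorical labels and two equally long label lists; the function name and sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


rater_1 = ["good", "good", "bad", "bad"]
rater_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(rater_1, rater_2)  # 0.5
```

Here raw agreement is 75%, but half of that is expected by chance given the label frequencies, so kappa is noticeably lower than the raw agreement rate.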
LLM-as-judge can scale evaluations but must be validated against humans. It is useful for first-pass scoring, regression checks, and pairwise preference at large $N$, but it can be biased toward verbosity, response position, outputs from its own model family, or surface fluency. Always benchmark judge agreement against expert labels on a holdout set.
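One way to make the judge-validation step concrete is to compare judge picks with human picks on the same labeled holdout pairs and, as a rough verbosity check, compare how often each side prefers the longer response. This is a sketch under assumptions: the record keys (`human_pick`, `judge_pick`, `len_a`, `len_b`) and the sample data are hypothetical.

```python
def judge_validation_report(records):
    """Compare LLM-judge picks with human picks on a labeled holdout set.

    Each record is a dict with hypothetical keys:
      human_pick, judge_pick: "A" or "B"
      len_a, len_b: response lengths (for example, in tokens)
    """
    if not records:
        raise ValueError("need at least one labeled record")

    agreement = sum(r["human_pick"] == r["judge_pick"] for r in records) / len(records)

    def longer_pick_rate(pick_key):
        # Share of unequal-length pairs where the longer response was picked.
        unequal = [r for r in records if r["len_a"] != r["len_b"]]
        if not unequal:
            return None
        picked_longer = sum(
            (r[pick_key] == "A") == (r["len_a"] > r["len_b"]) for r in unequal
        )
        return picked_longer / len(unequal)

    return {
        "agreement": agreement,
        "judge_longer_rate": longer_pick_rate("judge_pick"),
        "human_longer_rate": longer_pick_rate("human_pick"),
    }


holdout = [
    {"human_pick": "A", "judge_pick": "A", "len_a": 100, "len_b": 50},
    {"human_pick": "B", "judge_pick": "A", "len_a": 200, "len_b": 80},
    {"human_pick": "B", "judge_pick": "B", "len_a": 40, "len_b": 90},
    {"human_pick": "A", "judge_pick": "B", "len_a": 30, "len_b": 120},
]
report = judge_validation_report(holdout)
```

A judge that picks the longer response far more often than humans do on the same pairs is a candidate for verbosity bias, though a real analysis would need a larger sample and confidence intervals.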
Standard NLP metrics have limited but specific uses. BLEU/ROUGE are weak for conversational quality but can help with constrained summarization or translation. Exact match/F1 work for QA with known answers. Embedding similarity can capture semantic closeness, but may miss factual errors, hallucinations, and policy violations.
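For QA with known answers, exact match and token-level F1 are simple to compute. The sketch below follows the common pattern of lowercasing and stripping punctuation before comparing; the exact normalization rules are an assumption and should match whatever the benchmark in use specifies.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Paris.", "paris")              # True
f1 = token_f1("the Eiffel Tower", "Eiffel Tower")  # about 0.8
```

The second example shows why F1 is the softer metric: the extra token costs precision but the answer still gets most of the credit, whereas exact match would score it as wrong.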
Factuality requires specialized checks. For retrieval-augmented generation, evaluate retrieval recall@k, citation precision, answer groundedness, and faithfulness: whether claims are supported by retrieved documents. A common failure is high answer helpfulness but low attribution accuracy.
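Two of the RAG metrics above reduce to set arithmetic once relevance and support labels exist. The sketch below assumes document IDs as strings, a human-labeled set of relevant documents per query, and a human-labeled set of documents that actually support the answer's claims; the function names and IDs are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without any relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def citation_precision(cited_ids, supporting_ids):
    """Fraction of cited documents that were labeled as supporting the answer."""
    cited = set(cited_ids)
    if not cited:
        return None  # no citations, so precision is undefined
    return len(cited & set(supporting_ids)) / len(cited)


# Hypothetical query: two relevant documents, one of them retrieved in the top 3.
recall = recall_at_k(["d1", "d7", "d3", "d9"], {"d3", "d4"}, k=3)  # 0.5
# The answer cites two documents, only one of which supports its claims.
precision = citation_precision(["d3", "d9"], {"d3"})  # 0.5
```

Reporting the two numbers separately helps locate the failure: low recall@k points at the retriever, while low citation precision points at how the generator attributes its claims.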
Safety metrics should be treated as guardrails, not averaged away. A 2% harmful-response rate may be unacceptable even if average helpfulness improves. Track toxicity, self-harm guidance, hate/harassment, sexual content, privacy leakage, jailbreak susceptibility, and refusal quality separately by risk category.
Online LLM experiments need both primary metrics and guardrails. Primary metrics may be task completion, repeat usage, user satisfaction, or creation rate. Guardrails include report/block rates, negative feedback, latency p95/p99, cost per session, crash/error rate, and integrity escalations.
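The guardrail logic described above can be expressed as a small decision rule: any guardrail breach blocks the launch regardless of the primary metric, and only then is the primary lift considered. This is a simplified sketch; the function name, the use of worst-case (upper confidence bound) guardrail values, and the thresholds are all assumptions for illustration.

```python
def ship_decision(primary_lift_ci, guardrails):
    """Return (decision, breached_guardrails).

    primary_lift_ci: (lower, upper) confidence interval for the
        treatment-minus-control lift on the primary metric.
    guardrails: dict of name -> (worst_case_value, max_allowed), where
        worst_case_value is, for example, the upper confidence bound of
        a harm or latency metric in the treatment arm.
    """
    breaches = [
        name for name, (worst_case, limit) in guardrails.items() if worst_case > limit
    ]
    if breaches:
        return "block", breaches  # guardrails are never traded off against lift
    if primary_lift_ci[0] > 0:
        return "ship", []
    return "hold", []  # no harm detected, but no clear win either


decision, reasons = ship_decision(
    primary_lift_ci=(0.01, 0.04),  # hypothetical: engagement clearly improves
    guardrails={
        "harmful_response_rate": (0.02, 0.005),  # hypothetical breach
        "latency_p95_seconds": (1.8, 2.0),
    },
)
```

In this example the primary metric improves with a confidence interval above zero, yet the decision is `"block"` because the harmful-response guardrail is breached.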
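Required sample sizes are often larger than expected because LLM outcomes are heterogeneous and high-variance. For binary metrics, approximate the experiment size per arm with
$n \approx \frac{2(z_{\alpha/2}+z_\beta)^2p(1-p)}{\delta^2}$
where $p$ is the baseline rate, $\delta$ is the minimum detectable absolute difference, and $z_{\alpha/2}$ and $z_\beta$ are the standard normal critical values for the significance level $\alpha$ and the target power $1-\beta$. For rare safety events, use targeted stress tests and sequential monitoring rather than relying only on broad A/B tests.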
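To get a feel for the per-arm formula above, here is a minimal sketch using `statistics.NormalDist` from the Python standard library. The defaults of a 5% two-sided significance level and 80% power, along with the example baseline and effect size, are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p, delta, alpha=0.05, power=0.80):
    """Approximate users per arm needed to detect an absolute change of
    `delta` in a binary metric with baseline rate `p`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for power 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return math.ceil(n)


# Hypothetical: 10% baseline task-completion rate, 1-point absolute lift.
n_common = sample_size_per_arm(p=0.10, delta=0.01)
# Hypothetical: 0.1% harmful-response rate, detecting a 0.05-point change.
n_rare = sample_size_per_arm(p=0.001, delta=0.0005)
```

Under these assumptions the first case needs on the order of 14,000 users per arm, while the rare-event case needs several times more, which illustrates why rare safety events are usually better served by targeted stress tests than by a broad A/B test alone.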
Evaluation must be sliced. Overall win rate can hide regressions for teens, creators, advertisers, low-resource languages, accessibility users, political content, or new users. Meta-scale products require “do no harm” checks on critical segments, not only global averages.
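A simple way to operationalize slice-level checks is to compute a confidence interval on the win rate for each segment and flag any segment whose interval sits entirely below parity. The sketch below uses the Wilson score interval; the slice names, counts, and the 0.5 parity threshold are hypothetical, and ties are assumed to be excluded from the counts.

```python
import math
from statistics import NormalDist


def wilson_interval(wins, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def flag_regressed_slices(slice_results, parity=0.5):
    """Return slices whose win-rate interval lies entirely below `parity`.

    slice_results: dict of slice name -> (candidate_wins, comparisons).
    """
    flagged = []
    for name, (wins, n) in slice_results.items():
        _, upper = wilson_interval(wins, n)
        if upper < parity:
            flagged.append(name)
    return flagged


# Hypothetical results: a 60% win rate on English traffic hides a
# regression on a low-resource language slice.
results = {"english": (600, 1000), "low_resource_language": (35, 100)}
flagged = flag_regressed_slices(results)
```

Because many slices are tested at once, a production version would also need some correction for multiple comparisons; that step is omitted here for brevity.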

Worked example

Design an Evaluation Framework for an LLM-Powered Product Feature

A strong candidate would start by clarifying the product surface: “Is this a consumer assistant, content summarizer, ad-copy generator, or support bot? What user action defines success, and what are the unacceptable failure modes?” They would then state assumptions, for example: “I’ll assume this is a comment-thread summarizer in Facebook Groups, where the goal is to help users understand long discussions faster without misrepresenting the conversation.” The answer should be organized into four pillars: offline model-quality evaluation, human preference and safety review, online experiment design, and post-launch monitoring.

For offline quality, they would propose a curated dataset of representative threads plus edge cases such as controversial topics, multilingual comments, sarcasm, deleted comments, and misinformation-sensitive discussions. For human evaluation, they would define rubrics for accuracy, coverage, concision, neutrality, and harm, then use pairwise comparisons between candidate models and a baseline. For online testing, they would propose an A/B test with primary metrics like summary expansion rate, time-to-understanding proxy, return visits, or user satisfaction, plus guardrails such as report rate, hide/mute rate, latency p95, and policy escalations. A key tradeoff to flag is concision versus faithfulness: shorter summaries may improve consumption but increase omission or distortion risk. The close should show maturity: “If I had more time, I’d add slice-level analysis for language, group type, and sensitive topics, and build a human-review loop for cases where confidence or grounding is low.”

A second angle

Compare Two LLMs for a Generative Assistant

Here the same evaluation logic applies, but the decision is more directly about model selection than product launch. The candidate should frame the comparison as multi-objective: model A may have better helpfulness while model B has lower latency, cost, or safety risk. Instead of asking only “which model has higher average score,” they should propose a benchmark set, pairwise human preference testing, LLM-as-judge validation, adversarial safety testing, and production shadow traffic. The constraint may be that the higher-quality model is too expensive for all traffic, leading to a routing strategy: use the stronger model for complex or high-risk prompts and a cheaper model for simple prompts. The strongest answer recognizes that “best model” depends on prompt distribution, user segment, and product objective.
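The routing idea can be sketched as a threshold rule plus a blended-cost estimate over a sample of traffic. Everything here is hypothetical: the complexity and risk scores are assumed to come from some upstream classifier, and the cutoffs and per-request costs are placeholder values rather than recommendations.

```python
def route_model(complexity, risk, complexity_cutoff=0.7, risk_cutoff=0.5):
    """Send complex or high-risk prompts to the stronger model."""
    if complexity >= complexity_cutoff or risk >= risk_cutoff:
        return "strong"
    return "cheap"


def blended_cost(prompt_scores, cost_strong, cost_cheap):
    """Average per-request cost when routing a sample of prompts.

    prompt_scores: list of (complexity, risk) pairs, each in [0, 1].
    """
    if not prompt_scores:
        raise ValueError("need at least one prompt")
    total = sum(
        cost_strong if route_model(c, r) == "strong" else cost_cheap
        for c, r in prompt_scores
    )
    return total / len(prompt_scores)


# Hypothetical traffic sample: half the prompts are complex or high-risk.
sample = [(0.9, 0.1), (0.2, 0.1), (0.3, 0.8), (0.1, 0.1)]
avg_cost = blended_cost(sample, cost_strong=10.0, cost_cheap=1.0)  # 5.5
```

The same traffic sample can then be used to evaluate quality and safety per route, so the comparison is between routing policies rather than between two models in isolation.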

Common pitfalls

A common analytical mistake is collapsing everything into one average “quality score.” This is tempting because it makes ranking models easy, but it can hide catastrophic failures on safety, factuality, or minority-language slices. A better answer treats some metrics as decision guardrails and reports a metric dashboard with confidence intervals and segment cuts.

A communication mistake is jumping straight into benchmarks like MMLU, HELM, or MT-Bench without tying them to the product. General benchmarks can be useful for sanity checks, but Meta interviewers care whether the evaluation reflects actual users, content policies, latency constraints, and business outcomes. Start with the product goal and failure modes, then choose metrics.

A depth mistake is over-trusting LLM-as-judge. Saying “I’d have GPT-4 score all outputs” sounds scalable but is incomplete. A stronger response explains how the judge is calibrated against human labels, checks for verbosity and position bias, presents responses blind and in randomized order, and reserves human expert review for high-risk categories.
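A position-bias check can be sketched by asking the judge about each pair twice, with the response order swapped, and measuring how often the same response wins both times. The `judge` callable and its `"first"`/`"second"` return convention are assumptions for this example; the two toy judges stand in for a real model call.

```python
def position_consistency(judge, pairs):
    """Share of pairs where the judge picks the same response in both orders.

    judge: callable (prompt, first, second) -> "first" or "second".
    pairs: list of (prompt, response_x, response_y) tuples.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for prompt, resp_x, resp_y in pairs:
        forward = judge(prompt, resp_x, resp_y)
        backward = judge(prompt, resp_y, resp_x)
        winner_forward = resp_x if forward == "first" else resp_y
        winner_backward = resp_y if backward == "first" else resp_x
        consistent += winner_forward == winner_backward
    return consistent / len(pairs)


def always_first(prompt, first, second):
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(prompt, first, second):
    """Toy judge that is order-invariant (but verbosity-biased)."""
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("q1", "short answer", "a much longer answer"),
    ("q2", "detailed reply here", "ok"),
]
biased_score = position_consistency(always_first, pairs)      # 0.0
invariant_score = position_consistency(prefers_longer, pairs)  # 1.0
```

Note that the second toy judge passes this check while still being verbosity-biased, which is why position consistency is a necessary check rather than a sufficient one.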

Connections

Interviewers may pivot from this topic into A/B testing, metric design, ranking evaluation, human labeling systems, or responsible AI. If they push on causal validity, expect questions about experiment power, heterogeneous treatment effects, novelty effects, and interference. If they push on ML depth, expect follow-ups on RAG evaluation, calibration, active learning, and model monitoring.