LLM Evaluation: Offline, Online, and Human Judgment

What's being tested

Interviewers are probing whether you can design an evaluation system for generative AI that connects model quality to user value, product risk, and launch decisions. The key skill is not reciting BLEU, ROUGE, or human eval definitions; it is choosing the right mix of offline metrics, online experiments, and human judgment under ambiguity. For Meta, this matters because LLM features touch billions of user interactions across recommendation surfaces, ads, messaging, creator tools, and integrity systems, where small quality or safety regressions can have large product and reputational impact. A strong candidate can reason about metric validity, evaluator bias, sample construction, causal inference, and the gap between benchmark performance and real user outcomes.

Core knowledge

Offline evaluation is useful for fast iteration, regression detection, and model comparison before exposure to users. Typical datasets include golden human-labeled sets, production logs, adversarial prompts, policy-violating examples, and long-tail slices by language, geography, device, intent, and user cohort.
Automatic text metrics are limited. BLEU and ROUGE measure n-gram overlap against reference texts, so they suit constrained generation such as translation (BLEU) and summarization (ROUGE) better than open-ended chat, where many different answers are acceptable. Embedding similarity, BERTScore, and LLM-as-judge capture more of the semantics but can still miss factuality, safety, tone, and product-specific utility.
Human evaluation should define a rubric with explicit dimensions: helpfulness, correctness, factuality, relevance, safety, harmlessness, tone, completeness, latency tolerance, and policy compliance. Use Likert ratings, pairwise preference, or best-worst scaling; pairwise comparisons often have higher inter-rater reliability than absolute 1–5 scores.
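Inter-rater agreement matters. For categorical labels, Cohen’s kappa (two raters) or Fleiss’ kappa (more than two raters) adjusts for chance agreement:
$\kappa = \frac{p_o - p_e}{1 - p_e}$
where $p_o$ is the observed agreement rate and $p_e$ is the agreement expected by chance. Low agreement may mean the prompts or model outputs are ambiguous, the rubric is vague, or the task requires domain expertise.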
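
The kappa formula above can be checked by hand on a small label set. The following is a minimal sketch for the two-rater (Cohen) case only; the function name `cohens_kappa` and the "safe"/"unsafe" labels are illustrative assumptions, not part of any particular rating pipeline.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: observed fraction of items on which the raters agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance, from each rater's marginal label rates
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two raters on six model outputs
rater_1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
rater_2 = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))  # p_o = 5/6, p_e = 0.5, so kappa = 0.667
```

Here raw agreement is 83%, but kappa is only about 0.67 because half of that agreement would be expected by chance given how often each rater says "safe".
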
LLM-as-judge can scale evaluation but must itself be evaluated. It is vulnerable to position bias, verbosity bias, self-preference, prompt sensitivity, and model-family bias. Mitigations include randomizing answer order, calibrating against human labels, using pairwise judging, requiring rationales, and tracking judge-human correlation.
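
One way to apply the order-randomization mitigation is to query the judge in both orders and accept a winner only when the two verdicts agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns "first", "second", or "tie"; the two toy judges are stand-ins for a real judge model and exist only to show the behavior.

```python
import random

def debiased_preference(prompt, answer_a, answer_b, judge, rng):
    """Ask the judge in both orders; accept a winner only if the verdicts agree.

    `judge` is a hypothetical callable returning "first", "second", or "tie".
    """
    # Randomize which answer is shown first on the initial call
    pair = [("A", answer_a), ("B", answer_b)]
    rng.shuffle(pair)
    verdicts = []
    for first, second in (pair, pair[::-1]):
        choice = judge(prompt, first[1], second[1])
        if choice == "first":
            verdicts.append(first[0])
        elif choice == "second":
            verdicts.append(second[0])
        else:
            verdicts.append("tie")
    # Disagreement across orderings signals position sensitivity; record a tie
    return verdicts[0] if verdicts[0] == verdicts[1] else "tie"

# Toy judges for illustration only
def always_first_judge(prompt, first, second):
    return "first"  # purely position-biased

def length_judge(prompt, first, second):
    # Order-independent, but note it encodes a verbosity bias of its own
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

rng = random.Random(0)
biased = debiased_preference("q", "short", "a longer answer", always_first_judge, rng)
consistent = debiased_preference("q", "short", "a longer answer", length_judge, rng)
print(biased, consistent)  # tie B
```

The fraction of comparisons that flip between orderings is itself a useful diagnostic to track alongside judge-human agreement.
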
Online evaluation should tie to user and business outcomes: retention, session depth, message sends, successful task completion, reduced reformulation, saves/shares, creator satisfaction, complaint rate, hide/report rate, and downstream integrity metrics. For LLM products, “thumbs up/down” is useful but often sparse and biased.
A/B testing remains the gold standard for causal product impact. Estimate treatment effect as
$\Delta = \bar{Y}_T - \bar{Y}_C$
with confidence intervals, guardrail metrics, and power analysis. Randomize at the right unit: user-level for persistent experiences, conversation-level for stateless interactions, cluster-level if spillovers are likely.
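
As a rough illustration of the treatment-effect estimate, the sketch below computes $\Delta$ for a binary outcome such as task success, with a normal-approximation (Wald) interval and an approximate per-arm sample size. It assumes user-level randomization with one independent observation per user; the function names and all counts are hypothetical.

```python
from math import sqrt, ceil
from statistics import NormalDist

def diff_in_proportions(success_t, n_t, success_c, n_c, alpha=0.05):
    """Treatment-minus-control difference in success rate with a Wald interval."""
    p_t, p_c = success_t / n_t, success_c / n_c
    delta = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return delta, (delta - z * se, delta + z * se)

def users_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical experiment: 10,000 users per arm, 53% vs 50% task success
delta, (low, high) = diff_in_proportions(5300, 10_000, 5000, 10_000)
n_needed = users_per_arm(baseline_rate=0.50, min_detectable_lift=0.03)
print(f"delta={delta:.3f}, 95% CI=({low:.3f}, {high:.3f}), users per arm ~ {n_needed}")
```

If randomization is at conversation or cluster level, or users contribute many correlated observations, the standard error here is too small and a clustered estimator is needed instead.
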
LLM experiments need safety guardrails beyond average engagement. Track harmful output rate, hallucination rate, policy violation rate, misinformation exposure, sensitive-topic failure rate, escalation to human review, latency p95/p99, GPU cost per successful task, and user trust signals.
Offline-online correlation is not guaranteed. A model may improve benchmark helpfulness but hurt retention due to latency, overlong answers, or worse personalization. Maintain a “metric ladder”: low-cost offline evals, human preference tests, limited dogfooding, staged rollout, then A/B launch with guardrails.
Sampling strategy is central. Random production samples estimate average quality, but stratified and stress-test samples reveal failures in rare but important segments. Oversample high-risk categories such as minors, health, elections, self-harm, hate speech, low-resource languages, and creator monetization workflows.
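
The sketch below illustrates why oversampled slices must be reweighted before reporting an average. It assumes each log row carries a `category` field and, for brevity, a precomputed synthetic `score`; in practice scores would come from raters or a judge after sampling. All names and numbers are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(logs, per_stratum, rng):
    """Draw up to `per_stratum` prompts from each category.

    Also returns each category's share of the full log, which is needed to
    reweight the sample back to a production-average estimate.
    """
    by_stratum = defaultdict(list)
    for row in logs:
        by_stratum[row["category"]].append(row)
    sample, shares = [], {}
    for category, rows in by_stratum.items():
        shares[category] = len(rows) / len(logs)
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample, shares

def reweighted_mean(scored_sample, shares):
    """Population-weighted mean of per-category mean scores."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in scored_sample:
        totals[row["category"]] += row["score"]
        counts[row["category"]] += 1
    return sum(shares[c] * totals[c] / counts[c] for c in counts)

# Synthetic logs: a common category that scores well, a rare one that fails
logs = ([{"category": "general", "score": 1.0} for _ in range(950)]
        + [{"category": "self_harm", "score": 0.0} for _ in range(50)])
sample, shares = stratified_sample(logs, per_stratum=20, rng=random.Random(0))
naive = sum(r["score"] for r in sample) / len(sample)
weighted = reweighted_mean(sample, shares)
print(len(sample), naive, round(weighted, 3))  # 40 0.5 0.95
```

The unweighted sample mean (0.5) overstates the failure rate for average traffic, while the reweighted mean (0.95) hides the rare-slice failure, so both the per-slice scores and the reweighted average should be reported.
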
Statistical evaluation of pairwise preferences often uses win rate:
$\text{Win Rate} = \frac{\text{wins} + 0.5 \times \text{ties}}{\text{total comparisons}}$
Use confidence intervals from bootstrap or binomial approximations. Beware repeated prompts, correlated raters, and multiple comparisons across many model variants.
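
A minimal sketch of the win-rate formula with a bootstrap interval follows. To respect the warning about repeated prompts and correlated raters, it resamples whole prompts rather than individual ratings (a cluster bootstrap). The data is synthetic and the function names are illustrative assumptions.

```python
import random
from collections import defaultdict

def win_rate(outcomes):
    """Outcomes are "win", "tie", or "loss" from the new model's point of view."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

def cluster_bootstrap_ci(judgments, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap that resamples whole prompts, not single ratings.

    `judgments` is a list of (prompt_id, outcome) pairs. Several raters may
    judge the same prompt, so ratings within a prompt are correlated.
    """
    by_prompt = defaultdict(list)
    for prompt_id, outcome in judgments:
        by_prompt[prompt_id].append(outcome)
    clusters = list(by_prompt.values())
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = rng.choices(clusters, k=len(clusters))
        stats.append(win_rate([o for cluster in resampled for o in cluster]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic data: 100 prompts, 3 raters each, raters agree within a prompt
judgments = []
for prompt_id in range(100):
    outcome = "win" if prompt_id < 55 else "tie" if prompt_id < 65 else "loss"
    judgments.extend((prompt_id, outcome) for _ in range(3))

rate = win_rate([o for _, o in judgments])
lower, upper = cluster_bootstrap_ci(judgments)
print(f"win rate={rate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Treating the 300 ratings as independent would give a narrower interval than the data supports, since there are effectively only 100 independent prompts.
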
Launch decisions should combine quality, safety, latency, and cost. A model with a 3-percentage-point gain in preference win rate may still not ship if it adds 500 ms of p95 latency or doubles inference cost. Conversely, a cheaper distilled model may ship if its quality is statistically non-inferior to the baseline and it improves reliability.

Worked example

For “Design an evaluation framework for an LLM-powered assistant,” a strong candidate would first clarify the product surface: is the assistant for search, messaging, ads creation, customer support, or internal productivity? They would ask what “good” means for the user, whether the task has verifiable answers, what safety risks exist, and whether the goal is model selection, launch readiness, or post-launch monitoring. The answer should then be organized around four pillars: offline eval, human eval, online experimentation, and continuous monitoring. Offline, they would propose a representative prompt set from production logs plus curated slices for high-risk and long-tail cases, evaluated with task-specific metrics rather than generic BLEU. For human judgment, they would define a rubric for helpfulness, correctness, safety, and tone, use pairwise comparisons against a baseline, and measure inter-rater agreement. Online, they would run a staged A/B test with primary metrics like task success or retention, plus guardrails such as reports, harmful response rate, latency p95, and cost per completed task. One tradeoff to flag explicitly is that LLM-as-judge enables scale and rapid iteration but cannot replace calibrated human review for safety-sensitive categories. A strong close would be: if there were more time, they would analyze offline-online metric correlation, build segment-level dashboards, and establish a rollback threshold for safety regressions.

A second angle

For “How would you decide whether a new model version is better than the current one?”, the framing shifts from designing the whole evaluation system to making a launch decision under uncertainty. The same ideas apply, but the answer should emphasize comparison: baseline versus new model, a fixed prompt set, blinded human preference, a statistically significant win rate, and an online treatment effect. A strong answer separates model-quality metrics from product metrics: the new model may win human preference but lose online because it is slower, more verbose, or less aligned with the product’s interaction pattern. It should also discuss non-inferiority for guardrails: the new model must not regress beyond a pre-agreed margin on safety, latency, cost, or key demographic and language slices. The best close is a decision rule, such as shipping only if the model improves task success by a meaningful amount while staying within pre-defined safety and infrastructure thresholds.
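
A decision rule of this kind can be written down explicitly. The sketch below is one possible formulation, not a standard: it assumes each metric is reported as a new-minus-baseline difference with a confidence interval, requires the primary metric's interval to clear a minimum effect, and requires each guardrail's worst case to stay within a non-inferiority margin. All metric names, intervals, and margins are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MetricEstimate:
    """New-minus-baseline difference with a confidence interval."""
    name: str
    ci_low: float
    ci_high: float
    higher_is_better: bool = True

def worst_case(m: MetricEstimate) -> float:
    """Most pessimistic end of the interval, signed so that larger is better."""
    return m.ci_low if m.higher_is_better else -m.ci_high

def ship_decision(primary, min_effect, guardrails, margins):
    """Ship only if the primary metric clears `min_effect` and every guardrail
    is non-inferior, i.e. its worst case is no worse than -margin."""
    reasons = []
    if worst_case(primary) < min_effect:
        reasons.append(f"{primary.name}: improvement not established")
    for g in guardrails:
        if worst_case(g) < -margins[g.name]:
            reasons.append(f"{g.name}: non-inferiority not established")
    return (not reasons), reasons

# Hypothetical experiment readout
primary = MetricEstimate("task_success", ci_low=0.012, ci_high=0.041)
guardrails = [
    MetricEstimate("harmful_output_rate", -0.0004, 0.0003, higher_is_better=False),
    MetricEstimate("p95_latency_ms", 120.0, 310.0, higher_is_better=False),
]
margins = {"harmful_output_rate": 0.0005, "p95_latency_ms": 200.0}

ship, reasons = ship_decision(primary, 0.01, guardrails, margins)
print(ship, reasons)  # False ['p95_latency_ms: non-inferiority not established']
```

In this readout the new model improves task success and holds the safety guardrail, but the latency interval extends past the 200 ms margin, so the rule blocks the launch. Fixing the margins before the experiment runs is what keeps the rule from being tuned to the result.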

Common pitfalls

A common analytical mistake is over-indexing on automatic metrics like ROUGE, BLEU, or benchmark accuracy. For open-ended LLM products, these metrics can be weak proxies for user value and may reward superficial overlap or verbosity. A better answer explains when automatic metrics are useful, then adds human preference, task success, safety review, and online causal measurement.

A communication mistake is giving a generic “we would A/B test it” answer without defining the unit of randomization, primary metric, guardrails, or launch threshold. That sounds product-aware but shallow. A stronger response says, for example, “I would randomize at user level, use successful task completion as the primary metric, monitor harmful output rate and p95 latency as guardrails, and segment by language and intent.”

A depth mistake is treating human evaluation as objective ground truth. Human raters disagree, rubrics can be ambiguous, and crowd workers may lack domain expertise for medical, financial, political, or policy-sensitive content. Strong candidates discuss rater training, calibration examples, inter-rater reliability, adjudication, blinded evaluation, and bias checks.

Connections

Expect pivots into experimentation design, especially power analysis, interference, sequential testing, and guardrail metrics. Interviewers may also connect this to ranking/recommendation evaluation, integrity measurement, fairness across demographic or language slices, and cost-latency tradeoffs in ML systems. If they push on causal validity, be ready to discuss A/B testing limitations, novelty effects, heterogeneous treatment effects, and offline-online metric correlation.