LLM Evaluation, Human Preference, and Safety

What's being tested

Interviewers are testing whether you can turn ambiguous product goals for an AI system into measurable, reliable, and decision-ready evaluation frameworks. For a Data Scientist at Meta, this means connecting human preference, model quality, safety risk, and business impact without over-trusting any single benchmark or A/B metric. The probe is usually not “do you know RLHF?” but “can you design an evaluation that detects regressions, handles noisy raters, protects users, and supports launch decisions under tradeoffs?” Strong answers combine statistical rigor, product judgment, and awareness of LLM-specific failure modes like hallucination, jailbreaks, sycophancy, and distribution shift.

Core knowledge

Offline evaluation for LLMs should separate task quality, safety, latency, and cost. Common metrics include win_rate, factuality rate, refusal accuracy, toxicity rate, p95_latency, and cost per 1K tokens; a launch decision usually needs a metric suite, not one scalar score.
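Human preference evaluation often uses pairwise comparisons because humans are better at relative judgments than absolute ratings. If model A beats model B on $w$ of $n$ prompts, estimate $\hat{p}=w/n$ with the approximate 95% confidence interval $\hat{p}\pm1.96\sqrt{\hat{p}(1-\hat{p})/n}$. This normal approximation assumes independent comparisons and degrades for small $n$ or win rates near 0 or 1, where a Wilson interval is safer.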
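As a minimal sketch of the interval above, the following computes the win rate and its normal-approximation bounds. The function name `win_rate_ci` and the counts are illustrative assumptions, and ties are assumed to have been dropped before counting.

```python
import math


def win_rate_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (p_hat, lower, upper) for a pairwise win rate (Wald interval)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p_hat = wins / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1]; the Wald interval can overshoot near the boundaries.
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical counts: candidate wins 560 of 1,000 non-tied comparisons.
p_hat, lower, upper = win_rate_ci(wins=560, n=1000)
print(f"win rate = {p_hat:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With these hypothetical counts the lower bound sits above 0.5, which is the usual check for "the candidate is preferred more often than not."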
Bradley-Terry models convert pairwise wins into latent quality scores:
$P(i \succ j)=\frac{e^{s_i}}{e^{s_i}+e^{s_j}}$
This underlies systems like Chatbot Arena-style rankings and is preferable to raw win rate when comparisons are sparse or not fully crossed.
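One way to fit the Bradley-Terry scores is the classic minorization-maximization (MM) iteration, sketched below in plain Python. The function name, the win counts, and the fixed iteration count are illustrative assumptions. The sketch also assumes every model has at least one win and that the comparison graph is connected; otherwise the maximum-likelihood scores are not finite.

```python
import math


def fit_bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """wins[(a, b)] = times a beat b. Returns scores s_i, centered at mean zero."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}  # strength_i = exp(s_i)
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in models:
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if j != i and n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        # Scores are only identified up to a constant, so fix the geometric mean at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return {m: math.log(v) for m, v in strength.items()}


# Hypothetical sparse data: A and C were never compared directly.
scores = fit_bradley_terry({("A", "B"): 60, ("B", "A"): 40,
                            ("B", "C"): 70, ("C", "B"): 30})
p_a_beats_c = 1.0 / (1.0 + math.exp(scores["C"] - scores["A"]))
print(scores, round(p_a_beats_c, 3))
```

The fitted scores imply a win probability for the A versus C pairing even though it was never observed, which raw win rates cannot provide.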
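Elo ratings are an online approximation for pairwise quality, updated as $R_i' = R_i + K(S_i - E_i)$, where $S_i$ is the observed outcome (1 for a win, 0.5 for a tie, 0 for a loss), $E_i = 1/(1+10^{(R_j-R_i)/400})$ is the expected score against opponent $j$, and $K$ is the step size. They are easy to explain but sensitive to match ordering, prompt mix, and adversarial sampling; use bootstrapping for uncertainty.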
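The update rule and its order sensitivity can be illustrated with a short sketch. The step size `k=32`, the starting rating of 1000, and the 400-point logistic scale are conventional chess defaults used here as assumptions, and the two-match log is made up.

```python
def expected_score(r_i: float, r_j: float) -> float:
    """Expected score of player i against player j on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((r_j - r_i) / 400))


def run_elo(matches, k: float = 32.0, start: float = 1000.0) -> dict[str, float]:
    """matches: sequence of (winner, loser) pairs, processed in order."""
    ratings: dict[str, float] = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        e_w = expected_score(r_w, r_l)
        ratings[winner] = r_w + k * (1.0 - e_w)  # S = 1 for the winner
        ratings[loser] = r_l - k * (1.0 - e_w)   # S = 0, E = 1 - e_w for the loser
    return ratings


# Same two results, opposite order: the final ratings differ.
matches = [("A", "B"), ("B", "A")]
forward = run_elo(matches)
backward = run_elo(list(reversed(matches)))
print(forward, backward)
```

Both orderings contain one win each, yet whichever model won most recently ends slightly ahead, which is why resampling or shuffling the match log is a common way to attach uncertainty to Elo-style rankings.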
Inter-rater reliability matters because preference labels are noisy. Use Cohen’s kappa for two raters, Fleiss’ kappa for multiple raters, or Krippendorff’s alpha when some labels are missing; $\kappa < 0.4$ usually signals unclear rubrics or subjective tasks.
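For the two-rater case, Cohen's kappa is observed agreement corrected for the agreement expected by chance, $\kappa = (p_o - p_e)/(1 - p_e)$. A minimal sketch follows; the function name and the ten labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: which response each rater preferred on ten prompts.
rater_1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
rater_2 = ["A", "A", "A", "A", "B", "B", "B", "B", "A", "A"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

Here the raters agree on 7 of 10 items, but chance agreement is 0.5, so kappa is only about 0.4, right at the threshold mentioned above.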
Annotator aggregation should handle rater bias and expertise. Majority vote is fine for simple labels, but Dawid-Skene, MACE, or item-response models estimate latent truth and worker reliability, useful when raters vary on safety, medical, or policy-sensitive judgments.
Safety evaluation needs both benign and adversarial prompt sets. Track categories such as self-harm, hate, sexual content, violence, cyber abuse, privacy leakage, and illegal advice; measure both unsafe compliance and over-refusal, since excessive refusals degrade usefulness.
Refusal metrics should distinguish true positives (refusing harmful prompts) from false positives (refusing benign ones). For harmful prompts, measure refusal recall; for benign prompts, measure helpful compliance. A model with 99% harmful refusal but 20% benign over-refusal may be safer but worse for product engagement.
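The two-sided view of refusals can be sketched as follows. The record format and function name are assumptions, and "compliance" on benign prompts is approximated here as "did not refuse"; a real evaluation would also grade whether the answer was actually helpful.

```python
def refusal_metrics(records: list[tuple[bool, bool]]) -> dict[str, float]:
    """records: (prompt_is_harmful, model_refused) pairs from a labeled eval set."""
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]
    if not harmful or not benign:
        raise ValueError("need both harmful and benign prompts")
    over_refusal = sum(benign) / len(benign)
    return {
        "harmful_refusal_recall": sum(harmful) / len(harmful),
        "benign_over_refusal_rate": over_refusal,
        "benign_compliance_rate": 1.0 - over_refusal,
    }


# Hypothetical set mirroring the example above: 99% harmful refusal, 20% over-refusal.
records = ([(True, True)] * 99 + [(True, False)] * 1
           + [(False, True)] * 20 + [(False, False)] * 80)
metrics = refusal_metrics(records)
print(metrics)
```

Reporting the two rates separately keeps a safety gain on harmful prompts from hiding a usability loss on benign ones.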
LLM-as-judge can scale evaluation but introduces bias, including position, verbosity, and self-preference bias. It is useful for triage across tens of thousands of samples, but validate it against human labels, randomize answer order, blind model identity, and measure agreement between the judge and human raters; avoid using the same model family as judge and candidate.
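One common mitigation for position bias is to query the judge in both answer orders and count a win only when the verdicts agree. The sketch below assumes a hypothetical `judge` callable that returns `"first"`, `"second"`, or `"tie"`; the two toy judges stand in for a real model call.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (prompt, shown_first, shown_second) -> verdict


def debiased_verdict(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Return "A", "B", or "tie", requiring consistency across both orderings."""
    a_shown_first = judge(prompt, resp_a, resp_b)
    b_shown_first = judge(prompt, resp_b, resp_a)
    if a_shown_first == "first" and b_shown_first == "second":
        return "A"
    if a_shown_first == "second" and b_shown_first == "first":
        return "B"
    return "tie"  # inconsistent or tied verdicts carry no preference signal


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge with verbosity bias but no position bias."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_verdict(always_first, "q", "short", "a longer answer"))
print(debiased_verdict(prefers_longer, "q", "short", "a longer answer"))
```

The swap neutralizes the position-biased judge but not the verbosity-biased one, which is why validation against human labels is still needed.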
Benchmark contamination is a major edge case. Public datasets like MMLU, TruthfulQA, HumanEval, and MT-Bench may appear in training corpora, so use private holdouts, time-split datasets, canary prompts, and live traffic samples to estimate real generalization.
Online experiments must define user-level success metrics beyond thumbs-up. Track DAU, retention, session depth, task completion, report rate, regeneration rate, copy/share rate, and negative feedback; guardrails should include safety incidents, latency, and compute cost.
Power analysis is harder with rare safety events. If the baseline unsafe rate is 0.1%, detecting a 20% relative reduction (to 0.08%) with a standard two-proportion test at 80% power and a two-sided 5% significance level needs roughly 350,000 independent observations per arm; combine production telemetry with targeted red-team sets and sequential monitoring rather than relying only on ordinary A/B tests.
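The sample-size arithmetic for the rare-event case can be sketched with the standard two-proportion formula $n = (z_{1-\alpha/2} + z_{1-\beta})^2\,[p_1(1-p_1) + p_2(1-p_2)]/(p_1 - p_2)^2$. The function name and the defaults (two-sided 5% significance, 80% power) are assumptions, and the formula treats observations as independent.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control: float, relative_reduction: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate observations per arm to detect a relative drop in an event rate."""
    p_treat = p_control * (1.0 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treat * (1.0 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treat) ** 2)


# Baseline unsafe rate 0.1%, target 20% relative reduction.
needed = n_per_arm(p_control=0.001, relative_reduction=0.20)
print(f"{needed:,} observations per arm")
```

Because the required sample grows with the inverse square of the absolute difference, halving the detectable effect roughly quadruples the traffic needed, which motivates supplementing A/B tests with targeted red-team sets.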

Worked example

For “Design an evaluation framework for a new Meta AI assistant,” start by clarifying the assistant’s surface, target users, and launch threshold: is this for Messenger, Instagram, WhatsApp, or internal productivity, and is the goal engagement, task success, safety, or trust? State assumptions early: “I’ll design a pre-launch offline evaluation, a human preference study, and an online monitoring plan, with separate gates for quality and safety.” Organize the answer around four pillars: task coverage, human preference measurement, safety/red-team evaluation, and online experimentation.

For task coverage, propose a representative prompt taxonomy: open-ended chat, search-like factual questions, creative writing, coding, multilingual queries, image-related prompts if relevant, and high-risk domains. For preference, use pairwise blinded comparisons between the candidate and baseline model, estimate win_rate with confidence intervals, and fit a Bradley-Terry model if comparing multiple variants. For safety, build adversarial sets across policy categories and track harmful compliance, benign over-refusal, and escalation to safety classifiers or fallback responses. For online testing, recommend a small holdback or staged rollout with thumbs_up_rate, report rate, task completion, p95_latency, token cost, and severe-incident guardrails.

A key tradeoff to flag is that optimizing for human preference can reward verbosity, confidence, or sycophancy even when factuality worsens. To address that, separate “preference” from “grounded correctness” and include expert fact-checking for sampled factual prompts. Close by saying that with more time, you would add longitudinal monitoring for distribution shift, subgroup analysis by language/region, and post-launch red teaming for emerging jailbreaks.

A second angle

For “How would you evaluate whether an RLHF update improved model helpfulness without hurting safety?”, the same ideas apply, but the constraint is regression detection rather than greenfield evaluation. The strongest framing is to compare the new policy against the current production model on a fixed private eval set, fresh traffic samples, and targeted safety prompts. Preference win rate alone is insufficient because RLHF can increase helpfulness while increasing unsafe compliance or hallucination. You would recommend launch gates such as statistically significant helpfulness lift, no degradation on harmful refusal recall, no increase in severe safety events, and acceptable p95_latency or cost. The emphasis shifts from “what should we measure?” to “what evidence is enough to ship?”
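The launch gates described above can be written down as explicit checks, which makes the ship decision auditable. This is a sketch only: the field names, the latency budget, the zero recall tolerance, and all numbers are hypothetical, and the refusal-recall and incident gates compare point estimates without accounting for sampling noise.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResult:
    wins: int                   # candidate wins vs. production, ties excluded
    comparisons: int
    refusal_recall_new: float   # on the targeted harmful-prompt set
    refusal_recall_prod: float
    severe_events_new: int
    severe_events_prod: int
    p95_latency_ms: float


def launch_gates(r: EvalResult, recall_tolerance: float = 0.0,
                 latency_budget_ms: float = 2000.0) -> dict[str, bool]:
    """Evaluate each gate separately so a failure is attributable."""
    p_hat = r.wins / r.comparisons
    lower = p_hat - 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / r.comparisons)
    return {
        "helpfulness_lift": lower > 0.5,  # 95% lower bound beats parity
        "refusal_recall_held": r.refusal_recall_new >= r.refusal_recall_prod - recall_tolerance,
        "no_more_severe_events": r.severe_events_new <= r.severe_events_prod,
        "latency_ok": r.p95_latency_ms <= latency_budget_ms,
    }


# Hypothetical result: more helpful, but harmful refusal recall slipped.
gates = launch_gates(EvalResult(wins=540, comparisons=1000,
                                refusal_recall_new=0.985, refusal_recall_prod=0.990,
                                severe_events_new=0, severe_events_prod=0,
                                p95_latency_ms=1800.0))
print(gates, "ship" if all(gates.values()) else "hold")
```

In this made-up case the helpfulness gate passes while the refusal-recall gate fails, so the update is held despite the preference win.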

Common pitfalls

Pitfall: Treating human preference as ground truth.

A tempting answer is “we’ll ask users which response they like and ship the higher win-rate model.” That misses that users may prefer confident but false, entertaining but unsafe, or longer responses. A better answer separates preference, factuality, policy compliance, and downstream product outcomes.

Pitfall: Ignoring sampling and rater design.

Many candidates propose “collect 1,000 ratings” without saying from whom, on what prompts, or with what rubric. Stronger answers stratify prompts by task and risk category, blind model identity, randomize answer order, measure inter-rater reliability, and predefine tie/unsafe handling.

Pitfall: Over-indexing on public benchmarks.

Saying “we’ll use MMLU and MT-Bench” sounds concrete but shallow if not paired with private, product-specific evaluation. Public benchmarks are useful for comparability, but Meta-scale deployment needs traffic-representative prompts, multilingual coverage, adversarial testing, and online guardrails.

Connections

Interviewers may pivot from this topic into A/B testing, especially power analysis for rare harms and sequential rollouts. They may also ask about causal inference, ranking systems, content integrity, or responsible AI policy enforcement, since LLM evaluation often combines product metrics with safety constraints and human judgment.