Anthropic LLM Evaluation Metrics

What's being tested

Interviewers probe your ability to translate high-level product goals for an LLM (helpfulness, harmlessness, reliability) into operational metrics, design robust evaluation experiments, and reason statistically about noisy human labels and automated signals. They want to see you balance practical tradeoffs (label cost, sensitivity, timeliness) with rigorous inference (power, significance, bias correction) and communicate what a measured change actually implies for users.

Core knowledge

Metric definition: A good metric is measurable, sensitive, interpretable, and aligned to product value. Distinguish primary (launch decision) vs guardrail metrics and define units (per-prompt, per-session, per-user).
Human evaluation signals: Use pairwise preference and Likert ratings; handle ordinal outcomes with appropriate models (ordinal logistic) and report inter-annotator variability, not just means.
Annotator modeling: Model annotator bias and variance with mixed-effects models, or use Dawid-Skene-style EM to estimate true labels and per-annotator confusion matrices.
Agreement metrics: Report Cohen's kappa (two annotators, categorical labels), Krippendorff's α (any number of annotators, nominal through interval data, tolerates missing ratings), and percent agreement; low agreement implies noisy labels and inflates required sample sizes.
Preference-scaling: For pairwise comparisons use Bradley-Terry or TrueSkill to estimate item scores and transitive ranking from sparse comparisons; interpret win-rates and confidence intervals.
Composite vs constraint-based decisions: Either weight metrics into a single utility (requires stakeholder weighting) or enforce constraints (e.g., safety false-positive < X) while optimizing helpfulness.
Automatic metrics as signals: BLEU/ROUGE/embedding-based similarity and LLM-based classifiers are useful as noisy proxies; quantify signal fidelity (precision/recall) against human labels before trusting them for experiments.
Statistical testing & power: For differences in proportions, approximate sample size per arm:

$n \approx \frac{(Z_{1-\alpha/2}\sqrt{2\bar p(1-\bar p)} + Z_{1-\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)})^2}{(p_1-p_2)^2}.$

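Here $\bar p = (p_1 + p_2)/2$ is the average of the two proportions, $\alpha$ is the two-sided significance level, and $1-\beta$ is the target power. Always pre-specify $\alpha$, $\beta$, and the minimum detectable effect $|p_1 - p_2|$.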
Multiple comparisons & sequential testing: Apply corrections (Bonferroni, Benjamini-Hochberg) or sequential methods (alpha-spending, Bayesian approaches) when running many slices or iterative launches.
Calibration & probabilistic outputs: For binary safety classifiers measure calibration with Brier score and Expected Calibration Error (ECE); a well-calibrated model assigns probabilities that match empirical frequencies.
Slices, fairness, and heterogeneity: Predefine slices (prompt style, locale, user cohort) and power them separately; heterogeneous treatment effects are as important as global averages.
Monitoring & rollout: Use short-term A/B for signal, then ramp with cohort analysis and statistical guarantees (e.g., non-inferiority tests for safety metrics).
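
The non-inferiority test mentioned for safety metrics can be sketched as a one-sided z-test on the difference in violation rates. This is a minimal illustration, not a prescribed method: the function name, the unpooled Wald standard error, and the example counts are all assumptions, and a production analysis might prefer a score-based interval for small counts.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_test(viol_control, n_control, viol_treat, n_treat,
                        margin, alpha=0.05):
    """One-sided test that the treatment violation rate is not worse than
    control by `margin` or more (absolute difference in proportions).

    H0: p_treat - p_control >= margin   (treatment is unacceptably worse)
    H1: p_treat - p_control <  margin   (treatment is non-inferior)
    """
    p_c = viol_control / n_control
    p_t = viol_treat / n_treat
    # Unpooled (Wald) standard error of the difference; an assumption here.
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    if se == 0:
        raise ValueError("standard error is zero; rates are 0 or 1 in both arms")
    z = (p_t - p_c - margin) / se
    p_value = NormalDist().cdf(z)  # small when the observed gap is well below margin
    return {"diff": p_t - p_c, "z": z, "p_value": p_value,
            "non_inferior": p_value < alpha}


if __name__ == "__main__":
    # Hypothetical counts: 2.0% vs 2.1% violation rate, 1-point margin.
    print(noninferiority_test(100, 5000, 105, 5000, margin=0.01))
```

Note that failing to reject here means "not shown to be safe enough", which is the conservative default a safety guardrail usually wants.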

Worked example — designing metrics for a helpfulness vs safety tradeoff

Frame the problem: ask what "helpful" and "safe" mean for the product, how stakeholders weight them, and what safety thresholds are acceptable; clarify the labeling budget and what production telemetry is available. Organize the solution around three pillars: (1) operational definitions and labels (binary safety categories, 5-point helpfulness Likert), (2) evaluation protocol (mixed panel of experts and crowd workers, balanced prompt bank, blinding), and (3) decision rule (constraint-first or utility-weighted composite). Explicit decision: prefer a constraint-based approach (safety must not degrade by more than X absolute percentage points) because safety failures cause asymmetric user harm, and flag this tradeoff out loud. Statistical steps: estimate inter-annotator agreement, compute the sample size required to detect the minimum useful effect on helpfulness under the safety constraint, and pre-register the tests along with a multiple-comparison plan. Close: "if I had more time, I'd run pilot labeling to estimate variance components, build annotator calibration curves, and validate any automatic safety classifier against held-out human labels."
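
The sample-size step can be made concrete with the two-proportion approximation from the Core knowledge section. The sketch below assumes helpfulness has been binarized (for example, "rated 4 or 5" on the Likert scale), which is an illustrative simplification; the function name and the 50% to 55% example rates are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to detect p1 vs p2 with a
    two-sided test at level alpha and the given power (1 - beta)."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("p1 and p2 must lie strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("p1 and p2 must differ (effect size is zero)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1-alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{1-beta}
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical: baseline 50% "helpful" rate, want to detect a lift to 55%.
    print(sample_size_two_proportions(0.50, 0.55))
```

With these illustrative inputs the result is on the order of 1,500 to 1,600 rated responses per arm. Halving the detectable effect roughly quadruples the requirement, which is why the minimum useful effect should be agreed before labeling starts. Noisy labels shrink the observable effect, so a pilot estimate of agreement should feed into the choice of `p1` and `p2`.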

A second angle — detecting post-deployment safety regressions

Now focus on continuous monitoring: set rolling 7-day and 28-day baselines and run time-series tests for shifts in safety incident rates, adjusting for traffic composition and drift in the prompt distribution. Use control charts or Bayesian change-point detection combined with stratified analyses (new users, particular locales). Where human labeling is too slow, trigger triage: sample incidents flagged by an automatic classifier for rapid human review, estimate classifier precision to adjust alarm thresholds, and, if an alert passes human-confirmed thresholds, initiate rollback or ramp-down. The emphasis is on keeping the false-alarm rate low enough to be operationally sustainable while staying sensitive to real regressions.
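
One of the simpler control-chart options is a p-chart: compare each day's incident rate against an upper control limit derived from the baseline rate and that day's volume. The sketch below is a minimal version under stated assumptions: the function name, the 3-sigma default, and the example counts are hypothetical, and it does not adjust for traffic composition, so in practice it would be run per stratum.

```python
from math import sqrt


def p_chart_alerts(baseline_incidents, baseline_total, daily_counts, sigmas=3.0):
    """Flag days whose incident rate exceeds the upper control limit.

    daily_counts: list of (incidents, total_requests) tuples, one per day.
    Returns a list of (day_index, observed_rate, upper_control_limit).
    """
    p0 = baseline_incidents / baseline_total  # e.g. pooled 28-day baseline
    alerts = []
    for day, (incidents, total) in enumerate(daily_counts):
        if total == 0:
            continue  # no traffic, nothing to evaluate
        # The limit widens on low-volume days, which reduces false alarms.
        ucl = p0 + sigmas * sqrt(p0 * (1 - p0) / total)
        rate = incidents / total
        if rate > ucl:
            alerts.append((day, rate, ucl))
    return alerts


if __name__ == "__main__":
    # Hypothetical: 1% baseline incident rate, three days of sampled traffic.
    days = [(10, 1000), (25, 1000), (19, 1000)]
    for day, rate, ucl in p_chart_alerts(280, 28000, days):
        print(f"day {day}: rate={rate:.4f} exceeds limit={ucl:.4f}")
```

An alert from a chart like this would feed the triage step described above rather than trigger a rollback directly.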

Common pitfalls

Pitfall: Treating automatic metrics as ground truth.
Relying on BLEU, embedding similarity, or LLM-based classifiers without validating them against human labels can leave failures undetected; always quantify the precision and recall of automated signals and propagate that uncertainty into decisions.
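
Quantifying the fidelity of an automated signal can be as simple as scoring it against a human-labeled validation set and reporting intervals rather than point estimates. The sketch below assumes binary labels (1 = unsafe) and uses a Wilson score interval; the function names and the tiny example data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def proxy_fidelity(human_labels, auto_labels):
    """Precision and recall of an automated binary signal vs human labels."""
    if len(human_labels) != len(auto_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, auto_labels))
    tp = sum(1 for h, a in pairs if h and a)
    fp = sum(1 for h, a in pairs if not h and a)
    fn = sum(1 for h, a in pairs if h and not a)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return {
        "precision": precision,
        "precision_ci": wilson_interval(tp, tp + fp),
        "recall": recall,
        "recall_ci": wilson_interval(tp, tp + fn),
    }


if __name__ == "__main__":
    human = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical human judgments
    auto = [1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical classifier flags
    print(proxy_fidelity(human, auto))
```

With a validation set this small the intervals are very wide, which is itself the point: a proxy should not gate an experiment until the interval is tight enough for the decision at hand.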

Pitfall: Ignoring annotator heterogeneity.
Reporting only mean ratings hides systematic annotator bias; model annotator effects or build consensus labels, and report agreement metrics so that confidence intervals reflect the true label noise.
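
As a concrete agreement metric, Cohen's kappa for two annotators corrects raw percent agreement for the agreement expected by chance. The sketch below is a minimal pure-Python version; the function name and example labels are hypothetical, and for more than two annotators or missing ratings a measure such as Krippendorff's α would be the better fit.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both annotators used one label; kappa undefined
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical binary safety labels from two annotators on ten prompts.
    ann_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    ann_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
    print(round(cohens_kappa(ann_a, ann_b), 3))
```

In this example the annotators agree on 80% of items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.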

Pitfall: Presenting a single composite score without breakdown.
A composite utility can hide critical regressions on guardrail dimensions; accompany any composite with per-metric breakdowns and predefined safety triggers.
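
A small sketch of how a composite can mask a guardrail regression, and of reporting the breakdown alongside it. Everything here is illustrative: the function name, metric names, weights, and thresholds are hypothetical, and all metrics are assumed to be "higher is better".

```python
def launch_report(baseline, candidate, weights, guardrails):
    """Report a weighted composite delta together with per-metric deltas
    and any guardrail breaches.

    baseline, candidate: dict of metric name -> value (higher is better).
    weights: dict of metric name -> weight in the composite.
    guardrails: dict of metric name -> maximum allowed absolute drop.
    """
    deltas = {m: candidate[m] - baseline[m] for m in weights}
    composite_delta = sum(weights[m] * deltas[m] for m in weights)
    breaches = [m for m, max_drop in guardrails.items()
                if baseline[m] - candidate[m] > max_drop]
    return {
        "composite_delta": composite_delta,
        "per_metric_delta": deltas,
        "guardrail_breaches": breaches,
        # Ship only if the composite improves AND no guardrail is breached.
        "ship": composite_delta > 0 and not breaches,
    }


if __name__ == "__main__":
    report = launch_report(
        baseline={"helpfulness": 0.70, "safety": 0.99},
        candidate={"helpfulness": 0.80, "safety": 0.96},
        weights={"helpfulness": 0.7, "safety": 0.3},
        guardrails={"safety": 0.01},
    )
    print(report)
```

In this made-up case the composite improves even though safety drops by three points, so the composite alone would have recommended a launch that the guardrail blocks.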

Connections

This area links directly to A/B testing design (sequential testing, power), human-labeling workflows (annotation interface, qualification, calibration), and causal inference for deployment decisions (confounding control, stratified randomization). Interviewers may pivot to these topics to probe depth.