LLM Product Experimentation

What's being tested

Interviewers are probing your ability to design and analyze controlled experiments that measure LLM product changes (prompt edits, model versions, UI signals) while balancing statistical rigour, labeling constraints, and safety guardrails. Expect to justify the unit of randomization, choose a clear primary metric, compute sample size/power, and design monitoring that catches regressions and biases. Anthropic cares because small changes in an LLM pipeline can produce subtle quality or safety regressions that require careful causal inference to detect and act on.

Core knowledge

A/B test / randomized controlled trial basics: randomize at the correct unit (user-level vs request-level), run balance checks on pre-treatment covariates, and prefer intention-to-treat analysis to avoid post-randomization bias.
Unit of analysis vs unit of randomization: if users issue many queries, use user-level assignment so that correlated outcomes stay within one arm; account for intra-class correlation (ICC) in sample-size formulas, for example by inflating $n$ by the design effect $1 + (m-1)\rho$ for $m$ observations per user and ICC $\rho$.
Sample size / power formulas (per arm, two-sided test): for a continuous outcome $n = 2 \left(\frac{Z_{1-\alpha/2}+Z_{1-\beta}}{\delta/\sigma}\right)^2$ and for a proportion $n = 2 (Z_{1-\alpha/2}+Z_{1-\beta})^2 \frac{\bar p(1-\bar p)}{\delta^2}$, where $\delta$ is the minimum detectable effect (MDE), $\sigma$ is the outcome standard deviation, and $\bar p$ is the average of the two arms' proportions.
Paired vs unpaired tests: exploit paired designs (the same prompt answered by both variants and rated side by side) to reduce variance when feasible; use a paired t-test or mixed-effects model for paired data and a two-sample t-test for independent samples.
Primary metric design: choose one clear primary metric (e.g., human-rated helpfulness or task success) plus pre-specified guardrails (e.g., safety incidents, latency, cost-per-response).
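Human labeling strategy: collect pairwise preference judgments and fit them with a Bradley–Terry model (or Plackett–Luce for full rankings), since comparisons are usually more sensitive than absolute ratings at small sample sizes; calibrate a learned reward model to scale, but validate it periodically with fresh human labels.
Multiple comparisons & sequential testing: use Bonferroni/Holm for a few comparisons, Benjamini–Hochberg for FDR control, and alpha-spending boundaries (e.g., O’Brien–Fleming) or pre-specified Bayesian stopping rules if you intend to look at results before the planned end.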
Heterogeneity & segmentation: pre-specify subgroups (device, geography, power-users) and treat subgroup analyses as exploratory unless pre-registered; consider hierarchical models for shrinkage.
Safety and bias monitoring: instrument safety signals (toxicity, hallucination rate) as guardrails; trigger automatic rollback when thresholds are exceeded; sample flagged outputs for human review.
Offline proxies and their limits: metrics like log-prob, perplexity, or automated classifiers are useful for triage but can correlate poorly with human preference; quantify the correlation before trusting proxies.
Causal diagnostics: run A/A tests, test for differential attrition, check randomization fidelity, and look for violations of SUTVA (no interference between units); if interference exists, consider clustered randomization.
Analysis pipeline: pre-specify analysis plan, use robust SEs for clustered data, report effect sizes with confidence intervals, and include Bayesian credible intervals where appropriate.
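
The sample-size formulas and the ICC adjustment above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the example inputs (MDE of 0.1, $\sigma = 1$, 10 rated responses per user, ICC of 0.2) are assumptions chosen for illustration, not values from any real experiment.

```python
import math
from statistics import NormalDist


def z_sum(alpha: float = 0.05, power: float = 0.80) -> float:
    """Return Z_{1-alpha/2} + Z_{1-beta} for a two-sided test."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_arm_continuous(delta: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm n for a continuous outcome: 2 * ((Z + Z) / (delta / sigma))^2."""
    return 2 * (z_sum(alpha, power) / (delta / sigma)) ** 2


def n_per_arm_proportion(p_bar: float, delta: float,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-arm n for a proportion: 2 * (Z + Z)^2 * p(1 - p) / delta^2."""
    return 2 * z_sum(alpha, power) ** 2 * p_bar * (1 - p_bar) / delta ** 2


def design_effect(obs_per_cluster: float, icc: float) -> float:
    """Variance inflation for clustered data: 1 + (m - 1) * rho."""
    return 1 + (obs_per_cluster - 1) * icc


if __name__ == "__main__":
    # Hypothetical inputs: MDE 0.1 rating points, sigma 1.0 from a pilot.
    n_independent = n_per_arm_continuous(delta=0.1, sigma=1.0)
    # Hypothetical clustering: 10 rated responses per user, ICC 0.2.
    deff = design_effect(obs_per_cluster=10, icc=0.2)
    n_clustered = n_independent * deff
    print("responses per arm if independent:", math.ceil(n_independent))
    print("design effect:", round(deff, 2))
    print("responses per arm after inflation:", math.ceil(n_clustered))
    print("users per arm at 10 responses each:", math.ceil(n_clustered / 10))
```

The point of the design effect is that the number of users, not the number of responses, drives precision once outcomes within a user are correlated.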

Worked example

Design an online A/B test comparing two prompts for “helpfulness” while controlling cost and safety. First 30 seconds: clarify the primary metric (human-rated helpfulness on a 1–5 scale), the unit of randomization (user-level, to avoid correlated outcomes), the target MDE, acceptable Type I/II error rates, and guardrail thresholds for safety incidents. Skeleton: (1) Randomize users 50/50, log exposures and downstream behavior; (2) Collect human ratings on a stratified random sample of responses, plus automated proxies for all responses; (3) Compute sample size using the continuous formula with $\sigma$ estimated from a pilot, inflating for ICC if users contribute several rated responses; (4) Monitor guardrails and apply pre-registered stopping rules. Key tradeoff: query-level randomization yields more independent units and therefore more power for short-term effects, whereas user-level randomization avoids contamination when one user would otherwise see both prompts; choose user-level if users typically issue many queries. Close by stating next steps: run an A/A test for two weeks, validate proxy correlation with human labels, and plan subgroup analyses with hierarchical models; if time allows, also instrument an exploration band for online preference learning.
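
One simple way to respect user-level randomization at analysis time is to collapse each user's ratings to a single mean and compare arms on those per-user means. The sketch below assumes a hypothetical log format of `(user_id, arm, rating)` tuples and uses a normal approximation for the confidence interval; with few users per arm a t-based interval or a mixed-effects model would be more appropriate.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, variance


def user_level_effect(rows, alpha: float = 0.05):
    """Estimate treatment minus control on per-user mean ratings.

    rows: iterable of (user_id, arm, rating), arm in {"control", "treatment"}.
    Returns (difference, ci_low, ci_high) using a normal approximation.
    """
    ratings_by_user = defaultdict(list)
    arm_of_user = {}
    for user_id, arm, rating in rows:
        ratings_by_user[user_id].append(rating)
        arm_of_user[user_id] = arm  # assignment is fixed per user

    # One observation per user, so correlated ratings do not inflate n.
    user_means = {"control": [], "treatment": []}
    for user_id, ratings in ratings_by_user.items():
        user_means[arm_of_user[user_id]].append(mean(ratings))

    control, treatment = user_means["control"], user_means["treatment"]
    diff = mean(treatment) - mean(control)
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Tiny made-up log, for illustration only.
    log = [
        ("u1", "control", 3), ("u1", "control", 3), ("u2", "control", 4),
        ("u3", "control", 2), ("u3", "control", 4),
        ("u4", "treatment", 4), ("u4", "treatment", 4), ("u5", "treatment", 5),
        ("u6", "treatment", 4),
    ]
    print(user_level_effect(log))
```

Because every user is analyzed in the arm they were assigned to, this is also an intention-to-treat estimate.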

A second angle

Evaluate a new model checkpoint that reduces token costs but might slightly degrade factuality. The same experimentation principles apply but emphasize different constraints: the primary metric could be a composite (weighted factuality loss + cost-per-token), and safety guardrails must be stricter (e.g., maximum acceptable hallucination rate). Use stratified sampling to oversample queries likely to hallucinate (long-form answers, niche topics), and run a stepped rollout (canary → 5% → 25% → 100%) with sequential testing. Incorporate a small offline human evaluation on edge-case prompts and calibrate an automated factuality classifier; do not rely solely on offline log-probability or perplexity as proxies.
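
To make "sequential testing" concrete for a stepped rollout, the sketch below computes how much Type I error an O'Brien–Fleming-type alpha-spending function (the Lan–DeMets form, $\alpha^*(t) = 2 - 2\Phi(z_{1-\alpha/2}/\sqrt{t})$) releases at each interim look. The information fractions are hypothetical and are not the same thing as the rollout percentages; they stand for the share of the planned sample observed at each look. Turning spent alpha into exact critical values requires numerical integration over the joint distribution of the test statistics, which this sketch does not attempt.

```python
import math
from statistics import NormalDist


def obf_alpha_spent(info_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent at a given information fraction.

    Uses the Lan-DeMets O'Brien-Fleming-type spending function.
    """
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    return 2 * (1 - nd.cdf(z / math.sqrt(info_fraction)))


def spending_schedule(info_fractions, alpha: float = 0.05):
    """Return (fraction, cumulative alpha, incremental alpha) per look."""
    schedule = []
    previous = 0.0
    for t in info_fractions:
        cumulative = obf_alpha_spent(t, alpha)
        schedule.append((t, cumulative, cumulative - previous))
        previous = cumulative
    return schedule


if __name__ == "__main__":
    # Hypothetical looks at 25%, 50%, and 100% of the planned sample.
    for t, cumulative, increment in spending_schedule([0.25, 0.5, 1.0]):
        print(f"t={t:.2f} cumulative={cumulative:.5f} increment={increment:.5f}")
```

The schedule spends very little alpha early, so an early stop for benefit needs overwhelming evidence, while guardrail breaches can still trigger rollback through their own pre-specified thresholds.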

Common pitfalls

Pitfall: Ignoring correlated observations (many queries per user) and calculating sample size as if data are independent — leads to underpowered experiments and misleading p-values. Always adjust for ICC or randomize at the clustered unit.

Pitfall: Failing to pre-specify a single primary metric and thresholds; instead listing many ad hoc metrics. This looks like p-hacking and makes it unclear what “success” means — declare the primary metric and guardrails up front.
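
If secondary metrics are reported alongside the primary one, a multiplicity correction keeps them from being read as independent wins. Below is a minimal sketch of Holm's step-down adjustment; treating the secondary metrics as one family and the example p-values are assumptions for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Made-up p-values for three hypothetical secondary metrics.
    secondary = {"latency": 0.01, "cost": 0.04, "retention": 0.03}
    for name, p_adj in zip(secondary, holm_adjust(list(secondary.values()))):
        print(name, round(p_adj, 4))
```

A metric is then declared significant only if its adjusted p-value falls below the pre-specified alpha.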

Pitfall: Overtrusting automated proxies (log-prob, perplexity, heuristic classifiers) without validating correlation to human preference; this causes silent regressions. Always validate proxies on fresh human-labeled datasets and periodically re-evaluate.
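
Validating a proxy can start with something as simple as a rank correlation between proxy scores and human ratings on the same responses. The sketch below implements Spearman's rank correlation with the standard library; the score lists are made up, and what counts as an acceptable correlation is a judgment call that should be fixed before looking at the data.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up proxy scores and human 1-5 ratings for the same six responses.
    proxy_scores = [0.91, 0.40, 0.75, 0.62, 0.15, 0.88]
    human_ratings = [5, 2, 4, 4, 1, 3]
    print(round(spearman(proxy_scores, human_ratings), 3))
```

Re-running the same check on each fresh batch of human labels gives a cheap signal of proxy drift over time.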

Connections

This topic commonly connects to causal inference (instrumental variables, uplift modeling) when A/B testing is infeasible, and to model evaluation/ML metrics (reward-model calibration, offline vs online parity) for scalable labeling. Interviewers may pivot to sequential experimentation or to optimizing multi-objective metrics (cost vs quality).