LLM Evaluation, Offline Metrics, Online Monitoring, and Regression Testing

What's being tested

Interviewers probe whether an engineer can design and operationalize a reliable evaluation and monitoring stack for large language models across pre-deploy (offline) and post-deploy (online) stages. They want to see technical judgment about which metrics and tests catch regressions, how to compute and surface them in production, and how to trade off sensitivity, cost, and user impact when automating rollbacks or alerts. At OpenAI, this shows you can keep model behavior safe, performant, and observable across releases without owning data-platform plumbing.

Core knowledge

Offline evaluation vs online monitoring: offline evaluation runs controlled held-out evaluations and deterministic regression suites; online monitoring uses telemetry from live traffic (latency, error rate, content signals) for SLOs, alerts, and canary decisions. Each has different noise, bias, and coverage tradeoffs.
Token-level and generation metrics: know perplexity, next-token accuracy, and log-likelihood for training/validation; generation metrics like BLEU/ROUGE are weak for open-ended text, so use them only when the task is constrained.
Embedding and semantic metrics: compute cosine similarity of sentence embeddings or use BERTScore; note that embedding drift can reflect prompt-distribution shift rather than model degradation.
Calibration and reliability: measure Expected Calibration Error (ECE) over confidence bins: $\text{ECE}=\sum_k \frac{n_k}{N} | \text{acc}(k)-\text{conf}(k)|$, where $n_k$ is the number of samples in bin $k$, $N$ is the total sample count, and $\text{acc}(k)$ and $\text{conf}(k)$ are the bin's accuracy and mean confidence. Use it to catch over- or under-confidence in the model's uncertainty estimates.
Drift detection: use Population Stability Index (PSI) or KL divergence between input-feature histograms or logit distributions to flag distribution shift. A common rule of thumb treats PSI > 0.25 as a significant shift, though the right threshold depends on the feature.
Regression testing hygiene: maintain a deterministic test suite of prompts (unit/regression), fixed random seeds, and saved golden outputs or invariants (e.g., the output must not contain forbidden phrases). Automate these in CI tied to model snapshots.
Online SLOs & tail metrics: instrument p50/p95/p99 latency, request success rate, and user-facing quality proxies (e.g., the rate of responses flagged by an automated toxicity classifier). Tie alerts to sustained breaches over windows (e.g., 5-minute and 24-hour).
Canary and rollout strategies: use small-percentage canaries, shadow testing (real traffic mirrored without affecting users), and progressive rollout with automated rollback when correlated signals exceed thresholds. Balance sample size vs detection speed.
Statistical significance for comparisons: for rates use binomial tests or Wilson intervals; correct for multiple comparisons (Bonferroni or BH) when monitoring many metrics across features/segments.
Sampling and privacy constraints: choose a sampling policy that preserves signal while respecting privacy (e.g., hash-based sampling, removing PII before storage); ensure reproducibility by saving input hashes and model snapshot ids.
Alert design and debouncing: tune thresholds to manage precision/recall of alerts; implement debouncing windows and escalation tiers to reduce noisy automated rollbacks.
Explainability for incidents: collect contextual traces (prompt, model version, logits, embedding vectors, runtime config) so an engineer can reproduce and triage quickly; bound trace size to keep storage cost under control.
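
Three of the statistics above (ECE, PSI, and the Wilson interval) are short enough to sketch directly. The following is a minimal illustration using only the Python standard library, not a production implementation. The function names, the equal-width binning, and the `eps` guard for empty bins are assumptions made for this sketch.

```python
import math
from typing import Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE = sum_k (n_k / N) * |acc(k) - conf(k)| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(acc - avg_conf)
    return ece


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               eps: float = 1e-6) -> float:
    """PSI = sum_i (a_i - e_i) * ln(a_i / e_i) over matching histogram bins.

    Inputs are per-bin proportions; eps (an assumption) guards against empty bins.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial rate (z = 1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For example, `population_stability_index([0.5, 0.5], [0.9, 0.1])` comes out well above the 0.25 rule of thumb, while identical histograms give exactly 0. In an interview it is worth saying that the binning scheme materially affects both ECE and PSI, so thresholds should be calibrated per feature.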

Worked example — "Design evaluation and monitoring for a deployed LLM assistant"

Start by clarifying scope: what are the critical user journeys, privacy constraints, and SLOs (latency, availability, quality) for the assistant? Organize the solution into three pillars: (1) Offline safeguards: a CI regression suite with deterministic prompts, task-specific metrics (e.g., instruction-following accuracy), and human-eval sampling; (2) Pre-deploy checks: an automated canary run on a slice of production traffic plus shadow testing; (3) Online monitoring and rollback: instrumented SLOs (p99 latency, error rate), content safety proxies (toxicity classifier flag rate), and drift detectors (PSI on input tokens). A key tradeoff to call out is automated rollback sensitivity: tighter thresholds catch regressions faster but increase false rollbacks. Tune by using short-window alerts to notify ops and longer-window sustained breaches to trigger rollbacks. Close by noting future work: with more time, add automatic root-cause correlation (attributing regressions to prompt types, temperature changes, or dataset shift) and a labeled human-eval pipeline to triage borderline automated alerts.
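
The short-window versus long-window tradeoff can be made concrete with a small sketch. The code below assumes a stream of per-interval metric values (for instance p99 latency in seconds) and two detectors over the same stream: a short one that pages ops and a longer one that triggers rollback. The class name, the `decide` helper, and the window sizes are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List


class SustainedBreachDetector:
    """Fires when the fraction of breaching samples in the last `window`
    observations is at least `min_fraction`."""

    def __init__(self, threshold: float, window: int, min_fraction: float = 1.0):
        self.threshold = threshold
        self.window = window
        self.min_fraction = min_fraction
        self._breaches: Deque[bool] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        self._breaches.append(value > self.threshold)
        if len(self._breaches) < self.window:
            return False  # not enough history yet, so stay quiet
        return sum(self._breaches) / self.window >= self.min_fraction


def decide(metric_stream: Iterable[float],
           page_detector: SustainedBreachDetector,
           rollback_detector: SustainedBreachDetector) -> List[str]:
    """Return one action per sample: 'ok', 'page', or 'rollback'."""
    actions = []
    for value in metric_stream:
        page = page_detector.observe(value)
        rollback = rollback_detector.observe(value)
        actions.append("rollback" if rollback else "page" if page else "ok")
    return actions


if __name__ == "__main__":
    stream = [1.0, 2.5, 2.5, 2.5, 1.0, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
    print(decide(stream,
                 SustainedBreachDetector(threshold=2.0, window=3),
                 SustainedBreachDetector(threshold=2.0, window=6)))
```

With this stream, the short detector pages after three consecutive breaches, the single healthy sample resets both windows, and rollback fires only once six consecutive samples have breached. Lowering `min_fraction` below 1.0 trades some false-positive protection for faster detection.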

A second angle — "Detecting and investigating model regressions post-deployment"

Here the constraints shift: sparse labeled signals, noisy proxies, and the need for fast mitigation. Focus on prioritization: keep a small set of high-leverage deterministic regression prompts that exercise safety-critical behaviors, and maintain automated "unit tests" that must pass on every release. For noisy user-signal metrics (e.g., decreased engagement), correlate with controlled signals: changes in toxicity rate, embedding drift, or changes in top-k token probability distributions. Emphasize reproducibility: capture the exact model snapshot, prompt, seed, and runtime config so suspicious requests can be replayed offline. Finally, use progressive rollback or targeted hotfixes (e.g., prompt engineering or constraint filters) while the model owner works on a full fix.
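
A per-release "unit test" suite of this kind might look like the sketch below. It assumes the model is reachable through a deterministic `generate(prompt, seed)` callable, which is a hypothetical interface; `fake_generate` is a stand-in so the example runs without a real model. Each case checks an optional golden-output hash and a list of forbidden-pattern invariants.

```python
import hashlib
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RegressionCase:
    name: str
    prompt: str
    seed: int
    golden_sha256: Optional[str] = None  # exact-match check when set
    forbidden_patterns: List[str] = field(default_factory=list)  # invariants


def run_suite(generate: Callable[[str, int], str],
              cases: List[RegressionCase]) -> List[str]:
    """Run every case and return human-readable failure messages."""
    failures = []
    for case in cases:
        output = generate(case.prompt, case.seed)
        if case.golden_sha256 is not None:
            digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
            if digest != case.golden_sha256:
                failures.append(f"{case.name}: output differs from golden")
        for pattern in case.forbidden_patterns:
            if re.search(pattern, output, flags=re.IGNORECASE):
                failures.append(f"{case.name}: matched forbidden pattern {pattern!r}")
    return failures


def fake_generate(prompt: str, seed: int) -> str:
    # Stand-in for a real, seeded model call (assumption for this sketch).
    return f"echo[{seed}]: {prompt}"


if __name__ == "__main__":
    golden = hashlib.sha256(fake_generate("hello", 0).encode("utf-8")).hexdigest()
    cases = [
        RegressionCase("greeting", "hello", 0, golden_sha256=golden),
        RegressionCase("safety", "tell me a secret", 0,
                       forbidden_patterns=[r"password"]),
    ]
    print(run_suite(fake_generate, cases) or "all regression cases passed")
```

Exact golden matching only works when decoding is fully deterministic. Where it is not, invariant checks such as forbidden patterns are usually the more robust choice.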

Common pitfalls

Pitfall: Over-relying on a single metric like perplexity — perplexity can improve while user-facing safety or instruction-following degrades; always pair with task-specific and safety proxies.

Pitfall: Treating offline test pass as sufficient — offline evaluation rarely captures real-world prompt diversity and distribution shift, so lack of post-deploy monitoring means regressions go unnoticed.

Pitfall: No reproducible traces — without model snapshot ids, seeds, and sanitized prompts, investigations take far longer; always store minimal replayable context with telemetry.
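
One way to picture "minimal replayable context" is a small, bounded record stored alongside telemetry. The sketch below is illustrative only: the field names, the 2000-character cap, and the email-only `sanitize` function are assumptions, and a real PII scrubber would need to be far more thorough.

```python
import hashlib
import json
import re
from dataclasses import asdict, dataclass

# Toy pattern for illustration; real PII removal covers much more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def sanitize(prompt: str) -> str:
    """Redact email addresses before the prompt is stored."""
    return EMAIL_RE.sub("[EMAIL]", prompt)


@dataclass(frozen=True)
class ReplayTrace:
    model_snapshot_id: str
    seed: int
    temperature: float
    sanitized_prompt: str
    prompt_sha256: str  # hash of the sanitized, truncated prompt


def make_trace(model_snapshot_id: str, seed: int, temperature: float,
               raw_prompt: str, max_chars: int = 2000) -> ReplayTrace:
    """Build a bounded-size trace; the raw prompt is never stored."""
    clean = sanitize(raw_prompt)[:max_chars]
    digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
    return ReplayTrace(model_snapshot_id, seed, temperature, clean, digest)


def to_json(trace: ReplayTrace) -> str:
    """Serialize with sorted keys so identical traces produce identical strings."""
    return json.dumps(asdict(trace), sort_keys=True)


if __name__ == "__main__":
    trace = make_trace("snapshot-001", seed=7, temperature=0.0,
                       raw_prompt="mail me at a@b.com please")
    print(to_json(trace))
```

Hashing the sanitized prompt rather than the raw one is a deliberate choice here: it lets duplicate requests be grouped without retaining a fingerprint of the unredacted text.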

Tip: When multiple alerts fire, prioritize signals with high correlation to user-impact metrics and deterministic tests before issuing rollbacks.

Connections

These topics commonly pivot to A/B experimentation and metrics instrumentation (experiment design), feature-store drift detection, and model interpretability/attribution for root-cause analysis. An interviewer may ask you to design an experiment or explain how monitoring outputs feed into CI/CD.