LLM Evaluation: Faithfulness, Hallucination, And Human Review For Meta AI

What's being tested

Interviewers are probing whether you can turn fuzzy quality concerns for Meta AI into measurable evaluation metrics, statistically defensible comparisons, and actionable launch decisions. The core skill is separating faithfulness (whether an answer is supported by the available context) from general answer quality, safety, helpfulness, and user satisfaction. Meta cares about this in a Data Scientist because generative AI failures are often rare, high-severity, segment-specific, and poorly captured by aggregate engagement metrics. A strong answer shows you can combine offline evals, human review, model-assisted judging, and online experimentation without overclaiming causality.

Core knowledge

Faithfulness means the response is grounded in the provided source, retrieval context, tool output, or conversation state. A faithful answer can still be unhelpful; an unfaithful answer can sound fluent. Keep it distinct from factuality, which asks whether a claim is true in the real world.
Hallucination rate should usually be claim-level, not answer-level. If an answer contains 10 factual claims and 1 of them is unsupported, an answer-level label flags the whole answer and hides that the other 9 claims were supported. A useful metric is:
$\text{Unsupported Claim Rate}=\frac{\#\text{unsupported factual claims}}{\#\text{factual claims reviewed}}$
Grounded QA evaluation often uses a three-way taxonomy: supported, unsupported, and not verifiable from context. This prevents penalizing the model for reasonable uncertainty and helps distinguish retrieval gaps from generation errors.
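Human review needs clear annotation rubrics, calibrated examples, and inter-rater reliability checks. Track Cohen’s kappa (two raters) or Krippendorff’s alpha (any number of raters, tolerates missing labels) for label consistency; raw agreement can be misleading when most answers are non-hallucinated, because reviewers then agree on the majority class largely by chance.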
Sampling strategy matters more than sample size alone. Use stratified samples by language, market, device, prompt category, answer length, retrieval usage, sensitive domains, and model version. Rare but important segments should be oversampled, then reweighted to estimate population-level rates.
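Confidence intervals are critical when hallucinations are rare. For a binary answer-level hallucination metric, use a binomial interval or bootstrap. If the observed rate is $\hat p$ over $n$ reviewed answers, the approximate standard error is $\sqrt{\hat p(1-\hat p)/n}$, but exact or Wilson intervals are safer for small counts. For claim-level rates, claims from the same answer are correlated, so resample whole answers rather than individual claims when bootstrapping.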
Severity-weighted metrics often beat simple rates. A fabricated restaurant recommendation and incorrect medical advice should not count equally. Define severity tiers such as minor, major, and critical, then report both frequency and weighted harm:
$\text{Weighted Hallucination Score}=\sum_s w_s \cdot \text{rate}_s$
LLM-as-a-judge can scale evaluation but should not be treated as ground truth. Validate judge labels against human labels, report precision/recall by segment, and monitor judge drift when prompts, policies, or model families change. Use it for triage or trend detection, not final high-stakes launch decisions.
Offline eval sets should include natural traffic samples, adversarial prompts, multilingual cases, long-context cases, and high-risk domains. Static benchmarks are useful for regression testing, but they can become stale or overfit if repeatedly optimized against.
Online metrics are indirect. User satisfaction, thumbs-down rate, conversation abandonment, regeneration rate, and follow-up correction prompts can signal quality issues, but they confound faithfulness with tone, latency, relevance, and user expectations. Pair behavioral metrics with reviewed samples.
Experiment design should specify the unit of randomization, exposure logging, guardrail metrics, and minimum detectable effect. For example, randomize users or conversations, compare hallucination rates from reviewed samples, and monitor customer satisfaction (CSAT), report rate, latency, and sensitive-domain failure rates.
Root-cause analysis should stay at the metric layer: segment by prompt intent, source availability, answer length, retrieval status, language, and recency of facts. As a DS, you identify whether errors concentrate in certain cohorts or contexts; you do not need to design retrieval infrastructure or model serving changes.
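The unsupported claim rate, its Wilson interval, and the severity-weighted score described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the review counts, the tier names, and the weights (1, 3, 10) are hypothetical, and the interval treats claims as independent, which understates uncertainty when several claims come from the same answer.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k events in n trials (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def weighted_hallucination_score(counts_by_severity, weights, n_claims):
    """Sum over severity tiers of weight * (tier count / claims reviewed)."""
    return sum(
        weights[tier] * count / n_claims
        for tier, count in counts_by_severity.items()
    )

# Hypothetical review batch: 200 factual claims, 6 labeled unsupported.
n_claims = 200
unsupported_by_severity = {"minor": 3, "major": 2, "critical": 1}
severity_weights = {"minor": 1, "major": 3, "critical": 10}  # assumed weights

n_unsupported = sum(unsupported_by_severity.values())
rate = n_unsupported / n_claims
low, high = wilson_interval(n_unsupported, n_claims)
score = weighted_hallucination_score(
    unsupported_by_severity, severity_weights, n_claims
)

print(f"unsupported claim rate: {rate:.3f} (95% Wilson: {low:.3f} to {high:.3f})")
print(f"severity-weighted score: {score:.3f}")
```

With only 6 events the interval is wide relative to the point estimate, which is why a single rate without its interval is hard to defend in a launch review.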

Worked example

For “How would you evaluate whether a new Meta AI model reduced hallucinations?”, start by clarifying the product surface, whether answers are grounded in provided context or open-domain, and whether the goal is launch gating or ongoing monitoring. State that you would not rely only on engagement metrics because users may not recognize fluent falsehoods. Organize the answer around four pillars: metric definition, sampling and labeling, statistical comparison, and decision framework.

First, define the primary metric as answer-level or claim-level hallucination rate, with a preference for the claim-level unsupported claim rate plus a severity-weighted metric for high-risk categories. Second, create a stratified evaluation sample from real traffic and curated stress tests, then have trained reviewers label claims using a rubric like supported, unsupported, and not verifiable from context. Third, compare the old and new model using confidence intervals or a hypothesis test, ideally on paired prompts when possible to reduce variance. Fourth, set launch criteria: the new model must reduce hallucination rate overall, not regress in sensitive segments, and maintain guardrails like latency, user satisfaction, and refusal quality.
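One way to run the paired comparison in the third step is an exact McNemar test, which uses only the prompts where the two models disagree. The sketch below is one option among several (a paired bootstrap would also work). It assumes binary answer-level labels, and the 500 prompts and their counts are hypothetical.

```python
from math import comb

def exact_mcnemar(old_flags, new_flags):
    """Exact two-sided McNemar test on paired binary labels.

    old_flags[i] and new_flags[i] are True when the old / new model
    hallucinated on prompt i. Returns (old_only, new_only, p_value).
    """
    if len(old_flags) != len(new_flags):
        raise ValueError("paired inputs must have the same length")
    old_only = sum(1 for o, n in zip(old_flags, new_flags) if o and not n)
    new_only = sum(1 for o, n in zip(old_flags, new_flags) if n and not o)
    m = old_only + new_only
    if m == 0:
        return old_only, new_only, 1.0  # no discordant pairs, no evidence
    k = min(old_only, new_only)
    # Under H0, discordant pairs split 50/50: Binomial(m, 0.5) tail.
    tail = sum(comb(m, i) for i in range(k + 1)) / 2**m
    return old_only, new_only, min(1.0, 2 * tail)

# Hypothetical paired review of 500 prompts.
pairs = (
    [(True, False)] * 18      # old hallucinated, new did not
    + [(False, True)] * 6     # new hallucinated, old did not
    + [(True, True)] * 10     # both hallucinated
    + [(False, False)] * 466  # neither hallucinated
)
old_flags = [o for o, _ in pairs]
new_flags = [n for _, n in pairs]

old_only, new_only, p_value = exact_mcnemar(old_flags, new_flags)
rate_diff = (sum(new_flags) - sum(old_flags)) / len(pairs)

print(f"discordant pairs: old-only={old_only}, new-only={new_only}")
print(f"rate difference (new - old): {rate_diff:+.3f}, exact p={p_value:.4f}")
```

Pairing helps because the 476 prompts where both models behave the same contribute no noise to the comparison; only the 24 discordant prompts drive the test.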

A specific tradeoff to flag is coverage versus precision in review: claim-level annotation is more accurate but expensive, while answer-level labels scale faster but hide partial failures. A strong close would be: if given more time, I would validate an LLM-as-a-judge against human labels to scale monitoring, add severity weighting, and build a recurring audit focused on languages and high-risk intents where average metrics may mask regressions.

A second angle

For “How would you design human review for hallucinations in Meta AI?”, the same concept shifts from metric comparison to measurement system quality. Start with the annotation rubric: reviewers need source context, generated answer, and instructions for what counts as unsupported versus merely incomplete. Then address sample design, reviewer calibration, disagreement resolution, and inter-rater reliability. The key constraint is that human labels are expensive and noisy, so you would use stratified sampling and possibly model-assisted pre-screening while preserving an unbiased audit sample. The output is not just a hallucination rate; it is a trusted evaluation process with known uncertainty and documented failure modes.
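The inter-rater reliability check mentioned above can be illustrated with Cohen's kappa for two reviewers. In this minimal sketch the label names and the 100 reviewed answers are hypothetical; it shows how high raw agreement can coexist with weak chance-corrected agreement when hallucinations are rare.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        return observed, float("nan")  # both raters used one label only
    return observed, (observed - expected) / (1 - expected)

# Hypothetical double-reviewed batch of 100 answers.
pairs = (
    [("supported", "supported")] * 90
    + [("unsupported", "unsupported")] * 2
    + [("unsupported", "supported")] * 4
    + [("supported", "unsupported")] * 4
)
reviewer_a = [a for a, _ in pairs]
reviewer_b = [b for _, b in pairs]

agreement, kappa = cohens_kappa(reviewer_a, reviewer_b)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

Here the reviewers agree on 92% of answers, yet kappa is about 0.29, because most of that agreement is on the dominant "supported" label. A result like this would argue for another calibration round before trusting the hallucination rate.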

Common pitfalls

Pitfall: Treating “hallucination” as the same as “bad answer.”

A vague answer like “measure thumbs-downs and user complaints” misses the core evaluation problem. A better answer separates faithfulness, factuality, helpfulness, safety, and satisfaction, then explains which metric answers which question.

Pitfall: Reporting one aggregate hallucination rate without uncertainty or segments.

A model can improve from 4.0% to 3.6% overall while getting worse in medical prompts, non-English queries, or long-context tasks. Always mention confidence intervals, minimum detectable effect, and segment-level guardrails before recommending launch.
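The masking effect can be reproduced with a small reweighting sketch. All numbers are hypothetical: the traffic shares, review counts, and segment names are invented so that the overall rate moves from 4.0% to 3.6% while two oversampled segments regress. The segment flag compares point estimates only; in practice each segment difference would also need its own interval.

```python
# Hypothetical stratified review: traffic share, answers reviewed per
# segment, and hallucinated answers under the old and new model.
segments = {
    "general":     {"share": 0.90, "n": 500, "old": 19, "new": 16},
    "medical":     {"share": 0.04, "n": 300, "old": 18, "new": 27},
    "non_english": {"share": 0.06, "n": 300, "old": 17, "new": 18},
}

def population_rate(segments, model):
    """Reweight per-segment rates by traffic share to undo oversampling."""
    return sum(s["share"] * s[model] / s["n"] for s in segments.values())

def naive_pooled_rate(segments, model):
    """Pooled rate that ignores oversampling (shown for contrast)."""
    total = sum(s["n"] for s in segments.values())
    return sum(s[model] for s in segments.values()) / total

old_rate = population_rate(segments, "old")
new_rate = population_rate(segments, "new")
regressed = [
    name for name, s in segments.items() if s["new"] / s["n"] > s["old"] / s["n"]
]

print(f"reweighted overall: old={old_rate:.3f}, new={new_rate:.3f}")
print(f"naive pooled:       old={naive_pooled_rate(segments, 'old'):.3f}, "
      f"new={naive_pooled_rate(segments, 'new'):.3f}")
print(f"segments with a higher new-model rate: {regressed}")
```

The reweighted overall rate improves, the naive pooled rate gets worse because the oversampled segments dominate it, and both high-risk segments regress. Reporting only one of these numbers would support the wrong launch decision.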

Pitfall: Overtrusting automated judges.

Saying “use another LLM to detect hallucinations” is incomplete unless you validate it against human labels. Interviewers expect you to discuss judge bias, calibration, false negatives on subtle unsupported claims, and periodic revalidation as traffic changes.
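Validating a judge against human labels can be as simple as computing precision and recall per segment on a double-labeled sample. The sketch below assumes binary hallucination flags and treats the human label as the reference; the segment names and counts are hypothetical.

```python
from collections import defaultdict

def judge_quality_by_segment(records):
    """Precision and recall of judge flags against human flags, per segment.

    records: iterable of (segment, human_flag, judge_flag), where True
    means the answer was labeled as containing a hallucination.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for segment, human, judge in records:
        c = counts[segment]
        if judge and human:
            c["tp"] += 1
        elif judge and not human:
            c["fp"] += 1
        elif human and not judge:
            c["fn"] += 1
    result = {}
    for segment, c in counts.items():
        flagged = c["tp"] + c["fp"]  # answers the judge flagged
        actual = c["tp"] + c["fn"]   # answers humans flagged
        result[segment] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
        }
    return result

# Hypothetical double-labeled audit sample.
records = (
    [("english", True, True)] * 8
    + [("english", False, True)] * 2
    + [("english", True, False)] * 2
    + [("english", False, False)] * 88
    + [("non_english", True, True)] * 3
    + [("non_english", False, True)] * 3
    + [("non_english", True, False)] * 6
    + [("non_english", False, False)] * 88
)

quality = judge_quality_by_segment(records)
for segment, q in quality.items():
    print(f"{segment}: precision={q['precision']:.2f}, recall={q['recall']:.2f}")
```

In this invented sample the judge misses two thirds of the non-English hallucinations that humans caught, so judge-based monitoring would understate the rate exactly where risk is highest. That is the kind of segment-level false-negative pattern periodic revalidation is meant to surface.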

Connections

Interviewers may pivot from this topic into ranking evaluation, A/B testing, human annotation quality, causal inference, or trust and safety metrics. The same measurement discipline also applies to recommender quality, misinformation detection, and integrity classifiers: define the construct, build reliable labels, quantify uncertainty, and monitor heterogeneous effects.