LLM Evaluation And RAG Product Understanding

What's being tested

Interviewers are probing whether you can evaluate an LLM product as a Data Scientist, not whether you can build the model or design the serving stack. The core skill is translating a fuzzy user-facing experience into measurable offline quality, online product impact, and risk guardrails. For a RAG system, you need to separate failures caused by retrieval, generation, grounding, policy, and user intent mismatch. Meta cares because LLM products affect engagement, trust, safety, and cost at massive scale, and DS candidates must know how to decide whether a change actually improves user value.

Core knowledge

RAG evaluation has two separable layers: retrieval quality and answer quality. Retrieval asks “did we fetch the right evidence?” using Recall@k, MRR, or nDCG; generation asks “did the answer use evidence correctly?” using groundedness, completeness, fluency, and safety labels.
Retrieval recall is often the first bottleneck: if the supporting document is not in the top k, the generator cannot reliably answer. A simple query-level metric (sometimes called hit rate or success@k) is $\text{Recall@}k = \frac{\text{queries with at least one relevant doc in top } k}{\text{total queries}}$. Stratify by query type, language, freshness, and entity popularity.
Answer quality metrics should distinguish helpfulness, correctness, groundedness, and harmlessness. A response can be fluent but wrong, correct but not grounded in retrieved sources, or grounded but unhelpful because it fails the user’s actual intent.
Faithfulness or groundedness measures whether claims in the answer are supported by retrieved context. A practical annotation rubric asks raters to label each atomic claim as supported, unsupported, contradicted, or unverifiable; aggregate as unsupported-claim rate or answer-level pass rate.
LLM-as-judge can scale evaluation but introduces bias. Models such as GPT-4 or Llama-based judges may prefer verbose answers, share blind spots with the candidate model, or over-credit plausible text. Calibrate judges against human labels and report agreement, not just raw judge scores.
Human evaluation needs clear rubrics and quality control. Use stratified samples, blinded side-by-side comparisons, gold questions, and inter-rater reliability such as Cohen’s kappa or Krippendorff’s alpha. Low agreement usually means the task definition is ambiguous, not that raters are bad.
Online metrics should connect to product value: thumbs_up_rate, answer acceptance, follow-up clarification rate, session success, query reformulation rate, long-clicks, retention, or creator/business outcomes depending on surface. Treat engagement carefully because longer sessions can mean either delight or failure.
Guardrail metrics matter as much as average quality. Track hallucination rate, unsafe completion rate, sensitive-topic refusal accuracy, citation error rate, latency, cost per successful answer, and complaint/report rate. A launch can be blocked by tail-risk degradation even if mean helpfulness improves.
Experiment design should handle heterogeneous treatment effects. RAG changes often help factual, long-tail, or fresh-knowledge queries while hurting casual chat or ambiguous queries. Predefine segments by intent, language, region, query length, user tenure, and source availability.
A/B testing must avoid contaminated units. If users share conversations, see cached answers, or interact across devices, randomizing at the request level can expose one user to both arms and leak treatment effects. Prefer user-level randomization for product outcomes, with query-level offline evaluation for fast iteration.
Counterfactual logging is valuable but limited. You can compare retrieved candidates, judge alternative answers, or replay fixed query sets, but offline wins do not guarantee online wins because users adapt, ask follow-ups, abandon, or change trust behavior.
Failure diagnosis should decompose the funnel: query understanding → retrieval candidates → ranking of evidence → generation → citation/grounding → user feedback. For DS, the job is to quantify which stage explains metric movement, not to redesign the index or serving architecture.
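
The claim-level groundedness rubric described above can be rolled up into the two aggregates it mentions. The sketch below is illustrative only: the function names and sample labels are hypothetical, and it assumes that every label other than "supported" counts against the answer, which is a rubric choice rather than a fixed rule.

```python
# Labels follow the rubric in the text. Treating "unverifiable" as a failure
# is an assumption; some rubrics score it separately.
FAILING = {"unsupported", "contradicted", "unverifiable"}

def unsupported_claim_rate(answers):
    """Share of all atomic claims that are not labeled 'supported'."""
    labels = [label for claims in answers for label in claims]
    if not labels:
        return 0.0
    return sum(label in FAILING for label in labels) / len(labels)

def answer_pass_rate(answers):
    """Share of answers in which every atomic claim is 'supported'."""
    if not answers:
        return 0.0
    passed = sum(all(label == "supported" for label in claims) for claims in answers)
    return passed / len(answers)

# Hypothetical rater output: one list of claim labels per answer.
rated_answers = [
    ["supported", "supported"],
    ["supported", "unsupported", "supported"],
    ["contradicted"],
    ["supported"],
]
claim_rate = unsupported_claim_rate(rated_answers)  # 2 of 7 claims fail
pass_rate = answer_pass_rate(rated_answers)         # 2 of 4 answers pass
```

The two numbers answer different questions: the claim-level rate is more sensitive to long answers with one bad claim, while the answer-level pass rate is closer to what a user experiences.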

Worked example

How would you evaluate a RAG-based chatbot for Meta AI?

A strong candidate would start by clarifying the product goal: is the assistant answering factual questions, helping users complete tasks, summarizing content, or supporting search-like discovery? They would also ask what sources the system can retrieve from, whether citations are shown, which languages and markets matter, and what failure modes are unacceptable, such as medical misinformation or privacy leakage. The answer should be organized around four pillars: offline retrieval quality, offline answer quality, online product impact, and safety/cost guardrails.

For retrieval, they would propose Recall@k, MRR, and coverage by query segment, using human-labeled query-document relevance where possible. For generation, they would define a rubric for correctness, groundedness, completeness, and tone, combining human side-by-side evaluations with calibrated LLM judges. For online impact, they would run a user-level A/B test measuring successful sessions, answer acceptance, negative feedback, reformulation rate, and retention, while monitoring latency and unsafe responses as guardrails.
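
To make the user-level A/B comparison concrete, here is a minimal sketch of a pooled two-proportion z-test on a binary outcome such as "user accepted at least one answer." The counts are invented, the function name is hypothetical, and the test assumes one outcome per randomized user so that observations are independent.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_c, n_c, success_t, n_t):
    """Two-sided pooled z-test for a difference in proportions.

    Returns (absolute lift, z statistic, p-value).
    """
    p_c = success_c / n_c
    p_t = success_t / n_t
    pooled = (success_c + success_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, z, p_value

# Hypothetical counts: users with an accepted answer, out of users per arm.
lift, z, p_value = two_proportion_ztest(4000, 10000, 4150, 10000)
```

The same function can be run on guardrail outcomes (for example, users who reported a response), where the concern is a significant move in the wrong direction rather than a win.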

One tradeoff to flag explicitly is that increasing k may improve evidence recall but also add distracting context, higher latency, and more opportunities for the generator to cite irrelevant information. A crisp close would be: “If I had more time, I’d build a recurring eval set stratified by intent and freshness, then track whether offline judge gains predict online success by segment.”
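
One way to inform the choice of k is to sweep Recall@k over several cutoffs and see where the curve flattens. The sketch below uses the query-level Recall@k definition from Core knowledge plus MRR. The document IDs, relevance labels, and function names are made up for illustration.

```python
def recall_at_k(results, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        any(doc_id in relevant for doc_id in ranked[:k])
        for ranked, relevant in results
    )
    return hits / len(results)

def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant doc (0 if none retrieved)."""
    total = 0.0
    for ranked, relevant in results:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical eval set: (ranked retrieved doc IDs, set of relevant doc IDs).
results = [
    (["d1", "d7", "d3", "d9"], {"d3"}),        # first relevant doc at rank 3
    (["d4", "d2", "d8", "d5"], {"d4", "d5"}),  # first relevant doc at rank 1
    (["d6", "d1", "d2", "d3"], {"d9"}),        # relevant doc never retrieved
    (["d2", "d5", "d1", "d7"], {"d5"}),        # first relevant doc at rank 2
]
recall_by_k = {k: recall_at_k(results, k) for k in (1, 2, 3, 4)}
mrr = mean_reciprocal_rank(results)
```

In this toy data, recall stops improving after k = 3, so a larger k would add context and latency without adding evidence. On real data the same sweep should be read per segment.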

A second angle

How would you diagnose a drop in helpfulness for an LLM assistant after a retrieval update?

The same evaluation framework becomes a metric decomposition problem. Start by verifying whether the drop is broad or concentrated in segments such as fresh-news queries, non-English users, long queries, or users with low prior engagement. Then compare pre/post retrieval metrics like Recall@5, citation relevance, empty-retrieval rate, and retrieved-document freshness against generation metrics like groundedness and refusal rate. If retrieval relevance is stable but user helpfulness drops, the issue may be answer synthesis, citation presentation, latency, or a shift in query mix rather than retrieval itself. The DS contribution is to isolate the metric layer where degradation appears and quantify confidence, not to prescribe low-level retrieval infrastructure changes.
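
The "shift in query mix" possibility can be separated from within-segment degradation with a simple decomposition. The sketch below splits the change in an overall rate into a mix effect and per-segment rate effects. The segment names and numbers are hypothetical, and it assumes the same segments appear before and after the update.

```python
def decompose_change(pre, post):
    """Split the change in an overall rate into mix and rate effects.

    pre and post map segment -> (n_sessions, helpful_rate) and are assumed
    to share the same segment keys.
    Overall change = mix_effect + sum(rate_effect.values()).
    """
    n_pre = sum(n for n, _ in pre.values())
    n_post = sum(n for n, _ in post.values())
    mix_effect = 0.0
    rate_effect = {}
    for seg, (n0, r0) in pre.items():
        n1, r1 = post[seg]
        w0, w1 = n0 / n_pre, n1 / n_post
        mix_effect += (w1 - w0) * r0     # traffic moved between segments
        rate_effect[seg] = w1 * (r1 - r0)  # quality moved within a segment
    return mix_effect, rate_effect

# Hypothetical numbers: fresh-news traffic doubled and got worse.
pre = {"fresh_news": (2000, 0.70), "evergreen": (8000, 0.80)}
post = {"fresh_news": (4000, 0.60), "evergreen": (6000, 0.80)}
mix_effect, rate_effect = decompose_change(pre, post)
```

Here the overall helpful rate falls from 0.78 to 0.72. About 0.02 of the drop comes from traffic shifting toward the weaker segment and about 0.04 from fresh-news quality itself, which points the investigation at that segment's retrieval and generation metrics.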

Common pitfalls

Pitfall: Treating “LLM quality” as one metric.

A weak answer says “I’d measure accuracy” or “I’d use user ratings” without decomposing retrieval, grounding, safety, and product success. A better answer defines a metric tree: input coverage, evidence relevance, answer correctness, groundedness, user success, and guardrails, then explains which metrics are diagnostic versus launch-critical.

Pitfall: Over-trusting automated judges.

It is tempting to say an LLM judge can evaluate everything cheaply. Interviewers expect you to mention calibration against human labels, blinded pairwise evaluation, position bias, verbosity bias, and segment-level error analysis. A strong DS answer uses LLM judges for scale but keeps human evaluation as the source of truth for high-risk or ambiguous cases.
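
Calibration against human labels can be reported with a chance-corrected agreement statistic. Below is a minimal sketch of Cohen's kappa between a human rater and an LLM judge on the same items. The labels are fabricated for illustration and the function name is not from any particular library.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical labels on ten answers; the judge passes two that humans fail.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "pass", "pass", "fail", "fail", "pass", "pass", "pass", "pass"]
kappa = cohens_kappa(human, judge)
```

Raw agreement here is 80%, but kappa is only about 0.55 because both raters say "pass" most of the time. Every disagreement is the judge passing an answer humans failed, which is the over-crediting pattern to check for by segment.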

Pitfall: Ignoring user behavior and causal inference.

Offline eval sets are necessary but insufficient because users react to answers, ask follow-ups, abandon sessions, or lose trust over time. A launch decision should rely on an A/B test with clear primary metrics, guardrails, power assumptions, and segmentation; otherwise, you may optimize benchmark quality while hurting the actual product.
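
The "power assumptions" can be stated as a rough sample-size estimate before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline rate and minimum detectable lift are placeholder values, and the calculation assumes equal arm sizes and one binary outcome per user.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 40% baseline acceptance, detect a 1-point absolute lift.
n_small_lift = users_per_arm(0.40, 0.01)
n_large_lift = users_per_arm(0.40, 0.02)
```

Halving the detectable lift roughly quadruples the required sample, which matters when pre-registered segments each need enough users to be read on their own.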

Connections

Interviewers may pivot from this topic into experiment design, ranking evaluation, human annotation quality, causal diagnosis, or responsible AI metrics. They may also ask how to evaluate recommender systems or search products, where many of the same ideas apply: relevance labels, nDCG, user satisfaction, guardrails, and heterogeneous treatment effects.