RAG Retrieval And Search Quality Evaluation

What's being tested

Interviewers are testing whether you can evaluate an AI/search system as a product system, not just recite IR metrics like precision and recall. For a RAG product, the key question is whether retrieved context is relevant, complete, fresh, safe, and usable by the generator under latency and cost constraints. Meta cares because retrieval quality affects AI assistant usefulness, internal knowledge tools, ads/search relevance, content recommendations, and trust-sensitive surfaces where bad retrieval can produce hallucinations or policy violations. A strong Data Scientist should be able to design offline evaluation, online experiments, diagnostic slices, and metric tradeoffs that connect model behavior to user outcomes.

Core knowledge

RAG quality decomposes into at least three layers: retrieval quality, context assembly quality, and generation quality. A bad answer can come from missing the right document, retrieving the right document but truncating the relevant passage, or generating incorrectly despite good evidence.
Core retrieval metrics include Precision@k, Recall@k, Hit Rate@k, MRR, MAP, and NDCG. For binary relevance, $\text{Recall@k}=\frac{\#\text{ relevant docs in top } k}{\#\text{ total relevant docs}}$; for graded relevance, NDCG rewards putting highly useful documents near the top.
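NDCG is especially useful when labels have graded relevance:
$DCG@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)}, \quad NDCG@k=\frac{DCG@k}{IDCG@k}$
Here $rel_i$ is the graded relevance label of the document at rank $i$, and $IDCG@k$ is the $DCG@k$ of the ideal ordering, so $NDCG@k$ lies in $[0,1]$. It captures whether the best evidence appears early enough to fit into a context window.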
Recall@k is often more important than Precision@k for RAG retrieval because the generator cannot use evidence that was never retrieved, and it only needs enough correct evidence to answer. However, low precision increases context pollution, hallucination risk, latency, and token cost, especially when irrelevant chunks distract the model.
Build an offline evaluation set from real user queries, synthetic edge cases, and human-labeled query-document pairs. Labels should distinguish exact answer support, partial support, topical relevance, outdated information, unsafe content, and duplicate near-matches; binary labels are often too coarse.
RAG systems usually use multi-stage retrieval: lexical retrieval such as BM25, dense embedding retrieval with FAISS/ScaNN/HNSW, hybrid retrieval, then cross-encoder reranking. Dense retrieval improves semantic matching; BM25 handles rare entities, IDs, names, and exact terms better.
ANN indexes trade recall for latency and memory. HNSW gives strong recall-latency performance but can be memory-heavy; IVF/PQ compresses vectors for large corpora but may reduce recall. Exact nearest-neighbor search is feasible for small corpora, but approximate search becomes necessary at millions to billions of vectors.
Chunking is a first-order quality lever. Small chunks improve precision but may lose context; large chunks improve completeness but waste tokens and dilute signal. Common chunk sizes range from a few hundred to ~1,000 tokens, often with overlap to avoid cutting facts across boundaries.
Online metrics should include task success, answer helpfulness, follow-up rate, reformulation rate, abandonment, latency, cost per successful answer, thumbs up/down, and human quality ratings. Offline NDCG improvements do not always translate to user success if generator or UI bottlenecks dominate.
Use diagnostic slices: head vs tail queries, entity-heavy queries, multilingual queries, fresh-news queries, policy-sensitive queries, long queries, ambiguous queries, and no-answer queries. Aggregate metrics can hide severe failures in rare but high-risk slices.
For no-answer or insufficient-evidence cases, retrieval evaluation must include abstention quality. A system that always retrieves something may look good on Hit Rate@k but cause hallucinations; measure false-answer rate, unsupported claim rate, and how well calibrated its “I don’t know” responses are.
Human evaluation should separate relevance from faithfulness. A retrieved passage can be relevant but not sufficient; a generated answer can be fluent but unsupported. Strong rubrics ask annotators: “Does the answer cite evidence?”, “Is every factual claim supported?”, and “Is the evidence current?”
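
The metric definitions above (Recall@k, MRR, and NDCG@k with the $2^{rel_i}-1$ gain) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the document IDs, the grade scale (0 = irrelevant, 1 = partial support, 2 = exact support), and all function names are assumptions made up for the example.

```python
import math
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains: List[int], k: int) -> float:
    """DCG@k with gain 2^rel - 1 and a log2(rank + 1) discount."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """DCG@k of the ranking divided by DCG@k of the ideal ordering."""
    gains = [grades.get(doc, 0) for doc in ranked]  # unlabeled docs count as 0
    ideal = sorted(grades.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical query: the system returned four docs; labels exist for three.
ranked = ["d3", "d1", "d7", "d2"]
grades = {"d1": 2, "d2": 1, "d5": 2}  # d5 is relevant but never retrieved
relevant = {doc for doc, g in grades.items() if g > 0}

print(recall_at_k(ranked, relevant, k=3))    # 1 of 3 relevant docs in top 3
print(reciprocal_rank(ranked, relevant))     # first relevant doc is at rank 2
print(round(ndcg_at_k(ranked, grades, k=3), 3))
```

Averaging `reciprocal_rank` over a query set gives MRR. Treating unlabeled documents as grade 0 is a simplifying assumption; with sparse labels it can understate quality.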

Worked example

Evaluate retrieval quality for a RAG assistant before launch. A strong candidate would start by clarifying the product goal: is this assistant answering internal knowledge questions, public user questions, customer support queries, or open-domain questions, and what is the cost of a wrong answer? They would also ask whether the evaluation target is retrieval alone or end-to-end answer quality, because those require different labels and metrics. The answer should be organized around four pillars: offline labeled evaluation, online product metrics, error analysis by query/document slices, and launch guardrails.

For offline evaluation, they would propose a representative query set sampled from logs, with human judgments for document or passage relevance, then track Recall@k, MRR, and NDCG@k. For end-to-end quality, they would add answer helpfulness, faithfulness, citation support, and hallucination rate. They would explicitly flag the tradeoff between high Recall@k and context pollution: increasing $k$ may retrieve the needed passage but can also add irrelevant text that causes the generator to answer incorrectly or increases latency and token cost. For online validation, they would recommend an A/B test measuring user success, follow-up reformulations, negative feedback, latency, and cost per resolved query. They would close by saying that, with more time, they would build a failure taxonomy for ambiguous queries, stale documents, entity mismatches, and cases where retrieval succeeded but generation failed.
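
The Recall@k versus context-pollution tradeoff can be made concrete by sweeping $k$ and reporting recall, precision, and context tokens side by side. The sketch below assumes a tiny made-up result set; the chunk IDs, token counts, relevance labels, and the `sweep_k` helper are all hypothetical, and it assumes every query has at least one relevant chunk.

```python
from typing import Dict, List, Set, Tuple

# Per query: (chunk_id, token_count) in retrieval order. All values are made up.
RESULTS: Dict[str, List[Tuple[str, int]]] = {
    "q1": [("a", 300), ("b", 250), ("c", 400), ("d", 350)],
    "q2": [("e", 200), ("f", 500), ("g", 300), ("h", 450)],
}
RELEVANT: Dict[str, Set[str]] = {"q1": {"a", "d"}, "q2": {"g"}}


def sweep_k(
    results: Dict[str, List[Tuple[str, int]]],
    relevant: Dict[str, Set[str]],
    ks: List[int],
) -> Dict[int, Dict[str, float]]:
    """Mean Recall@k, Precision@k, and context tokens for each candidate k."""
    table: Dict[int, Dict[str, float]] = {}
    n = len(results)
    for k in ks:
        recall = precision = tokens = 0.0
        for query, ranked in results.items():
            top = ranked[:k]
            hits = sum(1 for chunk_id, _ in top if chunk_id in relevant[query])
            recall += hits / len(relevant[query])
            precision += hits / len(top)
            tokens += sum(count for _, count in top)
        table[k] = {
            "recall": recall / n,
            "precision": precision / n,
            "tokens": tokens / n,
        }
    return table


table = sweep_k(RESULTS, RELEVANT, ks=[1, 2, 4])
for k, row in table.items():
    print(k, row)
```

On this toy data, moving from k=1 to k=4 lifts mean recall from 0.25 to 1.0 while mean precision falls from 0.5 to 0.375 and mean context grows from 250 to 1,375 tokens, which is the shape of tradeoff a candidate would want to show before picking $k$.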

A second angle

Diagnose a drop in answer quality after changing the vector index. The same evaluation framework applies, but the framing shifts from launch readiness to regression diagnosis. A strong answer would compare old versus new systems on identical query sets, separating ANN recall loss from reranker or generation effects. They would inspect Recall@k, overlap of retrieved documents, latency distribution, index freshness, embedding version compatibility, and whether rare-entity queries degraded more than semantic paraphrases. The key constraint is attribution: if end-to-end answer ratings dropped, you need intermediate metrics to show whether the vector index failed to retrieve relevant chunks or whether downstream generation changed. A good candidate would also recommend canarying the index change and monitoring slice-level regressions rather than relying on average quality.
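
One way to start the attribution is to run both indexes on the same queries and compare them per slice. The sketch below computes the change in Recall@k and the Jaccard overlap of the top-k sets; the query IDs, slice names, and the `compare_indexes` helper are hypothetical, and Jaccard overlap is just one reasonable choice of overlap measure.

```python
from collections import defaultdict
from typing import Dict, List, Set


def jaccard(a: List[str], b: List[str]) -> float:
    """Set overlap between two result lists (1.0 means identical sets)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 1.0


def compare_indexes(
    old: Dict[str, List[str]],
    new: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    slices: Dict[str, str],
    k: int,
) -> Dict[str, Dict[str, float]]:
    """Per-slice mean Recall@k delta (new - old) and mean top-k overlap."""
    deltas = defaultdict(list)
    overlaps = defaultdict(list)
    for query in old:
        top_old, top_new = old[query][:k], new[query][:k]
        rel = relevant[query]  # assumed non-empty for every query
        recall_old = len(set(top_old) & rel) / len(rel)
        recall_new = len(set(top_new) & rel) / len(rel)
        deltas[slices[query]].append(recall_new - recall_old)
        overlaps[slices[query]].append(jaccard(top_old, top_new))
    return {
        name: {
            "recall_delta": sum(deltas[name]) / len(deltas[name]),
            "overlap": sum(overlaps[name]) / len(overlaps[name]),
            "n": len(deltas[name]),
        }
        for name in deltas
    }


# Made-up results from the old and new index on identical queries.
old = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n", "o"]}
new = {"q1": ["a", "b", "d"], "q2": ["p", "y", "q"], "q3": ["m", "n", "o"]}
relevant = {"q1": {"a"}, "q2": {"x"}, "q3": {"m", "o"}}
slices = {"q1": "paraphrase", "q2": "rare_entity", "q3": "paraphrase"}

report = compare_indexes(old, new, relevant, slices, k=3)
for name, row in report.items():
    print(name, row)
```

In this toy case the paraphrase slice shows no recall change while the rare-entity slice loses its only relevant chunk, a pattern the overall average would dilute. If recall and overlap are unchanged but answer ratings still dropped, the regression more likely sits in reranking or generation.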

Common pitfalls

Analytical mistake: optimizing only Precision@k. A tempting answer is “we should maximize precision so retrieved documents are relevant.” For RAG, this can be wrong if the system fails to include the one passage needed to answer; Recall@k or answer-support coverage may matter more, especially before reranking and context assembly.

Communication mistake: mixing retrieval and generation quality. Candidates often say “the retrieval is bad because the answer is hallucinated.” That may be true, but it is not proven; the right response decomposes the pipeline and asks whether relevant evidence was retrieved, whether it survived truncation, and whether the generator used it faithfully.
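
The decomposition can be expressed as a simple triage rule applied to each failed example. This is a sketch under assumptions: it presumes you log which chunks were retrieved, which survived into the prompt, which chunks annotators marked as containing the answer, and a human judgment of whether the answer was supported. The field names and stage labels are hypothetical.

```python
from collections import Counter
from typing import Set


def attribute_failure(
    retrieved: Set[str],
    in_context: Set[str],
    gold: Set[str],
    answer_supported: bool,
) -> str:
    """Assign an example to the earliest pipeline stage that lost the evidence."""
    if not (retrieved & gold):
        return "retrieval_miss"          # the evidence was never retrieved
    if not (in_context & gold):
        return "context_assembly_loss"   # retrieved, then truncated or dropped
    if not answer_supported:
        return "generation_failure"      # evidence was in the prompt but unused
    return "ok"


# Made-up logged examples.
examples = [
    {"retrieved": {"c1", "c2"}, "in_context": {"c1"}, "gold": {"c9"}, "supported": False},
    {"retrieved": {"c3", "c4"}, "in_context": {"c3"}, "gold": {"c4"}, "supported": False},
    {"retrieved": {"c5"}, "in_context": {"c5"}, "gold": {"c5"}, "supported": False},
    {"retrieved": {"c6"}, "in_context": {"c6"}, "gold": {"c6"}, "supported": True},
]

tally = Counter(
    attribute_failure(e["retrieved"], e["in_context"], e["gold"], e["supported"])
    for e in examples
)
print(tally)
```

A tally like this turns "the retrieval is bad" into a testable claim: it only holds if `retrieval_miss` accounts for most of the failures.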

Depth mistake: ignoring no-answer and freshness cases. Many answers assume every query has a correct document in the corpus. In production, users ask unanswerable, ambiguous, outdated, or policy-sensitive questions; evaluation should measure abstention, stale evidence retrieval, and unsafe retrieval, not just relevance on answerable queries.
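
Abstention quality can be scored with two complementary rates once the evaluation set includes labeled unanswerable queries. The definitions below are one possible operationalization, not a standard: false-answer rate is the share of unanswerable queries the system answered anyway, and over-abstention rate is the share of answerable queries it declined. The log format and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Each record: (answerable, abstained). Values are made up for illustration.
logs: List[Tuple[bool, bool]] = [
    (True, False),   # answerable, system answered
    (True, True),    # answerable, system abstained (over-abstention)
    (True, False),
    (False, True),   # unanswerable, system abstained (correct)
    (False, False),  # unanswerable, system answered (false answer)
    (False, False),
]


def abstention_rates(records: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """False-answer rate on unanswerable queries, over-abstention on answerable."""
    answerable = [abstained for is_answerable, abstained in records if is_answerable]
    unanswerable = [abstained for is_answerable, abstained in records if not is_answerable]
    false_answers = sum(1 for abstained in unanswerable if not abstained)
    over_abstentions = sum(1 for abstained in answerable if abstained)
    return {
        "false_answer_rate": false_answers / len(unanswerable) if unanswerable else 0.0,
        "over_abstention_rate": over_abstentions / len(answerable) if answerable else 0.0,
    }


rates = abstention_rates(logs)
print(rates)
```

Reporting both rates together matters because either one can be driven to zero trivially, by always answering or by always abstaining.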

Connections

Interviewers may pivot from retrieval evaluation into experiment design, ranking metrics, recommender-system evaluation, or LLM hallucination measurement. Expect follow-ups on A/B testing, inter-annotator agreement, counterfactual logging bias, embedding model selection, approximate nearest-neighbor indexing, and safety evaluation for generated answers.