Retrieval Evaluation for LLM RAG and Search
Asked of: Data Scientist

-
What it is
Retrieval evaluation measures how well a system finds and ranks the right context for a question, and how much that context improves the final answer. In RAG and search stacks, you typically evaluate the retrieval component on its own with offline IR metrics, then evaluate end-to-end answer quality (groundedness/faithfulness). (ibm.com)
-
Why interviewers ask about it
At companies like Meta, data scientists define metrics, gold sets, and A/B tests that drive product decisions for Search, Q&A, and assistants. Clear retrieval evaluation enables faster iteration, targeted debugging (retriever vs reranker vs generator), and defensible decisions about quality, cost, and latency.
-
Core ideas to know
- Separate evaluation: retrieval metrics for candidate ranking; generation metrics for answer quality; diagnose where failures originate. (ibm.com)
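- Retrieval metrics: Recall@k measures coverage of the relevant set, Precision@k measures how much of the top k is relevant, MRR rewards placing the first relevant hit early, and nDCG scores the whole ranking with graded relevance; all require relevance judgments per query. (ibm.com)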
- Answer-level metrics: faithfulness/groundedness and answer relevance; can be automated with reference-free frameworks like RAGAS. (arxiv.org)
- Build realistic eval sets: mix easy/hard queries, near-duplicates, and adversarial phrasing; include explicit negatives and multi-label (graded) relevance so partially relevant passages are not scored as misses. (arxiv.org)
- Online signals: clicks, dwell time, task success, correction rate; couple with cost/latency budgets and monitor drift.
- Attribution: log retrieved passages, ranks, reranker scores, and citations to tell “missed recall” from “unused context.”
- Statistical rigor: holdout queries, bootstrap CIs, and pre-registered thresholds; avoid overfitting by rotating/refreshing gold sets.
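
The retrieval metrics and bootstrap confidence intervals listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: relevance judgments are a dict mapping document id to an integer grade (0 means not relevant), nDCG uses linear gain (some setups use 2^grade - 1 instead), and all function names and document ids are made up for the example.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Dict[str, int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    positives = {doc for doc, grade in relevant.items() if grade > 0}
    if not positives:
        return 0.0
    return len(positives & set(ranked[:k])) / len(positives)


def precision_at_k(ranked: Sequence[str], relevant: Dict[str, int], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents (k > 0)."""
    hits = sum(1 for doc in ranked[:k] if relevant.get(doc, 0) > 0)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Dict[str, int]) -> float:
    """1 / rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if relevant.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Dict[str, int], k: int) -> float:
    """nDCG@k with linear gain (assumption) and a log2 position discount."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevant.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(relevant.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical query: system ranking plus graded judgments.
    ranked = ["d3", "d1", "d7", "d2"]
    judgments = {"d1": 2, "d2": 1, "d5": 1}
    print("recall@3   ", recall_at_k(ranked, judgments, 3))
    print("precision@3", precision_at_k(ranked, judgments, 3))
    print("RR         ", reciprocal_rank(ranked, judgments))
    print("nDCG@3     ", round(ndcg_at_k(ranked, judgments, 3), 3))
    # Hypothetical per-query reciprocal ranks from a holdout set.
    print("95% CI     ", bootstrap_ci([0.0, 0.5, 1.0, 1.0, 0.33, 1.0]))
```

Averaging `reciprocal_rank` over queries gives MRR. Reporting the interval alongside the mean makes it easier to judge whether a difference between two retrievers on a small gold set is larger than resampling noise.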
-
A common pitfall
Candidates conflate vector similarity with retrieval quality and report only Precision@k on synthetic questions. Interviewers probe whether you validated labels, handled partial relevance, and tested with hard negatives. Another trap is relying solely on an LLM-as-judge without spot human review, which can mask grounding errors and overstate gains. Strong answers show how offline IR metrics correlate with online outcomes and how you’d debug failures by stage. (ibm.com)
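
Debugging failures by stage can be sketched as a small triage over per-query logs. The record layout below is hypothetical: the field names, the stage labels, and the idea that citations indicate which passages the generator used are all assumptions made for illustration, and a real pipeline would need its own definition of "used".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class QueryLog:
    """Hypothetical per-query log record; all field names are illustrative."""
    query_id: str
    retrieved_ids: List[str]  # ranked candidates returned by the retriever
    context_ids: List[str]    # passages kept after reranking and truncation
    cited_ids: List[str]      # passages the generated answer cites
    gold_ids: Set[str]        # passages labeled relevant in the gold set
    answer_correct: bool      # human or spot-checked judge verdict


def classify_failure(log: QueryLog) -> str:
    """Attribute a wrong answer to the earliest stage that lost the gold passage."""
    if log.answer_correct:
        return "ok"
    if not log.gold_ids & set(log.retrieved_ids):
        return "missed_recall"              # retriever never surfaced it
    if not log.gold_ids & set(log.context_ids):
        return "dropped_before_generation"  # reranker or truncation cut it
    if not log.gold_ids & set(log.cited_ids):
        return "unused_context"             # in the prompt, but not cited
    return "generation_error"               # cited the gold passage, still wrong


def triage(logs: Iterable[QueryLog]) -> Counter:
    """Count outcomes per stage across an evaluation run."""
    return Counter(classify_failure(log) for log in logs)


if __name__ == "__main__":
    logs = [
        QueryLog("q1", ["a", "b"], ["a"], ["a"], {"a"}, True),
        QueryLog("q2", ["c", "d"], ["c"], ["c"], {"e"}, False),
        QueryLog("q3", ["f", "g"], ["f", "g"], ["g"], {"f"}, False),
    ]
    for stage, count in triage(logs).items():
        print(stage, count)
```

A breakdown dominated by `missed_recall` points at the retriever or the index, while one dominated by `unused_context` points at prompting or the generator, which is the distinction the attribution logging is meant to support.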
-
Further reading
- IBM, Result Evaluation for RAG: Metrics & Best Practices (clear retrieval vs generation metrics, pros/cons, and practical guidance) https://www.ibm.com/think/architectures/rag-cookbook/result-evaluation
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models (foundational IR metrics and multi-domain benchmark design) https://arxiv.org/abs/2104.08663
- RAGAS: Automated Evaluation of Retrieval-Augmented Generation (reference-free evaluation for faithfulness and answer relevance) https://arxiv.org/abs/2309.15217