RAG, Semantic Retrieval, And Grounding Evaluation

What's being tested

Interviewers are testing whether you can reason end-to-end about retrieval-augmented generation, not just describe embeddings or LLMs. For a Meta Data Scientist, the key skill is choosing retrieval, ranking, and evaluation methods that improve user-facing product outcomes while controlling hallucination, latency, privacy, and cost. Expect probes on whether you can separate retrieval quality, generation quality, and grounding quality, then design metrics and experiments that diagnose each layer. The interviewer is usually looking for practical judgment: what to measure offline, what to A/B test online, and how to debug failures when aggregate metrics look good.

Core knowledge

RAG combines a retriever with a generator: retrieve the top-$k$ passages or documents, inject them into the prompt, then generate an answer conditioned on that context. It is useful when knowledge changes frequently, needs attribution, or cannot fit reliably in model weights.
Semantic retrieval maps queries and documents into dense vectors using models like `Sentence-BERT`, `E5`, `DPR`, or OpenAI-style embedding models. Similarity is commonly cosine similarity:
$\cos(q,d)=\frac{q \cdot d}{\|q\|\|d\|}$
Dense retrieval handles paraphrase better than keyword search but can miss exact constraints.
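
To make the cosine formula concrete, here is a minimal sketch that scores a query against a few documents and keeps the best matches. The three-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `rank_by_cosine` are illustrative assumptions; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_q * norm_d)

def rank_by_cosine(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, for illustration only.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
top = rank_by_cosine([1.0, 0.0, 0.0], docs, k=2)
print(top)
```

This is exact brute-force scoring, which is the baseline that approximate indexes try to match at lower latency.
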
Lexical retrieval with `BM25` remains a strong baseline, especially for rare entities, IDs, names, and exact product terms. `BM25` scores depend on term frequency, inverse document frequency, and document length normalization; hybrid retrieval often beats dense-only systems in production.
Hybrid retrieval combines dense and sparse scores via weighted sums, reciprocal rank fusion, or a learned ranker. Reciprocal rank fusion is robust:
$RRF(d)=\sum_i \frac{1}{k+r_i(d)}$
where $r_i(d)$ is the rank of document $d$ from retriever $i$ and $k$ is a smoothing constant (60 is a common default), not the top-$k$ cutoff.
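
A short sketch of reciprocal rank fusion over two ranked lists may help when explaining it on a whiteboard. It assumes 1-based ranks, a smoothing constant of 60, and that a document missing from a list simply contributes nothing from that retriever. The function name and the document ids are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids. Ranks are 1-based; k is the smoothing constant."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

bm25_ranking = ["d1", "d2", "d3"]   # hypothetical sparse (lexical) results
dense_ranking = ["d3", "d1", "d4"]  # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, which is a large part of why the method is considered robust.
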
Approximate nearest neighbor search enables vector search at scale. Libraries and services such as `FAISS`, `ScaNN`, `Milvus`, and `Pinecone`, built on index structures such as `HNSW` and IVF-PQ, trade some recall for lower latency. As a rough rule of thumb, exact brute-force search is fine up to ~1M vectors; beyond ~10M–100M, use `HNSW`, IVF-PQ, or sharding.
Chunking strategy strongly affects retrieval. Small chunks improve precision but may lose context; large chunks improve context but add noise and token cost. Common starting points are 200–500 tokens with overlap of 10–20%, then tune using recall@k and answer groundedness.
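
The chunk-size and overlap trade-off can be sketched as a sliding window over tokens. This is a minimal illustration, assuming fixed-size windows and a pre-tokenized input; the function name `chunk_tokens` and its defaults (300 tokens, 15% overlap) are assumptions picked from inside the ranges above, not recommended values.

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Synthetic tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(25)]
chunks = chunk_tokens(tokens, chunk_size=10, overlap_ratio=0.2)
print([len(c) for c in chunks])
```

In practice many systems split on sentence or section boundaries rather than at fixed offsets; the fixed window is only the simplest variant to tune against recall@k.
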
Retrieval metrics include `Recall@k`, `Precision@k`, `MRR`, and `NDCG`. For RAG, `Recall@k` is often primary because the generator cannot answer from evidence it never sees.
$Recall@k=\frac{\#\text{ relevant docs in top }k}{\#\text{ relevant docs}}$
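
The retrieval metrics are simple enough to compute by hand in an interview. The sketch below assumes binary relevance labels and a single ranked list per query; the helper names and the document ids are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here when a query has no relevant docs
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that is relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(retrieved[:k]) & set(relevant)) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4", "d1"]  # hypothetical ranked results
relevant = {"d2", "d4", "d8"}               # hypothetical gold labels
print(recall_at_k(retrieved, relevant, 3), recall_at_k(retrieved, relevant, 5))
```

`MRR` is the mean of `reciprocal_rank` over queries. Note that `d8` is never retrieved, so recall cannot reach 1 at any cutoff for this query.
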
Generation metrics must distinguish fluency from correctness. `BLEU` and `ROUGE` are weak for open-ended answers. Prefer human ratings, task-specific exact match, citation support, factual consistency checks, and LLM-as-judge only after calibration against human labels.
Grounding evaluation asks whether each generated claim is supported by retrieved evidence. Useful labels are supported, contradicted, or not enough information. A groundedness metric can be claim-level:
$Groundedness=\frac{\#\text{ supported claims}}{\#\text{ total factual claims}}$
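
As a minimal sketch, claim-level groundedness is a ratio over per-claim labels. The label strings and the function name are assumptions for illustration; the labels themselves would come from human raters or a calibrated judge.

```python
from collections import Counter

# Assumed label vocabulary, mirroring the three labels described above.
LABELS = {"supported", "contradicted", "not_enough_info"}

def groundedness(claim_labels):
    """Share of factual claims labeled 'supported'; None if there are no claims."""
    unknown = set(claim_labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not claim_labels:
        return None  # an answer with no factual claims has no defined score
    counts = Counter(claim_labels)
    return counts["supported"] / len(claim_labels)

# Hypothetical labels for the atomic claims in one generated answer.
labels = ["supported", "supported", "contradicted", "not_enough_info"]
print(groundedness(labels))
```

Reporting the contradicted rate separately is often worthwhile, since a contradicted claim is usually more harmful than one that merely lacks evidence.
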
Failure modes include retrieval miss, irrelevant retrieved context, context overflow, prompt injection, stale documents, conflicting sources, and generator hallucination despite correct evidence. Debug by logging query, retrieved chunks, ranks, scores, prompt, answer, citations, and user outcome.
Online evaluation should connect system quality to product metrics like `CTR`, successful task completion, session satisfaction, hide/report rate, retention, or support deflection. Guardrails should include latency `p95`, cost per query, unsafe answer rate, privacy incidents, and low-confidence escalation rate.
Experiment design matters because improvements may be localized. Segment by query type: entity-heavy, navigational, ambiguous, safety-sensitive, cold-start, multilingual, and long-tail. A system can improve average `NDCG` while worsening high-risk or high-value segments.
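
The segmentation point can be illustrated with a small per-segment breakdown. This sketch assumes each row carries a query segment plus a control and a treatment score (for example per-query `NDCG`); the numbers are made up, and it deliberately omits significance testing.

```python
from collections import defaultdict
from statistics import mean

def deltas_by_segment(rows):
    """rows: iterable of (segment, control_score, treatment_score), one per query."""
    by_segment = defaultdict(list)
    for segment, control, treatment in rows:
        by_segment[segment].append(treatment - control)
    per_segment = {seg: mean(vals) for seg, vals in by_segment.items()}
    overall = mean(d for vals in by_segment.values() for d in vals)
    return overall, per_segment

# Hypothetical per-query scores: the average improves, one segment regresses.
rows = [
    ("navigational", 0.70, 0.80),
    ("navigational", 0.60, 0.75),
    ("navigational", 0.65, 0.75),
    ("safety_sensitive", 0.80, 0.70),
]
overall, per_segment = deltas_by_segment(rows)
print(round(overall, 4), {s: round(d, 4) for s, d in per_segment.items()})
```

Here the overall delta is positive while the safety-sensitive segment gets worse, which is the pattern a single aggregate number would hide.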

Worked example

“How would you evaluate whether a RAG answer is grounded in retrieved documents?”

A strong candidate would first clarify the product surface: is this for internal knowledge search, user-facing assistant answers, ads support, creator tools, or integrity workflows? They would also ask whether answers require citations, whether “I don’t know” is acceptable, and what the cost of a hallucination is. The answer should be organized around four pillars: define a claim-level labeling schema, build offline evaluation sets, instrument the system to isolate retrieval versus generation errors, and validate improvements with online product metrics.

For the offline setup, they would sample real queries, retrieve the top-$k$ chunks, generate answers, then break answers into atomic factual claims. Each claim can be labeled as supported, contradicted, or not enough information (unsupported) relative to the retrieved evidence, ideally with human raters for a gold set and a calibrated LLM judge for scale. They would separately report retrieval `Recall@k`, citation precision, answer correctness, refusal quality, and groundedness, because one aggregate “RAG score” hides the failure source.
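
One way to check an LLM judge against the human gold set is chance-corrected agreement. The sketch below computes Cohen's kappa; treating kappa as the calibration check is an assumption made here (the text above does not prescribe a statistic), and the label lists are invented.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical claim labels from human raters and from an LLM judge.
human = ["supported", "supported", "supported", "contradicted", "nei", "nei"]
judge = ["supported", "supported", "nei", "contradicted", "nei", "supported"]
print(round(cohens_kappa(human, judge), 3))
```

Raw agreement here is 4 of 6, but kappa is noticeably lower because much of that agreement would be expected by chance given how often both raters say "supported".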

A specific tradeoff to flag is strict groundedness versus helpfulness. If the model only repeats retrieved text, groundedness may rise while user satisfaction falls; if it synthesizes aggressively, helpfulness may rise while unsupported claims increase. A good close would be: “If I had more time, I’d segment by query risk and ambiguity, then set stricter groundedness thresholds for high-impact domains and allow more synthesis for low-risk exploratory use cases.”

A second angle

“Design semantic retrieval for a large-scale knowledge base.”

The same concepts apply, but the emphasis shifts from answer validation to retrieval architecture and ranking. A strong answer would start with corpus size, update frequency, latency budget, language coverage, and whether exact entity matching is important. For about a million documents, a single dense index (for example a flat or `HNSW` index built with `FAISS`) may be enough; for tens or hundreds of millions, the candidate should discuss sharding, quantization, hybrid `BM25` + dense retrieval, and re-ranking with a cross-encoder. The evaluation would focus first on `Recall@k`, `MRR`, and latency `p95`, then connect those to downstream RAG groundedness and user success. The key constraint difference is that retrieval systems fail silently: if relevant evidence is absent from the top-$k$, the generator may still produce a fluent but wrong answer.

Common pitfalls

Pitfall: Treating hallucination as only a generation problem.

A tempting answer is “use a better LLM” or “lower temperature.” That may help style consistency, but many hallucinations come from missing, stale, or irrelevant retrieved context; a stronger answer decomposes errors into retrieval miss, ranking failure, context construction, and generation behavior.
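
That decomposition can be sketched as a rule that assigns each logged example to one coarse bucket. The log schema (`retrieved`, `gold`, `answer_correct`), the bucket names, and the `context_size` cutoff are all hypothetical, and real triage would need finer categories such as stale or conflicting sources.

```python
from collections import Counter

def attribute_failure(example, context_size=5):
    """Assign one coarse failure bucket to a logged RAG example.

    Assumed keys: 'retrieved' (ranked doc ids), 'gold' (doc ids that contain
    supporting evidence), 'answer_correct' (bool from human or judge labels).
    """
    retrieved = example["retrieved"]
    gold = set(example["gold"])
    if not gold & set(retrieved):
        return "retrieval_miss"    # evidence was never retrieved
    if not gold & set(retrieved[:context_size]):
        return "ranking_failure"   # retrieved, but ranked below the prompt cutoff
    if not example["answer_correct"]:
        return "generation_error"  # evidence was in the prompt, answer still wrong
    return "ok"

# Hypothetical logged examples.
logs = [
    {"retrieved": ["d1", "d2", "d3"], "gold": {"d9"}, "answer_correct": False},
    {"retrieved": ["d1", "d2", "d3", "d4", "d5", "d9"], "gold": {"d9"}, "answer_correct": False},
    {"retrieved": ["d9", "d2"], "gold": {"d9"}, "answer_correct": False},
    {"retrieved": ["d9", "d2"], "gold": {"d9"}, "answer_correct": True},
]
buckets = Counter(attribute_failure(ex) for ex in logs)
print(dict(buckets))
```

The checks run in pipeline order, so an example is charged to the earliest stage that failed; only the last bucket is something a better generator could plausibly fix.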

Pitfall: Over-relying on generic text similarity metrics.

Saying “we’ll use cosine similarity and `ROUGE`” is too shallow. Cosine similarity ranks candidates but does not prove relevance, and `ROUGE` rewards lexical overlap rather than factual support; better answers include `Recall@k`, `NDCG`, human factuality labels, citation precision, and claim-level groundedness.

Pitfall: Ignoring production constraints.

A technically correct pipeline that retrieves 100 chunks, re-ranks them all with a cross-encoder, and calls a large model repeatedly may be unusable at Meta scale. Strong candidates mention latency `p95`, cost per request, caching, embedding refresh cadence, privacy boundaries, and safe fallback behavior.

Connections

Interviewers may pivot from here into ranking systems, LLM evaluation, A/B testing, or trust and safety measurement. Be ready to discuss counterfactual logging, interleaving tests for search, human annotation design, calibration of LLM-as-judge, and guardrail metrics for harmful or unsupported outputs.