LLM Retrieval, RAG, And Understanding

What's being tested

Interviewers are probing whether you can reason end-to-end about retrieval-augmented generation: turning messy corpora into useful context, measuring retrieval quality, reducing hallucination, and evaluating product impact. For a Data Scientist at Meta, the key skill is not “knowing what RAG stands for,” but being able to choose retrieval and evaluation strategies under latency, privacy, safety, and scale constraints. They want to see whether you can separate model quality from retrieval quality, design offline and online metrics, and diagnose failures systematically. Strong answers connect technical choices to user outcomes like answer helpfulness, trust, resolution rate, engagement, and safety incidents.

Core knowledge

RAG decomposes into ingestion, chunking, embedding, indexing, retrieval, reranking, prompt construction, generation, and evaluation. A useful mental model is: if the answer is wrong, determine whether the failure came from missing documents, bad ranking, poor context packing, or generation behavior.
Dense retrieval uses embedding similarity, commonly cosine similarity or dot product:
$\text{cosine}(q, d)=\frac{q \cdot d}{\lVert q\rVert \lVert d\rVert}$
Dense retrievers capture semantic similarity but may miss exact identifiers, names, URLs, policy IDs, or rare terms, where BM25 often performs better. When embeddings are L2-normalized, dot product and cosine similarity produce the same ranking.
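As a minimal sketch of the cosine formula above, the following ranks a few documents against a query. The three-dimensional vectors and document names are made up for illustration; real embeddings would come from an encoder model and have hundreds of dimensions.

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_q * norm_d)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy embeddings, not output from any real model.
docs = {
    "reset_password": [0.9, 0.1, 0.0],
    "change_email": [0.6, 0.6, 0.1],
    "ads_billing": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, docs, k=2)
```

This is exhaustive (exact) search: every document is scored, which is only practical for small corpora.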
Sparse retrieval, especially BM25, is still highly competitive for factual and keyword-heavy queries. BM25 scores documents using term frequency, inverse document frequency, and length normalization. Hybrid retrieval combines sparse and dense scores, often via weighted sum or reciprocal rank fusion:
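$RRF(d)=\sum_i \frac{1}{k + \text{rank}_i(d)}$
Here $\text{rank}_i(d)$ is the position of document $d$ in the $i$-th ranked list, and $k$ is a smoothing constant (commonly 60), unrelated to the cutoff $k$ in Recall@k.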
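A small sketch of reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs and that a document missing from a list contributes nothing for that list. The default k=60 is a commonly used value, and the two rankings are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list of (doc_id, score)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, matching the formula above.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical outputs from a sparse and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only ranks, it avoids having to put BM25 scores and embedding similarities on a common scale, which a weighted sum would require.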
For small corpora up to roughly 1M–10M vectors, exact nearest-neighbor search using FAISS IndexFlatIP or IndexFlatL2 may be feasible depending on latency and hardware. At larger scale, use approximate nearest neighbor indexes such as HNSW, IVF, or ScaNN, optionally with product quantization to compress vectors, to trade recall for latency and memory.
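To make the memory side of that tradeoff concrete, here is a back-of-the-envelope calculation. The 768-dimensional float32 embeddings and the 64-byte product-quantization code size are assumptions chosen for illustration, and the estimate ignores codebooks, graph links, and ID storage.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value

def pq_index_bytes(num_vectors, code_bytes):
    """Storage for product-quantized codes only (codebook overhead ignored)."""
    return num_vectors * code_bytes

n, dim = 10_000_000, 768  # assumed corpus size and embedding width
flat_gb = flat_index_bytes(n, dim) / 1e9
pq_gb = pq_index_bytes(n, code_bytes=64) / 1e9
print(f"flat: {flat_gb:.2f} GB, PQ codes: {pq_gb:.2f} GB")
```

Under these assumptions the uncompressed vectors need about 30.72 GB, while 64-byte codes need about 0.64 GB, which is why compression becomes attractive well before the corpus reaches billions of vectors.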
Chunking matters as much as the choice of embedding model. Common strategies include fixed token windows, semantic segmentation, document-structure-aware chunks, and overlapping windows. Smaller chunks improve retrieval precision; larger chunks preserve context. Typical chunk sizes range from 200 to 800 tokens with 10–20% overlap, but should be tuned empirically.
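The fixed-window strategy with overlap can be sketched as follows. The function name and defaults are illustrative, and the input is assumed to be an already tokenized list; in practice the tokens would come from the embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=60):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Stand-in tokens; 60 / 400 gives a 15% overlap.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=400, overlap=60)
```

For this 1000-token input the sketch yields three chunks, the last one shorter than the others. Structure-aware chunking would replace the fixed step with boundaries taken from headings or paragraphs.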
Retrieval quality should be evaluated before generation. Standard offline metrics include Recall@k, Precision@k, MRR, and NDCG:
$\text{Recall@k}=\frac{\# \text{relevant docs retrieved in top k}}{\# \text{relevant docs}}$
For RAG, Recall@k is often more important than Precision@k because the generator cannot use evidence it never sees.
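A minimal sketch of these metrics for a single query, assuming binary relevance labels and unique document IDs in the ranking. MRR is the mean of `reciprocal_rank` over a query set, and graded relevance would need a gain term in NDCG. The ranking and labels below are invented.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k positions filled by relevant docs."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical retriever output
relevant = {"d1", "d2", "d5"}             # hypothetical labels
metrics = {
    "recall@5": recall_at_k(ranked, relevant, 5),
    "precision@5": precision_at_k(ranked, relevant, 5),
    "rr": reciprocal_rank(ranked, relevant),
    "ndcg@5": ndcg_at_k(ranked, relevant, 5),
}
```

Here two of three relevant documents are retrieved, so recall@5 is 2/3 while precision@5 is 2/5; the gap illustrates why the two metrics answer different questions.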
Generation quality requires separate metrics: factuality, groundedness, answer relevance, completeness, toxicity, and refusal correctness. Exact-match or ROUGE may be weak for open-ended answers; human evaluation, LLM-as-judge with calibration, citation correctness, and claim-level entailment checks are often more useful.
Reranking is a common quality upgrade. Retrieve 50–200 candidates cheaply using BM25/dense/hybrid, then rerank the top candidates using a cross-encoder or LLM-based reranker. This improves precision but adds latency; it is often reserved for high-value queries or cached frequent queries.
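The retrieve-then-rerank shape can be sketched with two pluggable scoring functions. Both scorers below are deliberately simple stand-ins: `term_overlap` plays the role of a cheap first-stage retriever and `ordered_bigram_overlap` plays the role of a more expensive reranker. Neither is a real BM25 or cross-encoder implementation, and the corpus is invented.

```python
def two_stage_search(query, corpus, cheap_score, expensive_score,
                     n_candidates=100, n_final=5):
    """Score every doc cheaply, then rerank only the shortlisted candidates."""
    candidates = sorted(corpus, key=lambda doc_id: cheap_score(query, corpus[doc_id]),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda doc_id: expensive_score(query, corpus[doc_id]),
                      reverse=True)
    return reranked[:n_final]

def term_overlap(query, text):
    """Stand-in first stage: number of shared words, ignoring order."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def ordered_bigram_overlap(query, text):
    """Stand-in reranker: number of shared adjacent word pairs."""
    q, t = query.lower().split(), text.lower().split()
    return len(set(zip(q, q[1:])) & set(zip(t, t[1:])))

corpus = {
    "a": "how to reset your password on the login page",
    "b": "password rules and how your reset email is sent",
    "c": "manage ad billing and payment methods",
}
top = two_stage_search("reset your password", corpus,
                       term_overlap, ordered_bigram_overlap,
                       n_candidates=2, n_final=1)
```

Documents "a" and "b" tie in the first stage because both contain all three query words; only the second stage, which looks at word order, separates them. The expensive scorer runs on `n_candidates` documents rather than the whole corpus, which is where the latency saving comes from.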
Prompt construction is a constrained optimization problem. You must select context under a token budget, deduplicate near-identical chunks, preserve source metadata, and order evidence. Naively stuffing top-k chunks can degrade answers if irrelevant chunks distract the model or push critical evidence out of the window.
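One way to sketch context packing is a greedy pass over reranked chunks that drops near-duplicates and respects a token budget. The Jaccard threshold, the whitespace token count, and the chunk fields (`source`, `text`) are all assumptions made for this example; a production system would count tokens with the generator's tokenizer.

```python
def jaccard(a, b):
    """Word-set Jaccard similarity, used here as a cheap near-duplicate test."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def pack_context(chunks, token_budget, dedup_threshold=0.8):
    """Greedily keep best-first chunks that are novel and fit the budget."""
    selected, used = [], 0
    for chunk in chunks:
        n_tokens = len(chunk["text"].split())  # whitespace count as a rough proxy
        if any(jaccard(chunk["text"], kept["text"]) >= dedup_threshold for kept in selected):
            continue  # near-duplicate of something already packed
        if used + n_tokens > token_budget:
            continue  # too large; a later, smaller chunk may still fit
        selected.append(chunk)
        used += n_tokens
    return selected

def render(selected):
    """Number each chunk and keep its source title so the answer can cite it."""
    return "\n\n".join(f"[{i}] {c['source']}\n{c['text']}"
                       for i, c in enumerate(selected, start=1))

# Hypothetical reranked chunks, best first.
chunks = [
    {"source": "Reset your password", "text": "Go to Settings and tap Reset password"},
    {"source": "Reset your password (copy)", "text": "go to settings and tap reset password"},
    {"source": "Two-factor authentication",
     "text": "Turn on two-factor authentication in Security settings to protect your account"},
    {"source": "Login alerts", "text": "Enable login alerts"},
]
packed = pack_context(chunks, token_budget=12)
prompt_context = render(packed)
```

With a 12-token budget the duplicate is dropped, the long chunk is skipped, and the short final chunk still fits. Greedy packing is not optimal in general; it is shown here only because it is easy to reason about.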
Hallucination mitigation requires both retrieval and generation controls. Use citations, require the model to answer only from provided evidence, include abstention behavior, run post-generation attribution checks, and track unsupported-claim rate. A good system says “I don’t know” when retrieval confidence is low.
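Two of these controls are easy to express in code: an abstention gate on retrieval confidence and the unsupported-claim rate. The 0.5 threshold and the idea of using the top retrieval score as a confidence signal are assumptions for this sketch, and the per-claim support labels are assumed to come from human review or an entailment check.

```python
def should_abstain(retrieval_scores, min_top_score=0.5):
    """Abstain when nothing was retrieved or the best match is too weak."""
    return not retrieval_scores or max(retrieval_scores) < min_top_score

def unsupported_claim_rate(claim_supported):
    """Share of claims in an answer that no cited chunk supports.

    `claim_supported` holds one boolean per extracted claim.
    """
    if not claim_supported:
        return 0.0
    unsupported = sum(1 for supported in claim_supported if not supported)
    return unsupported / len(claim_supported)

# Hypothetical values for illustration.
weak_retrieval = [0.21, 0.18, 0.05]
answer_claims = [True, True, False, True]
abstain = should_abstain(weak_retrieval)
rate = unsupported_claim_rate(answer_claims)
```

The threshold should be tuned on labeled data, since a gate that is too strict trades hallucinations for unnecessary refusals.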
Data freshness and permissions are critical at Meta scale. Indexes must handle document updates, deletions, language changes, and access control lists. Retrieval should enforce permissions before context reaches the model; filtering after generation is too late because private content may already have influenced the answer.
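The ordering constraint (permissions first, retrieval second) can be sketched as below. The document schema with an `acl` set of group names, and the word-overlap scoring, are hypothetical simplifications; real systems typically push the permission filter into the index query itself.

```python
def allowed_docs(docs, user_groups):
    """Keep only documents whose ACL shares at least one group with the user."""
    return [d for d in docs if d["acl"] & user_groups]

def retrieve_with_acl(query_terms, docs, user_groups, k=3):
    """Filter by permission BEFORE scoring, so restricted text is never ranked."""
    visible = allowed_docs(docs, user_groups)
    scored = [(len(query_terms & set(d["text"].lower().split())), d["id"])
              for d in visible]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked[:k]]

# Hypothetical corpus with access control lists.
docs = [
    {"id": "public_faq", "acl": {"everyone"},
     "text": "how to appeal a content decision"},
    {"id": "internal_runbook", "acl": {"support_staff"},
     "text": "internal appeal escalation steps"},
]
public_hits = retrieve_with_acl({"appeal"}, docs, {"everyone"})
staff_hits = retrieve_with_acl({"appeal"}, docs, {"everyone", "support_staff"})
```

A regular user never sees `internal_runbook` as a candidate, so it cannot leak into the prompt; a staff user gets both documents.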
Online success metrics should reflect user and business value, not just model scores. Depending on the product, measure successful resolution rate, follow-up question rate, thumbs-up/down, session length, escalation to human support, report rate, latency p95/p99, cost per query, and safety violation rate.

Worked example

Design a RAG system for Meta AI to answer questions using Help Center articles

A strong answer starts by clarifying scope: “Are we answering only from official Help Center content, or also community posts and policy documents? What languages, freshness requirements, and safety constraints matter? Is the goal self-service resolution, reduced support load, or conversational engagement?” I would then state assumptions: official articles are the trusted corpus, answers require citations, the system must support frequent updates, and latency should be acceptable for an interactive assistant. I would organize the answer around four pillars: corpus ingestion and chunking, hybrid retrieval and reranking, grounded generation with citations, and evaluation/monitoring.

For ingestion, I would parse articles into structured chunks using headings, product area, locale, timestamp, and permission metadata rather than arbitrary text splits. For retrieval, I would use hybrid BM25 plus dense embeddings, retrieve perhaps top 100 candidates, rerank to top 5–10, then pack evidence into the prompt with source titles and URLs. For generation, I would instruct the model to answer only from retrieved context, cite sources, and abstain when evidence is insufficient. Offline evaluation would include Recall@k on labeled query-document pairs and answer groundedness on a human-labeled set; online evaluation would include resolution rate, helpfulness, follow-up rate, and latency.

A specific tradeoff to flag is chunk size: smaller chunks improve ranking precision and citation granularity, but may omit necessary context like exceptions or eligibility criteria. I would close by saying that, with more time, I would add multilingual evaluation, freshness tests after document updates, and guardrails for policy-sensitive content such as account recovery or minors’ safety.

A second angle

How would you measure whether RAG improved answer quality?

Here the framing shifts from system design to causal measurement and metric selection. I would separate offline component metrics from online product metrics: retrieval Recall@k and NDCG tell us whether the right evidence is available, while groundedness and human preference scores tell us whether the generated answer used that evidence correctly. For launch evaluation, I would run an A/B test comparing the baseline assistant to the RAG-enhanced assistant, with primary metrics such as successful resolution rate or user-rated helpfulness and guardrail metrics like hallucination reports, latency, and cost. I would also segment by query type, language, product area, and freshness sensitivity because aggregate improvements can hide regressions for rare but important support intents. The key constraint is that user feedback is biased and sparse, so I would combine behavioral metrics, human audits, and targeted golden datasets.
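For a binary primary metric such as successful resolution, the A/B readout can be sketched with a pooled two-proportion z-test. The counts below are invented, and the test assumes independent users and large samples; a real analysis would also account for multiple metrics, segments, and any clustering by user or session.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: control (baseline assistant) vs. treatment (RAG).
lift, z, p_value = two_proportion_z_test(
    success_a=4200, n_a=10_000,
    success_b=4400, n_b=10_000,
)
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.4f}")
```

The same function can be rerun per segment (query type, language, product area), though slicing many ways raises the false-positive rate and calls for a correction or a pre-registered segment list.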

Common pitfalls

Analytical mistake: treating answer quality as one metric.
A tempting answer is “we’ll measure accuracy” without decomposing accuracy into retrieval recall, ranking precision, groundedness, completeness, and user satisfaction. A stronger answer isolates failure modes: if the right document was not retrieved, tune indexing and retrieval; if it was retrieved but ignored, tune prompt construction or generation.

Communication mistake: jumping straight to vector databases.
Many candidates over-index on “use embeddings and a vector DB” as if that solves the product problem. Interviewers expect you to discuss corpus quality, permissions, freshness, evaluation, latency, and fallback behavior. Vector search is one component, not the design.

Depth mistake: ignoring edge cases where dense retrieval fails.
Dense embeddings can perform poorly on exact product names, account IDs, policy numbers, URLs, or very recent terms absent from embedding training data. Mentioning BM25, hybrid retrieval, metadata filters, and freshness-aware indexing shows practical depth.

Connections

Expect pivots into ranking systems, search evaluation, recommender metrics, online experimentation, and LLM safety. If the interviewer pushes on measurement, be ready to discuss A/B testing, inter-rater reliability for human labels, LLM-as-judge calibration, and guardrail metrics. If they push on infrastructure, adjacent topics include ANN indexing, caching, latency/cost tradeoffs, privacy-preserving retrieval, and data deletion compliance.