Retrieval-Augmented Generation

What's being tested

Interviewers are probing whether you can reason about retrieval-augmented generation as an applied ML system: when it is preferable to fine-tuning, how to evaluate answer quality, and how to control hallucination, latency, and cost. For a Data Scientist, the emphasis is not on building the serving stack, but on defining success metrics, designing offline and online evaluations, diagnosing failure modes, and making evidence-based tradeoffs. Amazon cares because many internal and customer-facing products depend on accurate answers over changing catalogs, policies, reviews, support docs, and seller content. A strong answer connects model behavior to business risk: wrong answers, unsupported claims, poor coverage, high inference cost, and degraded customer trust.

Core knowledge

RAG pipeline anatomy: a typical system has document selection, chunking, embedding, vector retrieval, optional lexical retrieval, reranking, prompt construction, generation, and post-generation validation. As a DS, focus on which stage explains failures: missing document, bad chunk, weak ranking, poor prompt grounding, or model hallucination.
RAG vs. fine-tuning: use RAG when knowledge changes frequently, answers require citations, or the source corpus is large and dynamic. Use fine-tuning when the gap is in style, task format, domain-specific reasoning patterns, or classification behavior. Fine-tuning does not reliably “store” thousands of facts, and a fine-tuned model can still hallucinate.
Retrieval metrics: evaluate retrieval before generation using Recall@k, Precision@k, MRR, and nDCG@k. If the correct supporting passage appears in the top $k$, retrieval recall is high; if it appears near rank 1, MRR and nDCG improve. Poor retrieval caps final answer quality no matter how strong the LLM is.
Answer-quality metrics: evaluate final responses on faithfulness, answer correctness, citation accuracy, coverage, refusal quality, and helpfulness. For factual systems, split “is the answer true?” from “is the answer supported by retrieved context?” because a true answer can still be ungrounded.
Human evaluation design: create a labeled test set stratified by query type: lookup, comparison, multi-hop, ambiguous, out-of-scope, freshness-sensitive, and adversarial. Use blinded raters, rubric-based labels, inter-rater agreement such as Cohen’s kappa, and adjudication for ambiguous cases.
Offline-to-online gap: offline metrics like Recall@5 and judge-rated correctness are necessary but not sufficient. Online metrics may include CTR, task completion, deflection rate, escalation rate, answer acceptance, repeat-contact rate, refund/contact outcomes, and guardrail metrics like harmful-answer rate.
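Cost metrics: total expected cost per query is roughly $C = C_{\text{embed}} + k\,C_{\text{rerank}} + T_{\text{input}}\,c_{\text{in}} + T_{\text{output}}\,c_{\text{out}}$, where $C_{\text{embed}}$ is the cost of embedding the query, $k$ is the number of retrieved candidates scored at per-candidate cost $C_{\text{rerank}}$, $T_{\text{input}}$ and $T_{\text{output}}$ are the input and output token counts, and $c_{\text{in}}$ and $c_{\text{out}}$ are the per-token prices. DS tradeoffs include reducing top_k, compressing context, using cheaper rerankers, caching frequent answers, or routing simple queries to smaller models.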
Chunking tradeoffs: small chunks improve precise retrieval but can lose context; large chunks preserve context but add noise and token cost. Common starting points are 200–800 tokens with overlap, then tune using retrieval recall and downstream answer accuracy rather than arbitrary chunk size.
Embedding tradeoffs: higher-dimensional embeddings can improve semantic resolution but increase storage and retrieval cost, and a model chosen for its benchmark scores may not transfer to your own queries. Compare embedding models with a fixed evaluation set, including domain-specific synonyms, abbreviations, multilingual queries, and entity-heavy queries.
Hybrid retrieval: dense retrieval captures semantic similarity, while BM25 or lexical retrieval is better for exact product IDs, policy names, error codes, and rare entities. A hybrid system with reranking often beats either alone, especially for Amazon-like catalogs and support documents with many near-duplicate entities.
Reranking role: a cross-encoder reranker scores query-document pairs more accurately than vector similarity but is slower and costlier. Use it on a candidate pool, e.g. retrieve top 50–200, rerank to top 5–10, then pass only the most relevant passages to the LLM.
Grounding and refusal: good systems explicitly handle “answer not in context.” Measure false-answer rate on unanswerable queries, not just accuracy on answerable ones. The prompt should instruct the model to cite evidence and refuse unsupported claims, but prompt instructions are not a substitute for evaluation.
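
The retrieval metrics named above can be made concrete with a small sketch. It assumes binary relevance (a document is either a gold supporting passage or not) and a single ranked list per query; the function names and document IDs are illustrative, not taken from any particular library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k slots occupied by relevant documents."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document; 0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: discounted gain divided by the ideal gain."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: two gold passages, retrieved at ranks 2 and 4.
ranked = ["d7", "d2", "d9", "d4", "d1"]
gold = ["d2", "d4"]
print(recall_at_k(ranked, gold, 3), recall_at_k(ranked, gold, 5))
print(reciprocal_rank(ranked, gold), round(ndcg_at_k(ranked, gold, 5), 3))
```

In this example Recall@3 is 0.5 while Recall@5 is 1.0, which is the kind of gap that motivates looking at rank-sensitive metrics such as MRR and nDCG alongside recall.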

Worked example

For “Design and evaluate a RAG system,” start by framing the use case: “What corpus are we answering from, how fresh is it, what is the cost of a wrong answer, and do we need citations or just conversational help?” Then declare assumptions, such as a customer-support assistant over policy and troubleshooting documents where correctness and groundedness matter more than creativity. Organize the answer into four pillars: data and query taxonomy, retrieval quality, generation quality, and online experiment design.

For retrieval, say you would build an offline benchmark of real and synthetic queries with gold supporting documents, then track Recall@k, MRR, and coverage by segment. For generation, evaluate answer correctness, faithfulness to retrieved context, citation precision, refusal behavior, and latency/cost per resolved query. A concrete tradeoff to flag is top_k: increasing it may improve recall but can add irrelevant context, raise token cost, and sometimes reduce answer faithfulness. For launch, propose an A/B test against the current experience with primary metrics like successful resolution or accepted answer rate, guardrails like escalation rate and complaint rate, and segmented analysis for long-tail topics. Close by saying that, with more time, you would add error taxonomy reviews: no relevant doc retrieved, relevant doc retrieved but ignored, conflicting docs, stale source, and ambiguous user intent.
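
The top_k cost tradeoff can be sketched with the per-query cost model from the core knowledge section. This is a rough illustration under stated assumptions: every passage passed to the LLM is assumed to have the same token length, the prompt has a fixed overhead, and all parameter names and prices below are hypothetical placeholders rather than real vendor rates.

```python
def cost_per_query(n_candidates, n_passages, tokens_per_passage,
                   prompt_overhead_tokens, output_tokens,
                   c_embed, c_rerank, c_in, c_out):
    """C = C_embed + k * C_rerank + T_input * c_in + T_output * c_out.

    n_candidates is the reranked pool size (k in the formula);
    n_passages is how many passages are placed in the prompt (top_k).
    """
    # Assumption: input tokens grow linearly with the number of passages.
    input_tokens = prompt_overhead_tokens + n_passages * tokens_per_passage
    return (c_embed
            + n_candidates * c_rerank
            + input_tokens * c_in
            + output_tokens * c_out)

def sweep_top_k(top_k_values, **fixed):
    """Cost per query for each candidate top_k, holding everything else fixed."""
    return {k: cost_per_query(n_passages=k, **fixed) for k in top_k_values}

# Placeholder prices, chosen only to make the arithmetic visible.
costs = sweep_top_k(
    [3, 5, 10, 20],
    n_candidates=100, tokens_per_passage=400,
    prompt_overhead_tokens=200, output_tokens=300,
    c_embed=1e-4, c_rerank=2e-5, c_in=1e-6, c_out=2e-6,
)
for top_k, cost in costs.items():
    print(f"top_k={top_k:>2}  cost per query={cost:.4f}")
```

Plotting these costs next to Recall@k and faithfulness for the same top_k values gives a direct view of where extra context stops paying for itself.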

A second angle

For “Choose Between Fine-Tuning and RAG for Client Chatbot,” the same concepts apply, but the decision is framed as model adaptation rather than system evaluation. A strong answer says RAG is the default if the chatbot must answer from changing client documents, provide citations, or support auditability. Fine-tuning is more appropriate if the main gap is tone, output schema, intent classification, or domain-specific phrasing. The best answer often combines them: RAG for factual grounding and a lightly fine-tuned or instruction-tuned model for consistent behavior. The evaluation should compare variants on the same labeled query set and include cost per successful resolution, not just model accuracy.
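
Cost per successful resolution can be computed with a few lines once each variant has been run on the same labeled query set. The sketch below is illustrative only: the variant names and all numbers are made-up placeholders, not measurements, and "resolved" stands for whatever success label the evaluation rubric defines.

```python
def cost_per_successful_resolution(n_queries, cost_per_query, n_resolved):
    """Total spend divided by the number of successfully resolved queries."""
    if n_resolved == 0:
        return float("inf")  # no successes: the variant is infinitely expensive
    return n_queries * cost_per_query / n_resolved

def compare_variants(variants):
    """Return one row per variant, cheapest cost per resolution first."""
    rows = []
    for name, v in variants.items():
        rows.append({
            "variant": name,
            "resolution_rate": v["resolved"] / v["queries"],
            "cost_per_resolution": cost_per_successful_resolution(
                v["queries"], v["cost_per_query"], v["resolved"]),
        })
    return sorted(rows, key=lambda r: r["cost_per_resolution"])

# Placeholder inputs for three hypothetical variants on the same 1,000 queries.
variants = {
    "rag_only":        {"queries": 1000, "cost_per_query": 0.004, "resolved": 700},
    "fine_tuned_only": {"queries": 1000, "cost_per_query": 0.002, "resolved": 500},
    "rag_plus_tuned":  {"queries": 1000, "cost_per_query": 0.005, "resolved": 800},
}
for row in compare_variants(variants):
    print(row)
```

A variant with the highest resolution rate is not necessarily the cheapest per resolution, which is why the two columns are reported side by side.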

Common pitfalls

Pitfall: Treating RAG as “just add a vector database.”

That answer is too shallow for a Data Scientist interview because it skips measurement. A better answer decomposes performance into retrieval recall, ranking quality, grounding, and end-user outcome metrics, then explains how each would be evaluated and improved.
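
One way to make that decomposition operational is to assign each evaluated query to the earliest pipeline stage that explains its failure. The sketch below assumes each labeled example records a gold document ID, the retrieved candidate pool, the passages actually placed in the prompt, and two rater labels; these field names and stage labels are hypothetical.

```python
from collections import Counter

def attribute_failure(example):
    """Return the earliest stage that explains the outcome for one query."""
    gold = example["gold_doc_id"]
    if gold not in example["candidate_ids"]:
        return "retrieval_miss"       # gold passage never retrieved
    if gold not in example["context_ids"]:
        return "ranking_miss"         # retrieved but cut before the prompt
    if not example["answer_supported"]:
        return "ungrounded_answer"    # gold in context, answer not supported by it
    if not example["answer_correct"]:
        return "generation_error"     # supported by context but still wrong
    return "ok"

def failure_breakdown(examples):
    """Count outcomes per stage across an evaluation set."""
    return Counter(attribute_failure(e) for e in examples)

# Tiny hand-written evaluation set for illustration.
examples = [
    {"gold_doc_id": "p1", "candidate_ids": ["p2", "p3"], "context_ids": ["p2"],
     "answer_supported": False, "answer_correct": False},
    {"gold_doc_id": "p4", "candidate_ids": ["p4", "p5"], "context_ids": ["p5"],
     "answer_supported": True, "answer_correct": False},
    {"gold_doc_id": "p6", "candidate_ids": ["p6"], "context_ids": ["p6"],
     "answer_supported": True, "answer_correct": True},
]
print(failure_breakdown(examples))
```

The breakdown shows where to invest: a large retrieval_miss share points at the corpus, chunking, or embeddings, while ungrounded_answer points at prompting and post-generation validation.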

Pitfall: Optimizing only average answer accuracy.

Average accuracy can hide severe failures on high-risk or low-frequency segments such as policy exceptions, medical/legal disclaimers, seller disputes, or fresh catalog changes. Segment by query type, document domain, language, customer cohort, and answerability; also track worst-case or tail metrics like unsupported-answer rate.
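
A segmented report makes the point concrete. The sketch below assumes each evaluated response carries a segment label plus two boolean rater labels, "correct" and "supported"; the field names and the toy records are hypothetical.

```python
from collections import defaultdict

def segment_report(records, segment_key):
    """Accuracy and unsupported-answer rate per segment."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "unsupported": 0})
    for r in records:
        s = totals[r[segment_key]]
        s["n"] += 1
        s["correct"] += int(r["correct"])
        s["unsupported"] += int(not r["supported"])
    return {
        seg: {
            "n": s["n"],
            "accuracy": s["correct"] / s["n"],
            "unsupported_rate": s["unsupported"] / s["n"],
        }
        for seg, s in totals.items()
    }

# Toy data: a common segment that does well and a rare one that does not.
records = (
    [{"query_type": "lookup", "correct": True, "supported": True}] * 8
    + [{"query_type": "policy_exception", "correct": True, "supported": True},
       {"query_type": "policy_exception", "correct": False, "supported": False}]
)
overall = sum(r["correct"] for r in records) / len(records)
print(f"overall accuracy: {overall:.2f}")
for seg, stats in segment_report(records, "query_type").items():
    print(seg, stats)
```

Here overall accuracy is 0.90, yet the policy_exception segment sits at 0.50 accuracy with a 0.50 unsupported-answer rate, which the average alone would not reveal.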

Pitfall: Claiming fine-tuning “teaches the model the knowledge base.”

Fine-tuning can improve format and behavior, but it is unreliable for frequently changing facts and hard to audit. For factual enterprise chatbots, say that RAG provides freshness and traceability, while fine-tuning may complement it for style, routing, or specialized reasoning patterns.

Connections

Interviewers may pivot from here to ranking evaluation, LLM hallucination measurement, A/B testing, semantic search, or transformer attention. They may also ask about RNNs vs. Transformers to check whether you understand why modern retrieval and generation systems rely on attention-based models for long-context language tasks.
