## Scenario
You are asked to design a Retrieval-Augmented Generation (RAG) system that answers user questions using a private corpus (e.g., internal docs, PDFs, knowledge base articles). The interviewer wants you to walk through **each component** and explain **how you would evaluate each step**.
## Requirements
- Support natural-language Q&A over private documents.
- Handle frequent document updates (new/changed docs).
- Provide citations or traceability to sources.
- Low latency for interactive use.
- Reduce hallucinations and ensure answers are grounded in retrieved context.
## What to cover
1. End-to-end architecture and data flow.
2. Document ingestion and preprocessing (parsing, cleaning, chunking).
3. Embedding strategy and indexing (vector DB / hybrid search).
4. Retrieval (query understanding, top-k, filters) and optional reranking.
5. Prompting/context assembly and generation.
6. Safety/guardrails and fallback behavior when retrieval is weak.
7. Evaluation plan for:
   - ingestion/chunking quality
   - retrieval quality
   - reranking quality (if used)
   - generation quality and grounding
   - end-to-end user success
8. Online monitoring and continuous improvement loop.
**Quick Answer:** This question evaluates expertise in designing Retrieval-Augmented Generation (RAG) systems, covering end-to-end architecture, document ingestion and preprocessing, embedding and indexing strategies, retrieval and reranking, prompt/context assembly, safety/fallbacks, and per-component evaluation.
## Solution
### 1) Clarify scope and assumptions
Start by asking:
- Corpus size (number of docs/pages), formats (PDF/HTML/Markdown), update frequency.
- Typical queries (fact lookup vs multi-hop reasoning vs summarization).
- Latency budget (e.g., p95 < 2s) and cost constraints.
- Compliance needs (PII, access control), multi-tenant needs.
- Output requirements: citations, quotes, structured JSON, etc.
Assume: internal docs, need citations, moderate scale (10^5–10^7 chunks), interactive latency.
---
### 2) High-level architecture (components)
**Offline / batch (or streaming) pipeline**
1. **Ingestion**: fetch docs from sources (S3/Drive/Confluence/Git).
2. **Parsing & normalization**: extract text, preserve structure (headings, tables if possible).
3. **Chunking**: split text into retrievable units with metadata.
4. **Embedding**: compute vector embeddings for chunks (and optionally for titles/headers).
5. **Indexing**: store vectors + metadata in a vector store; optionally build lexical index (BM25) too.
**Online query pipeline**
1. **Auth + policy**: apply access control filters.
2. **Query understanding**: rewrite/expand query; detect intent; extract filters.
3. **Retrieval**: vector search (and/or hybrid BM25+vector) → top-k candidates.
4. **Reranking (optional)**: cross-encoder reranker or LLM-based rerank.
5. **Context assembly**: dedupe, compress, select passages; attach citations.
6. **Generation**: grounded answer with instructions to cite and abstain if unsupported.
7. **Post-processing**: safety filters, formatting, source list, confidence.
8. **Logging**: store query, retrieved docs, model output, latency, user feedback.
---
### 3) Ingestion, parsing, chunking
**Parsing**
- Use format-specific parsers; preserve page numbers, headings, and source URLs.
- Extract tables carefully (either linearize or store as structured text).
**Chunking strategy**
- Common baseline: 200–400 tokens with 10–20% overlap.
- Prefer *semantic* chunking: split by headings/sections; keep coherent units.
- Add metadata: doc_id, section title, timestamp, ACL tags, source link.
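A minimal chunker sketch, assuming whitespace tokens as a stand-in for the embedding model's tokenizer (a real pipeline should split on section boundaries first):
```python
def chunk_text(text: str, max_tokens: int = 300, overlap: int = 50) -> list[str]:
    """Sliding-window chunking: ~max_tokens per chunk, `overlap` tokens shared
    between neighbors (the 200-400 token / 10-20% overlap baseline above)."""
    tokens = text.split()  # stand-in for a real tokenizer
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start : start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks
```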
**Pitfalls**
- Chunks that are too small lose context; chunks that are too large mean fewer retrieved units fit into the prompt.
- PDF extraction noise; duplicated boilerplate; headers/footers.
---
### 4) Embeddings and indexing
**Embedding choices**
- Use strong text embeddings; consider domain adaptation if jargon-heavy.
- Store multiple representations if needed: chunk text embedding + title embedding.
**Index**
- Vector index (HNSW/IVF) + metadata filtering (tenant, ACL, doc type, time).
- Consider **hybrid search**: BM25 for exact matches + vector for semantics.
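One simple way to fuse the two result lists is reciprocal rank fusion (RRF); a sketch, assuming each retriever returns doc IDs in rank order:
```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank_d).
    k=60 is the constant from the original RRF paper; no calibration between
    BM25 scores and vector distances is needed, only the rank positions."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([bm25_ids, vector_ids])[:top_k]
```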
**Freshness**
- Incremental updates: re-embed changed chunks; tombstone deleted docs.
- Keep embedding/model versioning for reproducibility.
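A sketch of hash-based change detection for incremental re-indexing; the `index` methods here are hypothetical placeholders for your vector store's API:
```python
import hashlib

def sync_chunk(index, chunk_id: str, text: str, embed, model_version: str) -> None:
    """Skip the embedding call when neither the chunk content nor the
    embedding model version changed; otherwise upsert a fresh vector."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    meta = index.get_metadata(chunk_id)  # hypothetical vector-store call
    if meta and meta["content_hash"] == digest and meta["model_version"] == model_version:
        return  # unchanged -> no re-embed needed
    index.upsert(  # hypothetical vector-store call
        chunk_id,
        embed(text),
        {"content_hash": digest, "model_version": model_version},
    )
```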
---
### 5) Retrieval and reranking
**Retrieval**
- Top-k retrieval (e.g., k=20–100) with filters.
- Query rewriting: expand acronyms, convert question to search query.
- Multi-query retrieval: generate 3–5 query variants and merge results.
**Reranking (optional but common)**
- Cross-encoder reranker on query–chunk pairs to improve precision.
- LLM reranking for small candidate sets when budget allows.
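A minimal rerank sketch with the sentence-transformers `CrossEncoder` (the checkpoint name is one common public model, not a recommendation):
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Score every (query, passage) pair jointly and keep the best passages.
    Cross-encoders are much slower than bi-encoder retrieval, so apply them
    only to the retrieved top-k (k ~ 20-100), never the whole corpus."""
    scores = reranker.predict([(query, p) for p in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```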
**Context assembly**
- Deduplicate near-identical chunks.
- Prefer diverse sources if answering broad questions.
- Use *context compression*: summarize or extract only relevant sentences to fit token budget.
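A crude dedup sketch using token-set Jaccard similarity; embedding-based similarity works better, but the shape is the same:
```python
def dedupe_chunks(chunks: list[str], threshold: float = 0.85) -> list[str]:
    """Drop near-duplicates of already-kept chunks. Assumes `chunks` arrive
    in relevance order, so the higher-ranked copy is the one that survives."""
    kept, kept_sets = [], []
    for chunk in chunks:
        tokens = set(chunk.lower().split())
        if any(len(tokens & s) / len(tokens | s) >= threshold
               for s in kept_sets if tokens | s):
            continue  # too similar to a chunk we already kept
        kept.append(chunk)
        kept_sets.append(tokens)
    return kept
```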
---
### 6) Generation and grounding
**Prompting**
- System instruction: answer only using the provided context; cite sources; say “I don’t know” when the context lacks the answer.
- Provide a clear citation format (e.g., [DocTitle §Section](URL)).
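A sketch of the assembled prompt (the instruction wording and numbered-citation format are illustrative):
```python
SYSTEM_PROMPT = """You answer questions about internal documentation.
Use ONLY the context passages below. Cite every claim with its passage
marker, e.g. [1]. If the context does not contain the answer, reply:
"I don't know based on the available documents."
"""

def build_prompt(question: str, passages: list[dict]) -> str:
    """Number each passage and attach its source so the model can cite;
    each passage dict is assumed to carry 'text', 'title', and 'url'."""
    context = "\n\n".join(
        f"[{i}] ({p['title']} - {p['url']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return f"Context:\n{context}\n\nQuestion: {question}"
```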
**Hallucination mitigation**
- Refusal/abstention policy based on retrieval confidence (e.g., if top score < threshold).
- Ask clarifying questions when query is underspecified.
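A sketch of a score-gated abstention check; both thresholds are illustrative and must be calibrated per embedding model against the eval set in section 7:
```python
def should_abstain(retrieval_scores: list[float],
                   min_top_score: float = 0.35,
                   min_hits: int = 2) -> bool:
    """Abstain (or ask a clarifying question) when retrieval looks weak:
    no hit clears the score threshold, or too few passable hits exist."""
    if not retrieval_scores or max(retrieval_scores) < min_top_score:
        return True
    return sum(s >= min_top_score for s in retrieval_scores) < min_hits
```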
**Guardrails**
- PII redaction, policy filters, safe completion templates.
- Enforce access control at retrieval time and again before final answer.
---
### 7) Evaluation: per-component and end-to-end
You want both **offline** (repeatable) and **online** (real user) evaluation.
#### A) Chunking / ingestion evaluation
Goal: chunks should be coherent and retrievable.
- Manual audits on a sampled set: coherence, duplication rate, metadata correctness.
- Automated checks: average chunk length, overlap %, parser error rates.
- Regression tests on known “hard” documents (tables, PDFs).
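A sketch of cheap automated ingestion checks, run after each re-index so regressions (e.g., a parser change halving mean chunk length) trip an alert:
```python
import statistics

def chunk_stats(chunks: list[str]) -> dict:
    """Corpus-level health numbers to track across re-indexing runs."""
    lengths = sorted(len(c.split()) for c in chunks)
    return {
        "n_chunks": len(chunks),
        "mean_tokens": statistics.mean(lengths),
        "p95_tokens": lengths[int(0.95 * (len(lengths) - 1))],
        "exact_duplicates": len(chunks) - len(set(chunks)),
    }
```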
#### B) Retrieval evaluation (core of RAG)
Create a labeled dataset: (query, relevant chunks/docs).
- **Recall@k**: fraction of queries where at least one relevant chunk is in top-k.
- **MRR / nDCG**: reward correct ranking order.
- Slice metrics by doc type, query type, tenant, freshness.
If you lack labels:
- Use weak supervision: click logs, human annotation on top results, or synthetic Q/A pairs generated from docs (with human spot checks).
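Both core metrics are easy to compute offline; a sketch over parallel lists of retrieved IDs and gold relevant-ID sets:
```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top-k."""
    hits = sum(bool(set(r[:k]) & rel) for r, rel in zip(results, relevant))
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant chunk (0 when none found)."""
    total = 0.0
    for r, rel in zip(results, relevant):
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```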
#### C) Reranker evaluation
- Compare precision-focused metrics (e.g., nDCG@10) before vs after rerank.
- Measure latency/cost tradeoff.
- Error analysis: reranker bias toward longer passages, keyword overfitting.
#### D) Generation evaluation (with grounding)
You need to separate:
1) **Answer correctness** (does it answer the question?)
2) **Faithfulness/grounding** (is it supported by retrieved context?)
3) **Citation quality** (do citations actually back the claims?)
Methods:
- Human grading rubric (best early on): correctness, completeness, groundedness, readability.
- LLM-as-judge with guardrails + periodic human calibration.
- Automated checks:
  - “Attribution”: require each sentence to map to at least one cited chunk.
  - Contradiction detection between answer and context (imperfect but useful).
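A crude lexical attribution check; NLI- or embedding-based entailment is stronger, but this illustrates the shape of the automated gate:
```python
import re

def unsupported_sentences(answer: str, chunks: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    """Flag answer sentences whose content words mostly appear in no
    retrieved chunk -- candidates for hallucination review."""
    chunk_vocab = [set(c.lower().split()) for c in chunks]
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = {w for w in sent.lower().split() if len(w) > 3}
        if not words:
            continue
        best = max((len(words & v) / len(words) for v in chunk_vocab), default=0.0)
        if best < min_overlap:
            flagged.append(sent)
    return flagged
```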
#### E) End-to-end evaluation
- Task success rate, user satisfaction, deflection rate (if used for support).
- Latency p50/p95, cost per query.
- Abstention rate: too high hurts usefulness; too low increases hallucinations.
---
### 8) Online monitoring and iteration
**Logging**
- Query text, rewritten query, retrieved doc IDs, scores, rerank scores, final prompt tokens, output.
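A sketch of one per-query log record (field names are illustrative); these records feed both the dashboards below and the offline eval set:
```python
from dataclasses import dataclass

@dataclass
class QueryLog:
    """One record per answered query."""
    query: str
    rewritten_query: str
    retrieved_ids: list[str]
    retrieval_scores: list[float]
    rerank_scores: list[float]
    prompt_tokens: int
    output: str
    latency_ms: float
    feedback: str | None = None  # e.g. "thumbs_up" / "thumbs_down"
```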
**Online metrics**
- CTR on cited sources, thumbs up/down, “answer helpful” ratings.
- Drift: embedding distribution shifts, retrieval score distribution, doc freshness.
**Feedback loop**
- Collect failure cases; add to eval set.
- Tune chunking, k, reranker thresholds.
- Add domain-specific synonyms/expansions or structured metadata filters.
---
### 9) Common edge cases to mention
- Conflicting sources: select most recent/authoritative; show multiple citations.
- Multi-hop questions: iterative retrieval (retrieve → draft → retrieve again).
- Access control: per-user ACL filtering is non-negotiable.
- Very long docs: hierarchical retrieval (doc-level → section-level → chunk-level).
This structure (pipeline + per-step evaluation + monitoring) directly answers “walk through each component and explain how you would evaluate each step.”