Design and optimize a RAG system
Company: OpenAI
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Onsite
## Scenario
You are building a **Retrieval-Augmented Generation (RAG)** system for question answering over an internal document corpus (engineering wikis, design docs, runbooks, support tickets). Users ask natural-language questions in an interactive chat surface and expect grounded answers **with citations** back to source documents.
## Task
Design the **end-to-end architecture** and describe the **optimization strategies** you would use to make this system high-quality, low-latency, and trustworthy in production.
Your design should cover, at minimum:
- The full set of components: **ingestion, parsing, chunking, embeddings, indexing, retrieval, reranking, and generation**.
- How you **improve retrieval relevance** and **reduce hallucinations**.
- An **evaluation plan** (offline + online) and **monitoring** for corpus and embedding drift.
The corpus is updated continuously (new and edited documents), contains long documents, and spans heterogeneous formats (PDF, HTML, wiki pages, Markdown).
```hint Where to start
Frame it as two separate paths: an **offline/asynchronous indexing path** (ingest → parse → chunk → embed → index) and an **online query path** (query understanding → retrieve → rerank → generate → verify). Sketch them separately so that freshness concerns attach to the indexing path and latency budgets attach to the query path.
```
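As a rough illustration of the two paths in the hint above, here is a minimal sketch. Everything in it is hypothetical: `InMemoryIndex`, `index_document`, and `answer_query` are illustrative names, the "index" is a plain dictionary with toy term-overlap scoring, and embedding, reranking, and generation are left out. It only shows where the split sits and why the indexing path must replace a document's chunks on edit rather than append.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for the real vector + lexical indexes."""
    chunks: dict[str, Chunk] = field(default_factory=dict)

    def upsert_document(self, doc_id: str, chunks: list[Chunk]) -> None:
        # Replace, don't append: drop every old chunk of this document first,
        # so an edited or shortened document leaves no stale chunks behind.
        self.chunks = {cid: c for cid, c in self.chunks.items() if c.doc_id != doc_id}
        for c in chunks:
            self.chunks[c.chunk_id] = c

    def search(self, query: str, k: int) -> list[Chunk]:
        # Toy scoring by shared-term count; a real system would run hybrid
        # retrieval and reranking here.
        terms = set(query.lower().split())
        scored = [(len(terms & set(c.text.lower().split())), c) for c in self.chunks.values()]
        scored = [(s, c) for s, c in scored if s > 0]
        scored.sort(key=lambda sc: (-sc[0], sc[1].chunk_id))
        return [c for _, c in scored[:k]]


def index_document(index: InMemoryIndex, doc_id: str, raw_text: str, max_words: int = 50) -> int:
    """Asynchronous path: parse -> chunk -> (embed) -> index. Returns the chunk count."""
    words = raw_text.split()
    chunks = [
        Chunk(doc_id, f"{doc_id}#{i // max_words}", " ".join(words[i:i + max_words]))
        for i in range(0, len(words), max_words)
    ]
    index.upsert_document(doc_id, chunks)
    return len(chunks)


def answer_query(index: InMemoryIndex, query: str, k: int = 3) -> dict:
    """Synchronous path: retrieve -> (rerank -> generate -> verify).
    This sketch stops after retrieval and returns the context plus citations."""
    hits = index.search(query, k)
    return {"context": [c.text for c in hits], "citations": [c.chunk_id for c in hits]}
```

Re-indexing the same `doc_id` with edited text should make the old wording unretrievable, which is the behaviour the freshness SLA depends on.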
```hint Retrieval quality
The single highest-leverage relevance move is usually **hybrid retrieval**: sparse lexical retrieval (such as BM25) plus dense vector retrieval, fused with a method like Reciprocal Rank Fusion, followed by a **cross-encoder reranker** on the top-N candidates. Think about why pure dense retrieval fails on exact tokens (error codes, IDs, version strings).
```
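Reciprocal Rank Fusion from the hint above can be sketched in a few lines. It needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity without score calibration. The function name and the example IDs are illustrative; `k=60` is a commonly used smoothing constant and is an assumption here, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    Items missing from a list simply get no contribution from it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical example: the lexical list ranks the exact error code first,
# the dense list ranks a semantically similar chunk first.
bm25_hits = ["err-4031", "a", "b"]
dense_hits = ["a", "c", "err-4031"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The fused list would then be truncated to the top N and passed to the cross-encoder reranker.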
```hint Grounding
Reducing hallucination is not one trick but a chain: **grounded prompting** ("answer only from the provided sources"), **citation enforcement**, **confidence gating** (abstain when top scores are low), and an optional **faithfulness/entailment check** on the generated answer against the retrieved passages.
```
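Two links of that chain, confidence gating and citation enforcement, are cheap to check in code. The sketch below is a minimal version under stated assumptions: the generator is prompted to cite sources as `[source_id]`, the reranker emits a score in [0, 1], and the `min_score` threshold of 0.35 is a placeholder that would need calibration on an eval set. `gate_and_check` and `ABSTAIN` are hypothetical names. It does not perform the entailment check, which needs a model.

```python
import re

ABSTAIN = "I don't know based on the available sources."


def gate_and_check(
    answer: str,
    passages: dict[str, str],
    top_score: float,
    min_score: float = 0.35,  # ASSUMPTION: placeholder threshold, calibrate offline
) -> tuple[str, list[str]]:
    """Return (final_answer, cited_source_ids), abstaining when grounding looks weak.

    passages maps source_id -> retrieved passage text.
    """
    # Confidence gating: nothing retrieved, or the best candidate scored too low.
    if not passages or top_score < min_score:
        return ABSTAIN, []

    # Citation enforcement: every [id] in the answer must be a retrieved source,
    # and there must be at least one citation.
    cited = list(dict.fromkeys(re.findall(r"\[([^\[\]]+)\]", answer)))
    if not cited or any(source_id not in passages for source_id in cited):
        return ABSTAIN, []

    return answer, cited
```

Whether to abstain outright or to retry generation with a stricter prompt when a citation check fails is a product decision that depends on the tolerance for "I don't know" answers.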
### Constraints & Assumptions
State your own numbers explicitly. A reasonable baseline to design against:
- Corpus on the order of **10^6–10^7 chunks** after splitting.
- Interactive latency target: **p95 time to first token < 2 s** (retrieval + rerank + first token), with the rest of the answer streamed.
- Ingestion freshness SLA: a new or edited document is **retrievable within minutes**, not hours.
- Documents carry **access-control metadata (ACLs)** — not every user may see every document.
- You may assume access to a managed or self-hosted **vector index**, a **lexical/BM25 index**, an embedding model, a cross-encoder reranker, and a generation LLM.
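
Stating numbers explicitly usually means doing a little arithmetic on them. The sketch below sizes the raw dense vectors for the chunk counts above and checks a per-stage latency split against the 2 s target. The embedding dimension (768, float32) and every per-stage millisecond figure are illustrative assumptions, not measurements; the helper names are hypothetical.

```python
def raw_vector_bytes(num_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Storage for the dense vectors alone (no index overhead, no metadata)."""
    return num_chunks * dim * bytes_per_value


# ASSUMPTION: 768-dimensional float32 embeddings.
for n in (10**6, 10**7):
    print(f"{n:>10,} chunks -> {raw_vector_bytes(n, 768) / 1e9:.2f} GB raw vectors")

# ASSUMPTION: an illustrative split of the 2000 ms budget, to be replaced
# with measured p95 values per stage.
P95_TARGET_MS = 2000
BUDGET_MS = {
    "query_understanding": 100,
    "hybrid_retrieval": 150,
    "rerank_top_n": 300,
    "prompt_build_and_first_token": 1000,
}


def remaining_headroom_ms(budget: dict[str, int], target_ms: int) -> int:
    """Slack left after all stages; negative means the split cannot meet the target."""
    return target_ms - sum(budget.values())


headroom = remaining_headroom_ms(BUDGET_MS, P95_TARGET_MS)
```

Under these assumptions the raw vectors come to about 3.07 GB at 10^6 chunks and 30.72 GB at 10^7, which is the point where quantization or sharding becomes worth discussing. Note that per-stage p95 values do not add up exactly to the end-to-end p95, so a summed budget like this is a planning aid, not a guarantee.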
### Clarifying Questions to Ask
A strong candidate scopes the problem before designing. For example:
- What is the **expected query volume** (QPS) and the **read/write ratio** (query rate vs. ingestion rate)?
- Are answers **single-turn** or **conversational** (does the retriever need to resolve follow-up references against chat history)?
- How strict are the **access-control and data-isolation** requirements — must we prevent even leaking the *existence* of a forbidden document?
- What is the **cost budget** per query (embedding calls, reranker calls, generation tokens)?
- How will quality be judged — is there an existing **labeled eval set**, or do we need to bootstrap one?
- What is the tolerance for **"I don't know" / abstention** answers versus always producing something?
### What a Strong Answer Covers
The interviewer is looking for breadth across the pipeline *and* depth on the parts that matter most for RAG quality. Dimensions to hit:
- **End-to-end component decomposition** with a clear split between the asynchronous indexing path and the synchronous query path, and where each latency/freshness budget lives.
- **Parsing & chunking strategy** for heterogeneous, long documents — structure-aware splitting, chunk size/overlap tradeoffs, and preserving tables/code/headings.
- **Retrieval design** — hybrid (sparse + dense) retrieval, metadata filtering (including ACL filtering), fusion, and a reranking stage, with the latency/quality tradeoff articulated.
- **Generation & grounding** — prompt construction, citation enforcement, confidence gating / abstention, and post-hoc faithfulness verification.
- **Continuous ingestion** — incremental indexing, handling edits/deletes (not just appends), and index freshness.
- **Evaluation** — separate retrieval metrics (Recall@K, nDCG) from answer metrics (correctness, faithfulness/groundedness), plus an online feedback loop.
- **Monitoring & drift** — index freshness, retrieval score distributions, query/corpus drift, and how failures feed back into the eval set.
- **Tradeoff reasoning** — the candidate should name where they're trading latency for quality or cost, not present one fixed answer.
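
For the retrieval half of the evaluation plan, the two metrics named above can be computed per query as follows and then averaged over the eval set. This is a minimal sketch assuming each query has labeled relevant chunk IDs (binary for Recall@K, graded gains for nDCG) and that the retrieved list contains no duplicate IDs; the function names are illustrative.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """Normalized discounted cumulative gain over the top k.

    gains maps chunk_id -> graded relevance; unlabeled chunks count as 0.
    """
    dcg = sum(
        gains.get(chunk_id, 0.0) / math.log2(position + 2)
        for position, chunk_id in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 2) for position, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@K at the retriever's output bounds what the reranker and generator can achieve, while nDCG is more useful for judging the reranker, since it rewards placing relevant chunks near the top.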
### Follow-up Questions
- A user reports the system **confidently cited the wrong document**. Walk through how you'd debug whether the failure is in **retrieval** (wrong chunks surfaced) or **generation** (right chunks, bad synthesis), and what you'd change for each.
- Your embedding model is **upgraded to a new version**. What is your re-indexing and rollout plan, and how do you avoid a mixed-embedding-space index where old and new vectors are incomparable?
- How would you extend the system to support **multi-hop questions** whose answer requires combining facts from several documents?
- The corpus contains **near-duplicate documents** (e.g., copies of the same runbook). How does this hurt retrieval, and how would you handle it?
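
For the near-duplicate question, one simple starting point is word-shingle Jaccard similarity, sketched below. Near-duplicates tend to crowd the top-K with the same content, so collapsing them before reranking frees slots for distinct evidence. The function names and the 0.8 threshold are illustrative assumptions, and the pairwise comparison is quadratic, so at the corpus sizes discussed here it would only be applied to a retrieved candidate set, with something like MinHash and LSH used at ingestion time instead.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping n-word windows, lowercased."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(chunks: list[tuple[str, str]], threshold: float = 0.8) -> list[str]:
    """Keep the first chunk of each near-duplicate group.

    chunks is a list of (chunk_id, text) in ranked order; returns the kept IDs.
    ASSUMPTION: threshold=0.8 is a placeholder, not a tuned value.
    """
    kept: list[tuple[str, set]] = []
    for chunk_id, text in chunks:
        signature = shingles(text)
        if all(jaccard(signature, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((chunk_id, signature))
    return [chunk_id for chunk_id, _ in kept]
```

When a duplicate is dropped, its document ID is still worth keeping as an alternate citation, so users can be pointed at the canonical copy.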
Quick Answer: This question tests the ability to design and optimize a Retrieval-Augmented Generation (RAG) system end to end, covering ingestion, chunking, embeddings, indexing, retrieval, reranking, generation, and evaluation.