How do I approach ML System Design interview questions?

ML System Design questions require a firm grasp of core concepts plus regular practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during onsite rounds at OpenAI.

What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Design and optimize a RAG system | OpenAI Interview Question

Q: Design and optimize a RAG system

This question evaluates knowledge and competency in designing and optimizing Retrieval-Augmented Generation (RAG) systems, including components like ingestion, chunking, embeddings, indexing, retrieval, reranking, generation, and evaluation.

Scenario

You are building a Retrieval-Augmented Generation (RAG) system for question answering over an internal document corpus (engineering wikis, design docs, runbooks, support tickets). Users ask natural-language questions in an interactive chat surface and expect grounded answers with citations back to source documents.

Task

Design the end-to-end architecture and describe the optimization strategies you would use to make this system high-quality, low-latency, and trustworthy in production.

Your design should cover, at minimum:

The full set of components: ingestion, parsing, chunking, embeddings, indexing, retrieval, reranking, and generation.
How you improve retrieval relevance and reduce hallucinations.
An evaluation plan (offline + online) and monitoring for corpus and embedding drift.

The corpus is updated continuously (new and edited documents), contains long documents, and spans heterogeneous formats (PDF, HTML, wiki pages, Markdown).
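
One way to handle long documents is structure-aware chunking: split on headings first, then window each long section with some overlap. The sketch below assumes documents have already been converted to Markdown; `Chunk`, `chunk_markdown`, and the default sizes are hypothetical names and values chosen for illustration, not part of the question.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest heading above the text, kept as retrieval context
    text: str


def chunk_markdown(doc: str, max_words: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split on Markdown headings, then window long sections by word count."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    chunks: list[Chunk] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        # Emit overlapping word windows for the section collected so far.
        words = " ".join(body).split()
        step = max_words - overlap
        for start in range(0, len(words), step):
            chunks.append(Chunk(heading, " ".join(words[start:start + max_words])))
            if start + max_words >= len(words):
                break  # the last window already reaches the end of the section
        body.clear()

    for line in doc.splitlines():
        if re.match(r"^#{1,6}\s", line):
            flush()  # close the previous section under its own heading
            heading = line.lstrip("#").strip()
        else:
            body.append(line)
    flush()
    return chunks
```

Word counts stand in for token counts here; a production splitter would likely count tokens with the embedding model's tokenizer and treat tables and code blocks as indivisible units.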

Constraints & Assumptions

State your own numbers explicitly, but a reasonable baseline to design against:

Corpus on the order of 10^6–10^7 chunks after splitting.
Interactive latency target: p95 end-to-end < 2 s (retrieval + rerank + first token), with answer streaming.
Ingestion freshness SLA: a new or edited document is retrievable within minutes, not hours.
Documents carry access-control metadata (ACLs) — not every user may see every document.
You may assume access to a managed or self-hosted vector index, a lexical/BM25 index, an embedding model, a cross-encoder reranker, and a generation LLM.
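
A quick back-of-the-envelope check helps turn the corpus baseline above into a memory budget. The sketch below assumes 768-dimensional float32 embeddings; the dimension is an assumption (the question does not specify an embedding model), and `index_size_gib` is a hypothetical helper.

```python
def index_size_gib(num_chunks: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GiB, excluding any index overhead."""
    return num_chunks * dim * bytes_per_value / 2**30


if __name__ == "__main__":
    DIM = 768  # assumed embedding dimension, not given in the question
    for num_chunks in (10**6, 10**7):
        print(f"{num_chunks:>10,} chunks -> {index_size_gib(num_chunks, DIM):.2f} GiB")
```

Under these assumptions the raw vectors come to roughly 2.9 GiB at 10^6 chunks and roughly 29 GiB at 10^7. Graph or cluster structures in an approximate-nearest-neighbor index add overhead on top of this, while quantization can reduce it, so the figure is a starting point for the discussion rather than a final number.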

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. For example:

What is the expected query volume (QPS) and the read/write ratio (query rate vs. ingestion rate)?
Are answers single-turn or conversational (does the retriever need to resolve follow-up references against chat history)?
How strict are the access-control and data-isolation requirements — must we prevent even leaking the existence of a forbidden document?
What is the cost budget per query (embedding calls, reranker calls, generation tokens)?
How will quality be judged: is there an existing labeled eval set, or do we need to bootstrap one?
What is the tolerance for "I don't know" / abstention answers versus always producing something?
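
Once a labeled eval set exists (or has been bootstrapped), retrieval quality can be scored separately from answer quality. The sketch below computes Recall@K and nDCG@K with binary relevance labels; the function names are hypothetical, and it assumes each retrieved list contains unique chunk IDs.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """nDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)  # position i is 0-based, so rank = i + 1
        for i, chunk_id in enumerate(retrieved[:k])
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Recall@K shows whether the right chunks reach the reranker at all, while nDCG@K also rewards placing them near the top, which matters when only a few chunks fit in the prompt.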

What a Strong Answer Covers

The interviewer is looking for breadth across the pipeline and depth on the parts that matter most for RAG quality. Dimensions to hit:

End-to-end component decomposition with a clear split between the asynchronous indexing path and the synchronous query path, and where each latency/freshness budget lives.
Parsing & chunking strategy for heterogeneous, long documents — structure-aware splitting, chunk size/overlap tradeoffs, and preserving tables/code/headings.
Retrieval design — hybrid (sparse + dense) retrieval, metadata filtering (including ACL filtering), fusion, and a reranking stage, with the latency/quality tradeoff articulated.
Generation & grounding — prompt construction, citation enforcement, confidence gating / abstention, and post-hoc faithfulness verification.
Continuous ingestion — incremental indexing, handling edits/deletes (not just appends), and index freshness.
Evaluation — separate retrieval metrics (Recall@K, nDCG) from answer metrics (correctness, faithfulness/groundedness), plus an online feedback loop.
Monitoring & drift — index freshness, retrieval score distributions, query/corpus drift, and how failures feed back into the eval set.
Tradeoff reasoning — the candidate should name where they're trading latency for quality or cost, not present one fixed answer.
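
The hybrid retrieval and ACL filtering points above can be sketched with Reciprocal Rank Fusion (RRF), one common way to merge a lexical ranking and a dense ranking without calibrating their scores. This is a minimal illustration, not a prescribed design: `rrf_fuse`, `acl_filter`, the group-based ACL model, and the constant `k = 60` (a conventional default) are assumptions.

```python
from collections import defaultdict


def acl_filter(doc_ids: list[str], doc_groups: dict[str, set[str]],
               user_groups: set[str]) -> list[str]:
    """Keep only documents that share a group with the user (fail closed)."""
    # A document with no ACL entry gets an empty group set and is dropped.
    return [d for d in doc_ids if doc_groups.get(d, set()) & user_groups]


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]   # hypothetical lexical ranking
    dense = ["b", "d", "a"]  # hypothetical dense ranking
    acl = {"a": {"eng"}, "b": {"sre"}, "c": {"eng"}, "d": {"hr"}}
    user = {"eng"}
    allowed = [acl_filter(r, acl, user) for r in (bm25, dense)]
    print(rrf_fuse(allowed))
```

Filtering before fusion keeps forbidden documents out of the candidate set entirely. In a real deployment the filter would more likely be pushed into the index query itself, so that restricted documents never consume top-K slots; the fused list would then go to the cross-encoder reranker.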

Follow-up Questions

A user reports the system confidently cited the wrong document. Walk through how you'd debug whether the failure is in retrieval (wrong chunks surfaced) or generation (right chunks, bad synthesis), and what you'd change for each.
Your embedding model is upgraded to a new version. What is your re-indexing and rollout plan, and how do you avoid a mixed-embedding-space index where old and new vectors are incomparable?
How would you extend the system to support multi-hop questions whose answer requires combining facts from several documents?
The corpus contains near-duplicate documents (e.g., copies of the same runbook). How does this hurt retrieval, and how would you handle it?
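
For the near-duplicate follow-up, one simple starting point is to compare documents by the Jaccard similarity of their word shingles and keep a single representative per group. The sketch below is illustrative only: `shingles`, `jaccard`, `dedupe`, and the 0.8 threshold are assumptions, and the pairwise comparison is quadratic, so a corpus at the stated scale would need an approximate scheme such as MinHash with locality-sensitive hashing.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping n-word tuples from lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return IDs to keep, dropping any doc too similar to one already kept."""
    kept_ids: list[str] = []
    kept_shingles: list[set] = []
    for doc_id in sorted(docs):  # sorted for a deterministic representative
        current = shingles(docs[doc_id])
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept_ids.append(doc_id)
            kept_shingles.append(current)
    return kept_ids
```

Without some form of deduplication, copies of the same runbook can occupy several of the top-K slots and crowd out other relevant sources. An alternative to dropping copies at ingestion is to keep them in the index and collapse duplicates after retrieval, which preserves each copy's own ACLs and citation target.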
