Design and optimize a RAG system

Last updated: Jun 24, 2026

Quick Overview

This question evaluates your ability to design and optimize a Retrieval-Augmented Generation (RAG) system across its components: ingestion, chunking, embeddings, indexing, retrieval, reranking, generation, and evaluation.

Company: OpenAI

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Onsite

Hints

```hint Where to start
Frame it as a two-stage pipeline: an **offline/asynchronous indexing path** (ingest → parse → chunk → embed → index) and an **online query path** (query understanding → retrieve → rerank → generate → verify). Sketch both separately so latency budgets and freshness concerns live in the right place.
```

```hint Retrieval quality
The single highest-leverage relevance move is usually **hybrid retrieval** (sparse lexical like BM25 + dense vector) fused with a method like Reciprocal Rank Fusion, followed by a **cross-encoder reranker** on the top-N candidates. Think about why pure dense retrieval fails on exact tokens (error codes, IDs, version strings).
```

```hint Grounding
Reducing hallucination is not one trick but a chain: **grounded prompting** ("answer only from the provided sources"), **citation enforcement**, **confidence gating** (abstain when top scores are low), and an optional **faithfulness/entailment check** on the generated answer against the retrieved passages.
```
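The retrieval-quality hint names Reciprocal Rank Fusion without spelling it out. As a minimal sketch, assuming each retriever returns an ordered list of chunk IDs, fusion could look like the following. The function name, the sample IDs, and the smoothing constant `k = 60` (a commonly used default) are assumptions for illustration, not part of the question.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a lexical and a dense retriever.
bm25_hits = ["err-4012-runbook", "deploy-guide", "oncall-faq"]
dense_hits = ["deploy-guide", "rollback-design", "err-4012-runbook"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because only ranks are used, the lexical and dense scores never need to be put on a common scale, which is the main reason this kind of fusion is a popular starting point.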

Dec 15, 2025, 12:00 AM

Scenario

You are building a Retrieval-Augmented Generation (RAG) system for question answering over an internal document corpus (engineering wikis, design docs, runbooks, support tickets). Users ask natural-language questions in an interactive chat surface and expect grounded answers with citations back to source documents.

Task

Design the end-to-end architecture and describe the optimization strategies you would use to make this system high-quality, low-latency, and trustworthy in production.

Your design should cover, at minimum:

  • The full set of components: ingestion, parsing, chunking, embeddings, indexing, retrieval, reranking, and generation.
  • How you improve retrieval relevance and reduce hallucinations.
  • An evaluation plan (offline + online) and monitoring for corpus and embedding drift.

The corpus is updated continuously (new and edited documents), contains long documents, and spans heterogeneous formats (PDF, HTML, wiki pages, Markdown).
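Long documents in mixed formats make chunking one of the first design decisions. As a rough sketch, assuming the parsing stage has already normalized every format to Markdown, a structure-aware splitter could keep chunks inside heading boundaries and use an overlapping word window for long sections. The function name and the size parameters are hypothetical and would need tuning against retrieval metrics.

```python
import re


def chunk_markdown(text, max_words=200, overlap=40):
    """Split Markdown into chunks that never cross a heading boundary.

    Every chunk records its section heading so a citation can point back
    to the section. max_words and overlap are illustrative assumptions.
    """
    chunks = []
    heading = ""
    body = []

    def flush():
        # Emit the current section as overlapping word windows.
        words = " ".join(body).split()
        step = max_words - overlap
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append({"heading": heading, "text": " ".join(window)})
            if start + max_words >= len(words):
                break

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()  # close the previous section before starting a new one
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Deploy\nRun the pipeline.\n## Rollback\nRevert the release."
sample_chunks = chunk_markdown(sample)
```

Tables and code blocks would need extra handling so they are not cut mid-row or mid-function; this sketch leaves that out.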

Constraints & Assumptions

State your own numbers explicitly, but a reasonable baseline to design against:

  • Corpus on the order of 10^6–10^7 chunks after splitting.
  • Interactive latency target: p95 < 2 s from query submission to first streamed token (retrieval + rerank + first token); the rest of the answer streams.
  • Ingestion freshness SLA: a new or edited document is retrievable within minutes, not hours.
  • Documents carry access-control metadata (ACLs); not every user may see every document.
  • You may assume access to a managed or self-hosted vector index, a lexical/BM25 index, an embedding model, a cross-encoder reranker, and a generation LLM.
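The ACL constraint means no forbidden chunk should ever reach the reranker or the prompt. A minimal sketch, assuming each chunk carries the set of groups allowed to read its source document (the `Chunk` type and `visible_chunks` helper are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document


def visible_chunks(candidates, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [c for c in candidates if c.allowed_groups & user_groups]


candidates = [
    Chunk("wiki-1", "How to deploy", frozenset({"eng"})),
    Chunk("fin-7", "Quarterly budget", frozenset({"finance"})),
    Chunk("run-3", "Pager rotation", frozenset({"eng", "sre"})),
]
allowed = visible_chunks(candidates, {"sre"})
```

In a real deployment the same predicate would more likely be pushed into the index query as a metadata filter, so that the top-K is not starved by chunks that are dropped afterwards; a post-retrieval check like this one then acts as a second line of defense.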

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. For example:

  • What is the expected query volume (QPS) and the read/write ratio (query rate vs. ingestion rate)?
  • Are answers single-turn or conversational (does the retriever need to resolve follow-up references against chat history)?
  • How strict are the access-control and data-isolation requirements — must we prevent even leaking the existence of a forbidden document?
  • What is the cost budget per query (embedding calls, reranker calls, generation tokens)?
  • How will quality be judged: is there an existing labeled eval set, or do we need to bootstrap one?
  • What is the tolerance for "I don't know" / abstention answers versus always producing something?
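The abstention question above maps directly onto a confidence gate in front of generation. One possible shape, assuming the reranker emits a relevance score per passage and that a threshold has been tuned on a labeled eval set (the `min_score` value and both function names are hypothetical):

```python
def select_context(scored_passages, min_score=0.5, max_passages=5):
    """Return the passages to ground the answer on, or None to abstain.

    scored_passages: list of (passage_id, reranker_score) in any order.
    min_score is an assumed threshold, not a recommended value.
    """
    ranked = sorted(scored_passages, key=lambda p: p[1], reverse=True)
    kept = [p for p in ranked if p[1] >= min_score][:max_passages]
    return kept or None


def answer_or_abstain(scored_passages):
    """Gate generation: abstain when no passage clears the threshold."""
    context = select_context(scored_passages)
    if context is None:
        return "I don't know: no sufficiently relevant source was found."
    cited = ", ".join(pid for pid, _ in context)
    # Placeholder for the LLM call; only the cited passages would be sent.
    return f"(generate answer grounded in: {cited})"
```

Raising the threshold trades answer coverage for fewer unsupported answers, which is why the tolerance for "I don't know" is worth asking about up front.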

What a Strong Answer Covers

The interviewer is looking for breadth across the pipeline and depth on the parts that matter most for RAG quality. Dimensions to hit:

  • End-to-end component decomposition with a clear split between the asynchronous indexing path and the synchronous query path, and where each latency/freshness budget lives.
  • Parsing & chunking strategy for heterogeneous, long documents — structure-aware splitting, chunk size/overlap tradeoffs, and preserving tables/code/headings.
  • Retrieval design — hybrid (sparse + dense) retrieval, metadata filtering (including ACL filtering), fusion, and a reranking stage, with the latency/quality tradeoff articulated.
  • Generation & grounding — prompt construction, citation enforcement, confidence gating / abstention, and post-hoc faithfulness verification.
  • Continuous ingestion — incremental indexing, handling edits/deletes (not just appends), and index freshness.
  • Evaluation — separate retrieval metrics (Recall@K, nDCG) from answer metrics (correctness, faithfulness/groundedness), plus an online feedback loop.
  • Monitoring & drift — index freshness, retrieval score distributions, query/corpus drift, and how failures feed back into the eval set.
  • Tradeoff reasoning — the candidate should name where they're trading latency for quality or cost, not present one fixed answer.
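The retrieval metrics named in the evaluation bullet are simple enough to compute by hand, which helps when bootstrapping an eval set. A small sketch, assuming chunk-level relevance labels and the linear-gain form of DCG (the function names and sample IDs are made up for illustration):

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded labels: gains maps chunk ID -> relevance grade."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    actual = dcg([gains.get(cid, 0) for cid in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


retrieved = ["c3", "c1", "c9", "c2"]
gains = {"c1": 3, "c2": 2, "c5": 1}
recall = recall_at_k(retrieved, set(gains), k=3)
ndcg = ndcg_at_k(retrieved, gains, k=3)
```

Recall@K answers "did the right chunk make it into the candidate set at all", while nDCG also rewards putting it near the top, so the two are usually tracked separately for the retriever and the reranker.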

Follow-up Questions

  • A user reports the system confidently cited the wrong document. Walk through how you'd debug whether the failure is in retrieval (wrong chunks surfaced) or generation (right chunks, bad synthesis), and what you'd change for each.
  • Your embedding model is upgraded to a new version. What is your re-indexing and rollout plan, and how do you avoid a mixed-embedding-space index where old and new vectors are incomparable?
  • How would you extend the system to support multi-hop questions whose answer requires combining facts from several documents?
  • The corpus contains near-duplicate documents (e.g., copies of the same runbook). How does this hurt retrieval, and how would you handle it?
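For the near-duplicate follow-up, one way to make the discussion concrete is a word-shingle Jaccard check applied to the ranked passages before they go into the prompt. This is only a sketch under assumed parameters (3-word shingles, a hypothetical similarity threshold); at corpus scale a candidate would more plausibly reach for MinHash or a similar approximate method at ingestion time.

```python
def shingles(text, n=3):
    """Set of lowercased word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(passages, threshold=0.8):
    """Greedy pass over ranked passages: keep a passage only if it is not
    too similar to one already kept. threshold is an assumed value."""
    kept, kept_shingles = [], []
    for text in passages:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept
```

Without a step like this, copies of the same runbook can occupy several of the top-K slots and crowd out the one passage that actually differs.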
