##### Question

Design a production retrieval-augmented generation (RAG) system for enterprise document QA. Walk through the end-to-end architecture and justify the key choices, then address the operational concerns below.

1. **Requirements and SLOs.** Support multi-tenant isolation and authorization with zero cross-tenant data leakage. Handle ingestion of heterogeneous documents (PDF, HTML, emails, spreadsheets, code) at up to 10M docs/day. Meet near-real-time freshness (under 5 minutes from arrival to searchable) and latency targets of P50 ≤ 800 ms and P95 ≤ 2 s per query. Enforce strong PII handling (encryption at rest/in transit, redaction) and per-1k-query budget constraints.
2. **Ingestion and chunking.** Describe connectors and change capture, parsing/normalization of heterogeneous formats, the chunking strategy (window size / stride / overlap, metadata, handling of tables and code), content-hash dedup on ingest, and PII redaction during preprocessing.
3. **Embeddings.** Choose embedding model(s) and dimensionality. Discuss multilingual vs domain-specialized (e.g. code) embeddings, multi-vector representations per chunk, normalization/similarity, and embedding-model versioning for migrations.
4. **Indexing and ANN.** Pick the vector index type and parameters (e.g. HNSW vs IVF-PQ, recall/latency/memory trade-offs), the sparse (BM25) index, sharding and replication strategy, hot/warm/cold tiering, and consistency / zero-downtime reindex. Include memory and capacity-sizing math.
5. **Retrieval and re-ranking.** Specify query understanding, hybrid (sparse + dense) candidate generation with ACL filters and time decay, score fusion and dedup, cross-encoder re-ranking, and an answerability threshold.
6. **Prompt orchestration.** Cover context-window budgeting, context packing and deduplication, citation/grounding requirements, tool calls, and generator selection / model routing across quality and cost tiers.
7. **Hallucination mitigation and safety.** Detail attribution/coverage checks, abstention and refusal policy, guardrails, prompt-injection defenses on retrieved content, and runtime PII/compliance enforcement.
8. **Caching and freshness.** Describe query/result, retrieval, embedding, and reranker caches, cache-key design and invalidation on document updates, and incremental/CDC-driven freshness with online index swaps.
9. **Multilingual handling.** Explain language-aware analyzers and cross-lingual retrieval/generation.
10. **Scalability, capacity planning, and resilience.** Discuss autoscaling of each stage, capacity planning for ingestion and the query path, latency budgets per stage, and disaster recovery (multi-AZ/cross-region replication, snapshots, RTO/RPO tiers).
11. **Observability and evaluation.** Define retrieval and answer-quality metrics, drift monitoring, end-to-end tracing, offline gold sets and synthetic data, online A/B tests, and human feedback / active-learning loops.
12. **Cost controls and trade-offs.** Provide a per-1k-query cost model, levers to control cost, and an explicit discussion of accuracy vs latency vs cost trade-offs.
13. **Fallback strategies.** Describe behavior when retrieval is weak (clarification, scope expansion, refusal).
14. **Interfaces.** Provide an API design and a data schema for documents/embeddings.
15. **Deployment and rollout.** Compare cloud vs on-prem deployment, and give a phased rollout plan with A/B experiments and rollback.
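The "memory and capacity-sizing math" requested under Indexing and ANN can be started from the stated ingestion target of 10M docs/day. The sketch below is only illustrative: the chunks-per-document count, the embedding dimension, float32 storage, and the HNSW `M` value are assumptions that the question does not specify, and the link estimate is a rough approximation that counts only base-layer neighbor lists.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_rate_per_sec(docs_per_day: int) -> float:
    """Average documents per second implied by a daily ingestion volume."""
    return docs_per_day / SECONDS_PER_DAY


def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def hnsw_link_bytes(num_vectors: int, m: int, bytes_per_link: int = 4) -> int:
    """Rough HNSW graph overhead: up to 2*M neighbor ids per node on the
    base layer. Upper layers and allocator overhead are ignored."""
    return num_vectors * 2 * m * bytes_per_link


# Stated in the question.
DOCS_PER_DAY = 10_000_000

# Assumptions for illustration only (not given in the question).
CHUNKS_PER_DOC = 10
EMBEDDING_DIM = 768
HNSW_M = 16

chunks_per_day = DOCS_PER_DAY * CHUNKS_PER_DOC
docs_per_sec = ingest_rate_per_sec(DOCS_PER_DAY)
vector_gb = raw_vector_bytes(chunks_per_day, EMBEDDING_DIM) / 1e9
link_gb = hnsw_link_bytes(chunks_per_day, HNSW_M) / 1e9

print(f"average ingest rate: {docs_per_sec:.1f} docs/s")
print(f"new raw vectors per day: {vector_gb:.1f} GB")
print(f"new HNSW links per day (approx.): {link_gb:.1f} GB")
```

Under these assumed values, one day of ingestion averages roughly 116 docs/s and adds about 307 GB of raw float32 vectors plus about 13 GB of graph links per replica, which is the kind of figure that motivates comparing HNSW against a compressed index such as IVF-PQ and planning hot/warm/cold tiering.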

An OpenAI ML system design interview question (MLE technical screen): design a production, multi-tenant retrieval-augmented generation (RAG) system for enterprise document QA. The model answer covers ingestion and chunking, embedding selection, ANN/vector indexing (HNSW vs IVF-PQ), hybrid retrieval and cross-encoder re-ranking, prompt orchestration with citation grounding, hallucination mitigation, caching and freshness, multilingual and PII/safety handling, scalability and disaster recovery, observability and offline/online evaluation, an API and data schema, cost controls, and a phased rollout. It tests trade-off reasoning across recall, latency, cost, and compliance.
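To make the hybrid retrieval step concrete, here is a minimal sketch of fusing a sparse (BM25) ranking with a dense (ANN) ranking. It uses reciprocal rank fusion, which is one common fusion choice and not necessarily the one the model answer uses. The function name `rrf_fuse`, the constant `k = 60`, and the chunk ids are hypothetical, and ACL filtering is assumed to have happened before fusion.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    that rank well in more than one list rise to the top. Only ranks are
    used, which avoids calibrating BM25 scores against cosine similarities.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        seen = set()
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in seen:
                # Count each id once per list, at its best rank.
                continue
            seen.add(chunk_id)
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical candidate lists, already restricted to chunks the caller may read.
sparse_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = rrf_fuse([sparse_hits, dense_hits])
print(fused)
```

The fused list would then be passed to the cross-encoder re-ranker, which scores only this short candidate set and therefore keeps the expensive model within the query latency budget.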

How do I approach ML System Design interview questions?

ML System Design questions require a solid understanding of core concepts and regular practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during Technical Screen rounds at OpenAI.

What role is this question designed for?

This question is commonly asked of Machine Learning Engineer candidates at OpenAI during technical interviews.

Design a production RAG system | OpenAI Interview Question
