Design a production RAG system
Company: OpenAI
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
Design a production retrieval-augmented generation (RAG) system for enterprise document QA. Specify the end-to-end architecture and justify the key choices:
- Ingestion and chunking strategy (window/stride, metadata, handling of tables and code)
- Embedding model selection and dimensionality
- ANN index type and parameters (e.g., HNSW or IVF; recall/latency trade-offs)
- Retrieval pipeline (hybrid BM25 + dense retrieval, metadata filters, time decay, multi-vector retrieval, cross-encoder re-ranking)
- Prompt orchestration (grounding, citations, tool calls)
- Context packing and deduplication
- Hallucination mitigation (attribution checks, answerability thresholds, refusal policy)
- Caching layers (query, result, and vector caches)
- Freshness and incremental index updates
- Multilingual handling
- Safety and PII redaction

Also detail scalability (sharding, replication, vector store choice), latency budgets and SLAs, observability (retrieval and answer-quality metrics, drift monitoring), offline and online evaluation (gold sets, synthetic data, A/B tests), human feedback loops, cost controls, and fallback strategies when retrieval is weak. Provide an API design, a data schema for documents and embeddings, and a rollout plan.
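A strong answer usually makes the hybrid-retrieval stage concrete. As one illustrative sketch (not the only valid design), Reciprocal Rank Fusion (RRF) is a common way to merge a BM25 ranking with a dense-vector ranking without calibrating their incomparable scores; the constant `k` and the document IDs below are assumptions for the example:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion.

    rankings: list of ranked ID lists (best first), e.g. one from BM25
    and one from the ANN index. Each doc scores sum(1 / (k + rank));
    k=60 is a conventional default that dampens top-rank dominance.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers disagree; RRF rewards consensus.
bm25 = ["d1", "d2", "d3", "d4"]
dense = ["d3", "d1", "d5", "d2"]
print(reciprocal_rank_fusion([bm25, dense]))
# → ['d1', 'd3', 'd2', 'd5', 'd4']
```

RRF is attractive in production because it needs no score normalization across retrievers; a cross-encoder re-ranker can then be applied to only the fused top-k.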
Quick Answer: This question evaluates a candidate's ability to design an end-to-end Retrieval-Augmented Generation (RAG) system for enterprise document QA, covering architecture choices for ingestion and chunking, embedding strategy, ANN vector indexing, hybrid retrieval and re-ranking, prompt orchestration, safety and PII controls, multilingual support, scalability, observability, API design, and rollout planning. It is commonly asked in ML System Design and information-retrieval/NLP interviews to assess reasoning about trade-offs among recall, latency, cost, and compliance; it tests both high-level architectural judgment (conceptual understanding) and implementation-level production detail (practical application).
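To illustrate the window/stride chunking the question asks about, here is a minimal token-level sketch; the window and stride sizes are placeholder assumptions, and production pipelines often chunk on sentence or structural boundaries and attach per-chunk metadata instead:

```python
def chunk_text(tokens, window=200, stride=150):
    """Split a token list into overlapping fixed-size chunks.

    The overlap (window - stride tokens) preserves context across chunk
    boundaries, so a sentence cut at one chunk's edge still appears
    whole in its neighbor. Stride <= window implies overlap >= 0.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    # Stop once the final window reaches the end of the token list.
    for start in range(0, max(len(tokens) - window + stride, 1), stride):
        chunks.append(tokens[start:start + window])
    return chunks

# Example: 10 tokens, window 4, stride 3 → 1-token overlap per boundary.
print(chunk_text(list(range(10)), window=4, stride=3))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk would then be embedded and stored alongside document metadata (source, section, timestamp) to support the filtering and citation requirements above.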