Design a low-latency RAG system
Company: OpenAI
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
Design a production-grade retrieval-augmented generation (RAG) system for a customer-support assistant with strict latency (target p99 ≤ 1.5 s) and cost constraints.
Specify the end-to-end architecture: document ingestion, chunking strategy, embedding model choice, index type (e.g., vector, BM25, or hybrid), caching layers, re-ranking, prompt orchestration, and safety/guardrail components.
Describe how you will handle document updates and deletions, multi-tenant data isolation, and failure modes (e.g., empty retrieval, timeouts, stale cache).
Propose offline and online evaluation: define key metrics (answer accuracy, hallucination rate, latency, cost), design an experiment plan (e.g., BM25 + cross-encoder re-ranker vs dense-only), and outline an A/B test.
Provide a scaling plan and back-of-the-envelope capacity estimates for 10 million documents and 50 QPS, including index sizing, throughput, and cost.
Finally, explain what you would prioritize and de-scope first if you had only 15 minutes to present a coherent plan under interview time pressure.
Quick Answer: This question evaluates production ML system design competencies, specifically retrieval-augmented generation (RAG) architecture, latency and cost optimization, indexing and retrieval strategies, caching, re-ranking, safety and operational reliability, and evaluation metrics.
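The back-of-the-envelope estimate the question asks for can be sketched as a short calculation. This is a minimal sketch, not a reference answer: only the 10 million documents and 50 QPS come from the question. Every other value (chunks per document, embedding dimension, float32 storage, index overhead factor, re-rank candidate count, cache hit rate, tokens per LLM call) is an illustrative assumption, and the names `Workload`, `estimate`, and `daily_llm_cost` are hypothetical. No token price is built in; the caller supplies one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    num_docs: int = 10_000_000    # given in the question
    qps: float = 50.0             # given in the question
    chunks_per_doc: int = 5       # ASSUMPTION: average chunks per document
    embedding_dim: int = 768      # ASSUMPTION: embedding dimensionality
    bytes_per_dim: int = 4        # ASSUMPTION: float32, no quantization
    index_overhead: float = 1.5   # ASSUMPTION: ANN graph links + metadata
    rerank_candidates: int = 50   # ASSUMPTION: candidates re-ranked per query
    cache_hit_rate: float = 0.3   # ASSUMPTION: share of queries served from cache
    tokens_per_call: int = 2_300  # ASSUMPTION: prompt + completion tokens


def estimate(w: Workload) -> dict:
    """Return rough index-size and throughput figures for the workload."""
    num_vectors = w.num_docs * w.chunks_per_doc
    raw_gb = num_vectors * w.embedding_dim * w.bytes_per_dim / 1e9
    # Only cache misses reach the LLM.
    llm_calls_per_s = w.qps * (1.0 - w.cache_hit_rate)
    return {
        "num_vectors": num_vectors,
        "raw_vectors_gb": raw_gb,
        "index_gb": raw_gb * w.index_overhead,
        "rerank_pairs_per_s": w.qps * w.rerank_candidates,
        "llm_calls_per_s": llm_calls_per_s,
        "llm_tokens_per_day": llm_calls_per_s * 86_400 * w.tokens_per_call,
    }


def daily_llm_cost(tokens_per_day: float, price_per_million_tokens: float) -> float:
    """Daily LLM spend for a caller-supplied (hypothetical) blended token price."""
    return tokens_per_day / 1e6 * price_per_million_tokens


if __name__ == "__main__":
    for name, value in estimate(Workload()).items():
        print(f"{name}: {value:,.1f}")
```

Under these assumed defaults the sketch gives 50 million vectors, about 154 GB of raw float32 vectors, and roughly 230 GB once the assumed overhead factor is applied, which suggests either sharding across nodes or quantizing the vectors. Changing any assumption, for example chunk count or embedding dimension, shifts the result proportionally, so in an interview it is worth stating each one aloud.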