Design an ML-powered enterprise search system using Retrieval-Augmented Generation (RAG) under the following context and constraints.
Assume textual content (no heavy images), standard enterprise auth (OIDC/SAML), and typical query lengths (short questions or keywords). Where details are not stated, make minimal, reasonable assumptions to complete the design; illustrative sketches for each part follow the list below.
(a) Ingestion and chunking: Describe parsing, deduplication, metadata extraction, embedding generation, chunk-size strategy, versioning, and incremental updates.
(b) Indexing and retrieval: Propose a hybrid sparse+vector approach (BM25 + ANN), metadata filters, tenant isolation, query understanding/reformulation, top-k selection, and cross-encoder reranking.
(c) Generation: Outline prompt design, grounding with citations, constrained decoding, tool usage, streaming responses, and multilingual handling.
(d) Guardrails and safety: Methods for hallucination reduction, citation enforcement, out-of-policy refusal, PII/security controls, and ACL-aware retrieval.
(e) Evaluation and monitoring: Offline metrics (e.g., NDCG@k, recall@k, answer faithfulness), online A/B tests, user feedback loops, and drift/latency/cost monitoring.
(f) Architecture and scaling: Service decomposition, model hosting/batching, caching, vector store selection, backpressure, failover, and disaster recovery.
(g) Cost and latency calculations: Derive per-stage latency and cost, a capacity plan for embeddings, the ANN index size, and compute requirements. Justify model choices under the constraints.
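A minimal sketch for part (a), assuming whitespace tokens as a stand-in for the embedding model's tokenizer: hash-addressed chunking with overlap plus an incremental-update filter. The names Chunk, chunk_document, and incremental_update and the 300/50-token sizes are illustrative choices, not requirements.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    content_hash: str
    metadata: dict = field(default_factory=dict)

def chunk_document(doc_id: str, text: str, metadata: dict,
                   target_tokens: int = 300, overlap_tokens: int = 50) -> list[Chunk]:
    """Split a parsed document into overlapping, hash-addressed chunks.

    Whitespace tokens approximate model tokens; a real pipeline would use the
    embedding model's tokenizer. The content hash supports exact-duplicate
    detection and incremental re-indexing (unchanged hashes are skipped).
    """
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        body = " ".join(tokens[start:start + target_tokens])
        content_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        chunks.append(Chunk(
            doc_id=doc_id,
            chunk_id=f"{doc_id}:{start}",
            text=body,
            content_hash=content_hash,
            metadata={**metadata, "token_offset": start},
        ))
        if start + target_tokens >= len(tokens):
            break
        start += target_tokens - overlap_tokens
    return chunks

def incremental_update(new_chunks: list[Chunk], known_hashes: set[str]) -> list[Chunk]:
    """Return only chunks whose content changed since the last ingestion run."""
    return [c for c in new_chunks if c.content_hash not in known_hashes]
```

The same content hash can double as a version key: re-ingesting a document re-embeds only the chunks whose hashes changed.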
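For part (b), a sketch of hybrid fusion assuming each retriever (BM25 and ANN) has already applied tenant and metadata filters, so fusion never sees out-of-tenant chunks. Reciprocal rank fusion with the conventional constant k=60 merges the two ranked lists before cross-encoder reranking; the example ids are made up.

```python
def reciprocal_rank_fusion(result_lists, k: int = 60, top_k: int = 10):
    """Fuse ranked id lists from BM25 and ANN retrieval with reciprocal rank fusion.

    The fused score for a chunk is the sum of 1 / (k + rank) over the lists in
    which it appears; chunks ranked highly by either retriever float to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: candidates from both retrievers for the same (already filtered) tenant.
bm25_hits = ["doc7:0", "doc3:300", "doc9:0"]
ann_hits = ["doc3:300", "doc2:600", "doc7:0"]
print(reciprocal_rank_fusion([bm25_hits, ann_hits]))
```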
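For part (c), one possible grounding prompt layout; the [n] citation convention, the passage fields (chunk_id, text), and the refusal phrasing are assumptions chosen so that citations can be checked mechanically downstream.

```python
def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a grounding prompt in which every passage carries a citation tag.

    Each passage dict is assumed to hold 'chunk_id' and 'text'. The instructions
    ask the model to cite passage numbers and to refuse when the context is
    insufficient, which keeps generation anchored to retrieved evidence.
    """
    context_lines = [
        f"[{i}] ({p['chunk_id']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite every factual statement with its passage number, e.g. [2].\n"
        'If the context does not contain the answer, reply "I don\'t know".\n\n'
        "Context:\n" + "\n".join(context_lines)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```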
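For part (d), a small post-generation guardrail that rejects drafts whose sentences lack valid citations; sentence splitting on punctuation and the [n] marker format match the prompt sketch above and are assumptions. A production system would add NLI-style faithfulness checks, PII scanning, and policy classifiers on top.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def enforce_citations(answer: str, num_passages: int) -> tuple[bool, list[str]]:
    """Verify that each sentence of a drafted answer cites a valid passage.

    Returns (ok, problems); a failed check can trigger regeneration or refusal.
    """
    problems = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(m) for m in CITATION.findall(sentence)]
        if not cited:
            problems.append(f"uncited sentence: {sentence!r}")
        elif any(c < 1 or c > num_passages for c in cited):
            problems.append(f"citation out of range: {sentence!r}")
    return (not problems, problems)
```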
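For part (e), plain implementations of recall@k and NDCG@k over labeled query-chunk judgments; the linear-gain DCG variant shown here is one common choice.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance: dict, k: int) -> float:
    """NDCG@k with graded relevance labels (0 = irrelevant), linear gain."""
    dcg = sum(
        relevance.get(cid, 0) / math.log2(rank + 1)
        for rank, cid in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```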
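For part (f), a sketch of request batching plus a bounded cache in front of a hosted embedding model; embed_fn, the batch size, and the insertion-order eviction policy are assumptions standing in for a real serving stack.

```python
import hashlib
from collections import OrderedDict

class CachedBatchEmbedder:
    """Batch texts before calling the embedding model and cache the vectors.

    embed_fn is an assumed callable that maps a list of texts to a list of
    vectors (for example, a thin wrapper around a hosted model endpoint).
    Batching amortizes per-call overhead; the bounded cache absorbs repeated
    queries and unchanged chunks during re-ingestion.
    """

    def __init__(self, embed_fn, batch_size: int = 64, cache_size: int = 10_000):
        self.embed_fn = embed_fn
        self.batch_size = batch_size
        self.cache_size = cache_size
        self.cache: OrderedDict[str, list[float]] = OrderedDict()

    def _key(self, text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        results = {k: self.cache[k] for k in keys if k in self.cache}
        missing = [(k, t) for k, t in zip(keys, texts) if k not in results]
        # Call the model only for cache misses, in fixed-size batches.
        for i in range(0, len(missing), self.batch_size):
            batch = missing[i:i + self.batch_size]
            vectors = self.embed_fn([t for _, t in batch])
            for (key, _), vec in zip(batch, vectors):
                results[key] = vec
                self.cache[key] = vec
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)  # drop the oldest entry
        return [results[k] for k in keys]
```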
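For part (g), a worked back-of-envelope calculation showing the method; every number (corpus size, chunk counts, embedding dimensionality, prices, latency budgets) is an assumed placeholder to be replaced with measured values.

```python
# Back-of-envelope sizing; every figure is an assumption used to illustrate
# the method, not a measured or quoted number.
docs = 2_000_000                # documents in the corpus
chunks_per_doc = 8              # ~300-token chunks with overlap
dim = 768                       # embedding dimensionality
bytes_per_float = 4             # float32, before any compression/quantization

n_chunks = docs * chunks_per_doc
vector_bytes = n_chunks * dim * bytes_per_float
index_overhead = 1.5            # extra space for ANN graph links and metadata
index_gib = vector_bytes * index_overhead / 2**30

queries_per_day = 200_000
tokens_per_query = 2_500        # prompt (retrieved context) plus completion
usd_per_1k_tokens = 0.002       # blended generation price
gen_cost_per_day = queries_per_day * tokens_per_query / 1_000 * usd_per_1k_tokens

stage_p50_ms = {                # assumed per-stage latency budget
    "query understanding": 30,
    "hybrid retrieval (BM25 + ANN)": 80,
    "cross-encoder rerank, 50 candidates": 120,
    "LLM time to first token": 400,
}

print(f"chunks to embed:       {n_chunks:,}")
print(f"ANN index size:        {index_gib:.1f} GiB")
print(f"generation cost / day: ${gen_cost_per_day:,.0f}")
print(f"p50 to first token:    {sum(stage_p50_ms.values())} ms")
```

Under these assumptions the index (roughly 69 GiB) fits on a single large-memory node; sharding and vector quantization become relevant only as the corpus grows beyond that.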