System Design: ML-Powered Enterprise Search with RAG
Design an ML-powered enterprise search system using Retrieval-Augmented Generation (RAG) under the following context and constraints.
Context and Constraints
- Corpus: 5M documents (avg 2 KB each) sourced from PDFs, web pages, and support tickets.
- Freshness: Updates must be searchable within 5 minutes end-to-end.
- Traffic: 300 QPS, multi-tenant with per-document ACLs (users/groups/roles).
- SLOs: p95 latency ≤ 1.2 s end-to-end; budget ≤ $0.002 per query.
Assume textual content (no heavy images), standard enterprise auth (OIDC/SAML), and typical query lengths (short questions/keywords). If not stated, make minimal, reasonable assumptions to complete the design.
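Before diving into the sub-questions, it helps to turn these constraints into rough numbers. The sketch below derives corpus size, vector-store footprint, and the daily cost ceiling directly from the stated figures; the one-chunk-per-document split, 768-dim embeddings, and float32 storage are illustrative assumptions, not givens.

```python
# Back-of-envelope sizing from the stated constraints.
# Assumptions (not in the prompt): ~1 chunk per 2 KB document,
# 768-dim float32 embeddings.

NUM_DOCS = 5_000_000
AVG_DOC_BYTES = 2 * 1024       # 2 KB average document (given)
CHUNKS_PER_DOC = 1             # a ~500-token doc fits one chunk (assumption)
EMBED_DIM = 768                # assumed embedding dimensionality
BYTES_PER_FLOAT = 4            # float32

raw_corpus_gb = NUM_DOCS * AVG_DOC_BYTES / 1e9
num_chunks = NUM_DOCS * CHUNKS_PER_DOC
vector_gb = num_chunks * EMBED_DIM * BYTES_PER_FLOAT / 1e9

QPS = 300                      # given
BUDGET_PER_QUERY = 0.002       # given
daily_cost_ceiling = QPS * BUDGET_PER_QUERY * 86_400

print(f"raw corpus:   {raw_corpus_gb:.1f} GB")
print(f"vector store: {vector_gb:.1f} GB ({num_chunks:,} x {EMBED_DIM}-dim float32)")
print(f"daily budget: ${daily_cost_ceiling:,.0f} at sustained full load")
```

At these sizes the full vector index fits comfortably in RAM on a small number of nodes, which shapes many of the answers below.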
Sub-Questions
(a) Ingestion and chunking: Describe parsing, deduplication, metadata extraction, embedding generation, chunk-size strategy, versioning, and incremental updates.
(b) Indexing and retrieval: Propose a hybrid sparse+vector approach (BM25 + ANN), metadata filters, tenant isolation, query understanding/reformulation, top-k selection, and cross-encoder reranking.
(c) Generation: Outline prompt design, grounding with citations, constrained decoding, tool usage, streaming responses, and multilingual handling.
(d) Guardrails and safety: Methods for hallucination reduction, citation enforcement, out-of-policy refusal, PII/security controls, and ACL-aware retrieval.
(e) Evaluation and monitoring: Offline metrics (e.g., NDCG@k, recall@k, answer faithfulness), online A/B tests, user feedback loops, and drift/latency/cost monitoring.
(f) Architecture and scaling: Service decomposition, model hosting/batching, caching, vector store selection, backpressure, failover, and disaster recovery.
(g) Cost and latency calculations: Derive per-stage latency and cost budgets, a capacity plan for embedding generation, the ANN index size, and compute requirements. Justify model choices under the constraints.
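As a worked sketch of the kind of derivation (g) asks for, the snippet below allocates the 1.2 s p95 budget across stages and checks a per-query LLM cost against the $0.002 ceiling. The stage split, token counts, and per-token prices are hypothetical assumptions chosen to illustrate the arithmetic.

```python
# Illustrative per-stage budget that respects the SLOs; the stage split,
# token counts, and per-token prices are assumptions, not givens.

P95_BUDGET_MS = 1200  # given SLO
stage_latency_ms = {
    "query understanding + query embedding": 30,
    "hybrid retrieval (BM25 + ANN)": 100,
    "cross-encoder rerank (top ~50)": 150,
    "LLM generation (streamed)": 800,
    "network / auth / overhead": 120,
}
total_ms = sum(stage_latency_ms.values())
assert total_ms <= P95_BUDGET_MS  # budget must close end-to-end

# Cost per query, with hypothetical prices for a small hosted model:
# $0.50 per 1M input tokens, $1.50 per 1M output tokens.
INPUT_PRICE = 0.50 / 1e6
OUTPUT_PRICE = 1.50 / 1e6
context_tokens, output_tokens = 3_000, 300  # assumed prompt + answer sizes

llm_cost = context_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"p95 budget used: {total_ms} ms of {P95_BUDGET_MS} ms")
print(f"LLM cost/query:  ${llm_cost:.5f} (ceiling $0.00200)")
```

Under these assumptions generation dominates both latency and cost, so the $0.002 budget effectively forces a small or heavily batched model and a tight context window.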