System Design: Retrieval-Augmented Generation (RAG) for Enterprise
Context
Design a production-grade, multi-tenant RAG platform for enterprise users. The system must ingest and index heterogeneous internal documents and serve secure, low-latency, cost-efficient, and accurate answers backed by citations.
Assume the following:
- Scale: Up to 10 million documents per day across all tenants.
- Query load: Moderate to high (varies by tenant); design to autoscale.
- Content types: PDFs, HTML/web pages, emails, spreadsheets, and plain text.
- Tenancy: Strong isolation with per-tenant authorization and key management.
- Compliance: PII/PHI presence likely; data residency may be required for some tenants.
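To make the scale assumption concrete, the daily volume can be converted into a sustained ingest rate. The sketch below is only back-of-the-envelope arithmetic; the function name and the 3x peak-to-average ratio are illustrative assumptions, not figures from the prompt.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_second(docs_per_day: int, peak_factor: float = 1.0) -> float:
    """Average ingest rate, optionally scaled by an assumed peak-to-average ratio."""
    return docs_per_day / SECONDS_PER_DAY * peak_factor


# 10 million documents per day comes from the stated scale assumption.
average = docs_per_second(10_000_000)
# The 3x burst factor is a hypothetical value chosen for illustration only.
peak = docs_per_second(10_000_000, peak_factor=3.0)

print(f"average: {average:.1f} docs/s, assumed peak: {peak:.1f} docs/s")
# prints "average: 115.7 docs/s, assumed peak: 347.2 docs/s"
```

A design answer would typically size parsers, embedding workers, and index writers against the peak figure rather than the average, since arrivals are unlikely to be uniform over the day.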
Requirements
- Multi-tenant isolation and authorization.
- Ingestion throughput: Up to 10M docs/day.
- Freshness: < 5 minutes from document arrival to searchable.
- Latency SLOs: P50 ≤ 800 ms, P95 ≤ 2 s per query.
- PII handling: Encryption in transit/at rest, detection/redaction/de-identification.
- Budget: Bounded cost per 1k queries (optimize and justify).
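The latency SLOs above are stated as percentiles, so any check against them needs a percentile definition. The following is a minimal sketch using the nearest-rank method, which is one of several common definitions; the helper names and the sample latencies are hypothetical, and only the 800 ms and 2 s thresholds come from the requirements.

```python
import math

P50_SLO_MS = 800.0   # requirement: P50 <= 800 ms
P95_SLO_MS = 2000.0  # requirement: P95 <= 2 s


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms: list[float]) -> bool:
    """True only if both the P50 and the P95 targets are met."""
    return (
        percentile(latencies_ms, 50) <= P50_SLO_MS
        and percentile(latencies_ms, 95) <= P95_SLO_MS
    )


# Hypothetical per-query latencies in milliseconds.
latencies = [300, 450, 500, 620, 700, 750, 900, 1200, 1500, 2600]
print(percentile(latencies, 50))     # 700
print(percentile(latencies, 95))     # 2600
print(meets_latency_slo(latencies))  # False: the P95 target is missed
```

In this sample the median is comfortably within budget while a single slow query breaks the P95 target, which is why tail latency usually drives decisions about re-ranking depth and generator choice.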
Deliverables
Describe the end-to-end architecture and justify trade-offs among accuracy, latency, and cost.
Include:
- Ingestion, parsing/chunking, metadata extraction, embeddings pipeline.
- Vector index selection, sharding strategy, and hybrid retrieval (sparse + dense).
- Re-ranking, prompt orchestration, context window management, generator selection.
- Response post-processing (citations, formatting, redaction), hallucination mitigation.
- Evaluation: Offline/online metrics, feedback loops, active learning.
- Guardrails/safety, caching, observability (tracing, drift, recall@k dashboards).
- Capacity planning and autoscaling; disaster recovery; deployment options (cloud vs on‑prem).
- A plan to run A/B experiments before rollout.
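As an illustration of the hybrid retrieval (sparse + dense) deliverable listed above, one possible way to merge the two result lists is reciprocal rank fusion (RRF). The prompt does not prescribe a fusion method, so this is a sketch of one option; the document IDs are made up, and k = 60 is a commonly used constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse (keyword) and a dense (embedding) retriever.
sparse_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so it avoids having to normalize sparse and dense scores onto a common scale; a learned or weighted fusion is an alternative when labeled relevance data is available.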