Design a Retrieval‑Augmented Generation (RAG) System for Enterprise Text
Context
You are building a production RAG system that answers employee questions using internal enterprise text (wikis, PDFs, tickets, emails, docs). Data is sensitive and access-controlled. Assume multi-tenant use, mixed document formats, English-first, with the following baseline constraints:
- Corpus: 5–10 million pages, tens of millions of chunks.
- Traffic: 200 QPS peak; target end-to-end p95 latency ≤ 2.0 s with server-streamed tokens.
- Freshness: new or updated content should be searchable within 15 minutes.
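As a rough feasibility check on the constraints above, the 2.0 s p95 target can be apportioned across pipeline stages. The stage names and per-stage numbers below are illustrative assumptions, not part of the stated constraints:

```python
# Hypothetical apportionment of the 2.0 s end-to-end p95 budget.
# Every stage name and number here is an assumption for illustration.
BUDGET_MS = 2000

stage_budget_ms = {
    "auth_and_acl_check": 50,
    "query_embedding": 40,
    "vector_and_keyword_search": 150,
    "reranking": 200,
    "prompt_assembly": 30,
    "llm_first_token": 600,       # streaming to the client starts here
    "llm_remaining_tokens": 900,  # tokens continue streaming
    "overhead_and_network": 30,
}

total = sum(stage_budget_ms.values())
assert total <= BUDGET_MS, f"over budget: {total} ms"
print(f"total: {total} ms of {BUDGET_MS} ms budget")
```

Because tokens are server-streamed, the user-perceived latency is dominated by time-to-first-token, so the retrieval stages must fit comfortably in front of it.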
Tasks
Design the system and specify:
- Ingestion pipeline: chunking strategy, embedding generation, and indexing.
- Retrieval strategy: vector search, hybrid retrieval, and reranking.
- Prompt orchestration: how the LLM is instructed and grounded; how citations are produced.
- Freshness handling: incremental updates, cache invalidation, and time-aware ranking.
- Latency and throughput targets with a rough budget.
- Privacy and security controls for enterprise data.
- Evaluation: measuring relevance and answer quality; datasets and metrics.
- Reducing hallucinations: techniques across retrieval and generation.
- Scale and monitoring: how you would scale, operate, and observe the system in production.
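For the hybrid-retrieval item above, one common way to fuse ranked lists from a vector index and a keyword (e.g., BM25) index is reciprocal rank fusion (RRF). A minimal sketch — the document IDs are made up, and k=60 is the conventional constant, both assumptions for illustration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a vector index and a keyword index.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # doc_a and doc_c lead: each appears in both lists
```

RRF needs no score normalization across retrievers, which makes it a convenient first-stage fusion before a cross-encoder reranker refines the top candidates.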