Design a Retrieval‑Augmented Generation (RAG) System for Enterprise Text
Context
You are building a production RAG system that answers employee questions using internal enterprise text (wikis, PDFs, tickets, emails, docs). Data is sensitive and access-controlled. Assume multi-tenant use, mixed document formats, English-first, with the following baseline constraints:
- Corpus: 5–10 million pages, tens of millions of chunks.
- Traffic: 200 QPS peak; target end-to-end p95 latency ≤ 2.0 s with server-streamed tokens.
- Freshness: new or updated content should be searchable within 15 minutes.
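As a rough feasibility check on the constraints above, the 2.0 s p95 target can be apportioned across pipeline stages. The stage names and per-stage numbers below are illustrative assumptions, not part of the stated constraints:

```python
# Hypothetical apportionment of the 2.0 s end-to-end p95 budget.
# Every stage name and number here is an assumption for illustration.
BUDGET_MS = 2000

stage_budget_ms = {
    "auth_and_acl_check": 50,
    "query_embedding": 40,
    "vector_and_keyword_search": 150,
    "reranking": 200,
    "prompt_assembly": 30,
    "llm_first_token": 600,       # streaming to the client starts here
    "llm_remaining_tokens": 900,  # tokens continue streaming
    "overhead_and_network": 30,
}

total = sum(stage_budget_ms.values())
assert total <= BUDGET_MS, f"over budget: {total} ms"
print(f"total: {total} ms of {BUDGET_MS} ms budget")
```

Because tokens are server-streamed, the user-perceived latency is dominated by time-to-first-token, so the retrieval stages must fit comfortably in front of it.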
Tasks
Design the system and specify:
- Ingestion pipeline: chunking strategy, embedding generation, and indexing.
- Retrieval strategy: vector search, hybrid retrieval, and reranking.
- Prompt orchestration: how the LLM is instructed and grounded; how citations are produced.
- Freshness handling: incremental updates, cache invalidation, and time-aware ranking.
- Latency and throughput targets with a rough budget.
- Privacy and security controls for enterprise data.
- Evaluation: measuring relevance and answer quality; datasets and metrics.
- Reducing hallucinations: techniques across retrieval and generation.
- Scale and monitoring: how you would scale, operate, and observe the system in production.
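For the hybrid-retrieval item above, one common way to fuse ranked lists from a vector index and a keyword (e.g., BM25) index is reciprocal rank fusion (RRF). A minimal sketch — the document IDs are made up, and k=60 is the conventional constant, both assumptions for illustration:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in;
    documents ranked highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a vector index and a keyword index.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # doc_a and doc_c lead: each appears in both lists
```

RRF needs no score normalization across retrievers, which makes it a convenient first-stage fusion before a cross-encoder reranker refines the top candidates.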