Design a Retrieval-Augmented Generation (RAG) System
Company: xAI
Role: Software Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
Design a retrieval-augmented generation (RAG) system for a production question-answering product. Users ask natural-language questions, and the system must answer them **grounded in a large, continuously updated document corpus** (e.g., internal documentation, knowledge-base articles, and crawled pages) rather than from the LLM's parametric memory alone.
You own the design end to end:
- the **offline pipeline** that ingests, processes, and indexes the corpus, and
- the **online serving path** that, given a user query, retrieves relevant context, assembles a prompt, and produces a grounded answer with citations to the source passages.
Walk through the architecture, the key design decisions and their trade-offs, and how you would evaluate and monitor answer quality in production.
```hint Decompose into two planes
Split the system into an **offline indexing plane** (parse → chunk → embed → index) and an **online query plane** (embed query → retrieve → rerank → assemble prompt → generate → post-check). Design and scale each plane independently — they have completely different latency, throughput, and consistency requirements.
```
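To make the "chunk" step of the offline plane concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_tokens`, the window sizes, and the use of whitespace splitting as a stand-in for a real tokenizer are all assumptions for illustration; a production pipeline would more likely chunk along document structure (headings, paragraphs) and count model tokens.

```python
def chunk_tokens(tokens, size=200, overlap=40):
    """Split a token list into windows of `size` tokens that share `overlap` tokens.

    Hypothetical helper: sizes are illustrative, not recommendations.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the document.
        if start + size >= len(tokens):
            break
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text.split(), size=4, overlap=1)
```

The overlap trades index size for robustness: a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.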
```hint Retrieval quality
Pure vector search misses exact identifiers, names, and rare terms; pure lexical search misses paraphrases. Consider **hybrid retrieval** (BM25 + dense embeddings, merged with reciprocal rank fusion) followed by a **cross-encoder reranker** over a small candidate set. Chunking granularity is the other big lever: small chunks retrieve precisely but lose context; large chunks dilute the embedding.
```
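As an illustration of the fusion step mentioned in the hint, the sketch below merges a lexical and a dense ranking with reciprocal rank fusion: each document scores the sum of `1 / (k + rank)` over the lists it appears in. The function name, the document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions, not part of the question.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Ranks start at 1; `k` dampens the influence of top positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retriever outputs for a single query.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities, which live on unrelated scales. The fused list would then be truncated to a small candidate set before the cross-encoder reranker runs.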
```hint Evaluating a RAG system
Evaluate the stages separately: **retrieval** with labeled (query, relevant-passage) pairs and recall@k / MRR, and **end-to-end generation** with groundedness/faithfulness (is every claim supported by the retrieved context?) and answer relevance, typically via a calibrated LLM-as-judge plus periodic human review. Failures can originate at either stage: the retriever can surface the right passages and the generator still misuse them, or the generator can faithfully summarize the wrong passages. You need to know which stage failed.
```
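The retrieval metrics named in the hint can be computed with a few lines. This is a minimal sketch assuming each evaluation example is a pair of (retrieved passage IDs in rank order, set of labeled relevant IDs); the function names and the sample data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant passages that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k):
    """Return (mean recall@k, MRR) over a non-empty list of labeled runs."""
    n = len(runs)
    mean_recall = sum(recall_at_k(r, g, k) for r, g in runs) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return mean_recall, mrr


# Hypothetical labeled queries: (retrieved IDs, relevant IDs).
runs = [
    (["p3", "p1", "p9"], {"p1"}),        # first hit at rank 2
    (["p4", "p5", "p6"], {"p6", "p7"}),  # one of two relevant, at rank 3
    (["p2", "p8", "p0"], {"p5"}),        # complete miss
]
mean_recall, mrr = evaluate(runs, k=3)
```

Tracking these per retrieval stage (lexical, dense, fused, reranked) shows where relevant passages are being lost, which the end-to-end judge score alone cannot tell you.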
### Constraints & Assumptions
- Corpus: assume on the order of $10^7$ documents (~100 GB of raw text), heterogeneous formats (HTML, Markdown, PDF), mostly English.
- Freshness: documents are added and edited continuously; changes should be retrievable within minutes, not days.
- Load: assume peak traffic in the low hundreds of QPS for the answer endpoint.
- Latency: end-to-end p95 of a few seconds is acceptable (generation dominates); the retrieval stack should stay within a ~300 ms budget.
- Answers must include citations to the source passages, and the system should say it cannot answer rather than guess when the corpus has no support.
- The LLM has a large but finite context window, and per-token cost makes "stuff everything into the prompt" uneconomical.
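A quick back-of-envelope calculation turns the corpus constraint into an index size. Everything beyond the ~100 GB figure is an assumption chosen for illustration: roughly 4 bytes of English text per token, non-overlapping 512-token chunks, and 768-dimensional float32 embeddings.

```python
RAW_TEXT_BYTES = 100e9   # from the constraints: ~100 GB of raw text
BYTES_PER_TOKEN = 4      # assumption: rough average for English text
CHUNK_TOKENS = 512       # assumption: chunk size, no overlap
EMBED_DIM = 768          # assumption: embedding dimensionality
BYTES_PER_FLOAT32 = 4

total_tokens = RAW_TEXT_BYTES / BYTES_PER_TOKEN
num_chunks = total_tokens / CHUNK_TOKENS
vector_bytes = num_chunks * EMBED_DIM * BYTES_PER_FLOAT32

vector_gb = vector_bytes / 1e9
# int8 scalar quantization stores one byte per dimension instead of four.
vector_gb_int8 = vector_gb / 4

print(f"chunks: {num_chunks / 1e6:.1f}M")
print(f"float32 vectors: {vector_gb:.0f} GB, int8: {vector_gb_int8:.1f} GB")
```

Under these assumptions the raw vectors come to roughly 150 GB across about 49 million chunks, which is more than the source text itself. That suggests sharding the vector index or compressing it (for example with quantization), and it gives a starting point for the cost discussion.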
### Clarifying Questions to Ask
- What exactly is the corpus — size, formats, languages — and how frequently does it change?
- What are the latency, throughput, and per-query cost targets?
- Do answers require strict grounding with citations, and what is the desired behavior when no relevant document exists?
- Is there document-level access control (different users allowed to see different documents)?
- Are queries single-turn, or conversational with follow-ups that need query rewriting?
- Is the LLM a hosted API or self-hosted, and can we also self-host embedding/reranking models?
### Follow-up Questions
- How would you support multi-hop questions whose answer requires combining evidence from several documents?
- You upgrade the embedding model: how do you re-embed $10^7$ documents without downtime or a retrieval-quality dip during the migration?
- How do you enforce per-user document permissions in retrieval without destroying latency or leaking excluded content into answers?
- If you had to cut serving cost by roughly 5x with only marginal quality loss, which levers would you pull first, and why?
Quick Answer: This question tests whether you can design and engineer a retrieval-augmented generation (RAG) system: the offline indexing and online query planes, retrieval quality, prompt assembly, grounded generation with citations, and production monitoring. It falls under ML System Design.