How do I approach ML System Design interview questions?

ML System Design questions require an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

What difficulty level is this interview question?

This is a medium difficulty ML System Design question, commonly asked during Onsite rounds at xAI.

What role is this question designed for?

This question is commonly asked for Software Engineer candidates at xAI during technical interviews.

Design a Retrieval-Augmented Generation (RAG) System

This ML System Design question evaluates understanding of, and engineering competency with, retrieval-augmented generation (RAG) systems. It covers the offline indexing plane and the online query plane, retrieval quality, prompt assembly, grounded generation with citations, and production monitoring.

Design a retrieval-augmented generation (RAG) system for a production question-answering product. Users ask natural-language questions, and the system must answer them grounded in a large, continuously updated document corpus (e.g., internal documentation, knowledge-base articles, and crawled pages) rather than from the LLM's parametric memory alone.

You own the design end to end:

the offline pipeline that ingests, processes, and indexes the corpus, and
the online serving path that, given a user query, retrieves relevant context, assembles a prompt, and produces a grounded answer with citations to the source passages.

Walk through the architecture, the key design decisions and their trade-offs, and how you would evaluate and monitor answer quality in production.
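The online serving path described above (retrieve, assemble a prompt, answer with citations, abstain when there is no support) can be sketched in a few lines. This is a toy illustration, not a reference design: the `Chunk` type, the term-overlap scorer standing in for a real retriever, the `min_score` threshold, and the optional `llm` callable are all hypothetical names and choices that do not come from the question.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str   # hypothetical identifier used for citations
    text: str


def tokenize(text):
    """Lowercase alphanumeric terms as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, index, k=3, min_score=0.2):
    """Toy retriever: score = fraction of query terms found in the chunk.

    A production system would use a lexical and/or vector index here;
    the overlap score is only a stand-in so the sketch runs on its own.
    """
    q = tokenize(query)
    if not q:
        return []
    scored = [(len(q & tokenize(c.text)) / len(q), c) for c in index]
    scored = [(s, c) for s, c in scored if s >= min_score]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return scored[:k]


def build_prompt(query, hits):
    """Number each passage so the model can cite it as [1], [2], ..."""
    if not hits:
        return None
    sources = "\n".join(
        f"[{i}] ({c.doc_id}) {c.text}" for i, (_, c) in enumerate(hits, 1)
    )
    return (
        "Answer using only the sources below and cite them as [n]. "
        "If they do not contain the answer, say you cannot answer.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query, index, llm=None):
    """Abstain before generation when retrieval finds no support."""
    prompt = build_prompt(query, retrieve(query, index))
    if prompt is None:
        return "I cannot answer this from the available documents."
    # llm is a hypothetical callable(str) -> str; without one, return the
    # assembled prompt so it can be inspected.
    return llm(prompt) if llm else prompt


if __name__ == "__main__":
    index = [
        Chunk("kb/security", "Rotate API keys from the security settings page."),
        Chunk("kb/billing", "The billing dashboard shows monthly invoices."),
        Chunk("kb/limits", "API rate limits reset every minute."),
    ]
    print(answer("How do I rotate API keys?", index))
    print(answer("What is the capital of France?", index))
```

Abstaining at the retrieval stage is only one of several possible places to enforce the "cannot answer" behavior; a real design would likely also instruct the model to abstain and verify citations after generation.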

Constraints & Assumptions

Corpus: assume on the order of $10^7$ documents (~100 GB of raw text), heterogeneous formats (HTML, Markdown, PDF), mostly English.
Freshness: documents are added and edited continuously; changes should be retrievable within minutes, not days.
Load: assume peak traffic in the low hundreds of QPS for the answer endpoint.
Latency: end-to-end p95 of a few seconds is acceptable (generation dominates); the retrieval stack should stay within a ~300 ms budget.
Answers must include citations to the source passages, and the system should say it cannot answer rather than guess when the corpus has no support.
The LLM has a large but finite context window, and per-token cost makes "stuff everything into the prompt" uneconomical.
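A quick back-of-envelope calculation helps turn these constraints into index and prompt sizes. The sketch below takes only the corpus size and document count from the constraints; the bytes-per-token ratio, chunk size, overlap, embedding dimension, top-k, and the specific QPS figure are illustrative assumptions, so treat the outputs as order-of-magnitude estimates.

```python
# Back-of-envelope sizing. Only RAW_BYTES and NUM_DOCS come from the stated
# constraints; every other constant is an assumption for illustration.

RAW_BYTES = 100e9        # ~100 GB of raw text (from the constraints)
NUM_DOCS = 1e7           # ~10^7 documents (from the constraints)

PEAK_QPS = 300           # assumption: one reading of "low hundreds of QPS"
BYTES_PER_TOKEN = 4      # assumption: rough average for English text
CHUNK_TOKENS = 512       # assumption: chunk length
OVERLAP_TOKENS = 64      # assumption: overlap between adjacent chunks
EMBED_DIM = 768          # assumption: embedding dimensionality
BYTES_PER_FLOAT = 4      # float32 storage, no quantization
TOP_K = 8                # assumption: chunks placed in each prompt


def estimate():
    total_tokens = RAW_BYTES / BYTES_PER_TOKEN
    stride = CHUNK_TOKENS - OVERLAP_TOKENS
    # Ignores per-document rounding, so this slightly undercounts chunks.
    num_chunks = total_tokens / stride
    vector_bytes = num_chunks * EMBED_DIM * BYTES_PER_FLOAT
    context_tokens = TOP_K * CHUNK_TOKENS
    return {
        "tokens_per_doc": total_tokens / NUM_DOCS,
        "num_chunks": num_chunks,
        "vector_gb": vector_bytes / 1e9,
        "context_tokens_per_query": context_tokens,
        "context_tokens_per_sec_at_peak": context_tokens * PEAK_QPS,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the corpus yields tens of millions of chunks and well over 100 GB of uncompressed float32 vectors, which is why sharding, quantization, and a bounded top-k are worth discussing in the design.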

Clarifying Questions to Ask

What exactly is the corpus — size, formats, languages — and how frequently does it change?
What are the latency, throughput, and per-query cost targets?
Do answers require strict grounding with citations, and what is the desired behavior when no relevant document exists?
Is there document-level access control (different users allowed to see different documents)?
Are queries single-turn, or conversational with follow-ups that need query rewriting?
Is the LLM a hosted API or self-hosted, and can we also self-host embedding/reranking models?

Follow-up Questions

How would you support multi-hop questions whose answer requires combining evidence from several documents?
You upgrade the embedding model: how do you re-embed $10^7$ documents without downtime or a retrieval-quality dip during the migration?
How do you enforce per-user document permissions in retrieval without destroying latency or leaking excluded content into answers?
If you had to cut serving cost by roughly 5x with only marginal quality loss, which levers would you pull first, and why?
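For the permissions follow-up, one possible direction is to apply the access check before truncating to the top k, so that excluded passages neither take up result slots nor reach the prompt. The sketch below is a minimal illustration under an assumed data model: `AclChunk`, its `allowed_groups` field, and `filtered_top_k` are hypothetical names, and a real system would more likely push this filter into the index query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclChunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical: groups permitted to read the doc


def visible(chunk, user_groups):
    """A chunk is visible if the user shares at least one allowed group."""
    return bool(chunk.allowed_groups & user_groups)


def filtered_top_k(scored_chunks, user_groups, k=3):
    """Drop non-visible chunks first, then rank and truncate.

    scored_chunks is an iterable of (score, AclChunk) pairs in any order.
    Filtering before truncation means a user with few permissions still
    gets up to k results, and nothing excluded can leak into the prompt.
    """
    permitted = [(s, c) for s, c in scored_chunks if visible(c, user_groups)]
    permitted.sort(key=lambda sc: sc[0], reverse=True)
    return permitted[:k]


if __name__ == "__main__":
    candidates = [
        (0.90, AclChunk("kb/onboarding", "...", frozenset({"everyone"}))),
        (0.95, AclChunk("fin/forecast", "...", frozenset({"finance"}))),
        (0.40, AclChunk("eng/runbook", "...", frozenset({"everyone", "eng"}))),
    ]
    for score, chunk in filtered_top_k(candidates, frozenset({"everyone"})):
        print(score, chunk.doc_id)
```

The trade-off to discuss is where the filter runs: filtering after an approximate nearest-neighbor search can return too few results for restricted users, while filtering inside the index costs latency or requires per-tenant partitioning.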
