Design a Retrieval-Augmented Generation (RAG) System

Last updated: Jul 2, 2026

Quick Overview

This question tests your ability to design and engineer a retrieval-augmented generation (RAG) system. It covers the offline indexing and online query planes, retrieval quality, prompt assembly, grounded generation with citations, and production monitoring. Domain: ML System Design.

Company: xAI

Role: Software Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite

Design a retrieval-augmented generation (RAG) system for a production question-answering product. Users ask natural-language questions, and the system must answer them grounded in a large, continuously updated document corpus (e.g., internal documentation, knowledge-base articles, and crawled pages) rather than from the LLM's parametric memory alone.

You own the design end to end:

  • the offline pipeline that ingests, processes, and indexes the corpus, and
  • the online serving path that, given a user query, retrieves relevant context, assembles a prompt, and produces a grounded answer with citations to the source passages.

Walk through the architecture, the key design decisions and their trade-offs, and how you would evaluate and monitor answer quality in production.

Hint: Decompose into two planes

Split the system into an offline indexing plane (parse → chunk → embed → index) and an online query plane (embed query → retrieve → rerank → assemble prompt → generate → post-check). Design and scale each plane independently: they have completely different latency, throughput, and consistency requirements.

Hint: Retrieval quality

Pure vector search misses exact identifiers, names, and rare terms; pure lexical search misses paraphrases. Consider hybrid retrieval (BM25 + dense embeddings, merged with reciprocal rank fusion) followed by a cross-encoder reranker over a small candidate set. Chunking granularity is the other big lever: small chunks retrieve precisely but lose context; large chunks dilute the embedding.

Hint: Evaluating a RAG system

Evaluate the stages separately: retrieval with labeled (query, relevant-passage) pairs and recall@k / MRR, and end-to-end generation with groundedness/faithfulness (is every claim supported by the retrieved context?) and answer relevance, typically via a calibrated LLM-as-judge plus periodic human review. A bad answer can come from a good retriever and vice versa; you need to know which stage failed.
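
The retrieval-quality hint names reciprocal rank fusion (RRF) as the way to merge BM25 and dense results. A minimal sketch of that merge step follows. The function name, the document IDs, and the choice of `k = 60` (a commonly used default) are assumptions for illustration, not part of the question.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is an assumed, commonly used constant.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_top = ["d3", "d1", "d7"]   # hypothetical lexical results
    dense_top = ["d1", "d5", "d3"]  # hypothetical vector-search results
    print(reciprocal_rank_fusion([bm25_top, dense_top]))
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against cosine similarities. The fused list would then be truncated to a small candidate set before the cross-encoder reranker runs.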

Aug 7, 2025, 12:00 AM

Constraints & Assumptions

  • Corpus: assume on the order of $10^7$ documents (~100 GB of raw text), heterogeneous formats (HTML, Markdown, PDF), mostly English.
  • Freshness: documents are added and edited continuously; changes should be retrievable within minutes, not days.
  • Load: assume peak traffic in the low hundreds of QPS for the answer endpoint.
  • Latency: end-to-end p95 of a few seconds is acceptable (generation dominates); the retrieval stack should stay within a ~300 ms budget.
  • Answers must include citations to the source passages, and the system should say it cannot answer rather than guess when the corpus has no support.
  • The LLM has a large but finite context window, and per-token cost makes "stuff everything into the prompt" uneconomical.
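
The corpus numbers above support a quick back-of-envelope sizing of the vector index. The sketch below is one way to run that arithmetic. Only the document count and raw text size come from the constraints; the chunk size, overlap, embedding dimension, and float width are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingAssumptions:
    num_docs: int = 10_000_000         # from the constraints (~10^7 documents)
    raw_text_bytes: int = 100 * 10**9  # from the constraints (~100 GB raw text)
    chunk_bytes: int = 2_000           # ASSUMPTION: ~2 KB of text per chunk
    overlap_bytes: int = 200           # ASSUMPTION: 10% overlap between chunks
    embedding_dim: int = 768           # ASSUMPTION: depends on the embedding model
    bytes_per_float: int = 4           # ASSUMPTION: float32, no quantization


def estimate_index_size(a: SizingAssumptions) -> dict:
    """Rough chunk count and raw vector storage; ignores index overhead."""
    stride = a.chunk_bytes - a.overlap_bytes       # new text covered per chunk
    num_chunks = -(-a.raw_text_bytes // stride)    # ceiling division
    vector_bytes = num_chunks * a.embedding_dim * a.bytes_per_float
    return {
        "avg_doc_bytes": a.raw_text_bytes // a.num_docs,
        "num_chunks": num_chunks,
        "vector_bytes": vector_bytes,
        "vector_gb": vector_bytes / 10**9,
    }


if __name__ == "__main__":
    for name, value in estimate_index_size(SizingAssumptions()).items():
        print(f"{name}: {value}")
```

Under these assumptions the corpus yields roughly 5.6 × 10^7 chunks and about 170 GB of uncompressed vectors, which is larger than the raw text itself. That is why vector quantization and sharding are worth raising early in the discussion.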

Clarifying Questions to Ask

  • What exactly is the corpus — size, formats, languages — and how frequently does it change?
  • What are the latency, throughput, and per-query cost targets?
  • Do answers require strict grounding with citations, and what is the desired behavior when no relevant document exists?
  • Is there document-level access control (different users allowed to see different documents)?
  • Are queries single-turn, or conversational with follow-ups that need query rewriting?
  • Is the LLM a hosted API or self-hosted, and can we also self-host embedding/reranking models?

Follow-up Questions

  • How would you support multi-hop questions whose answer requires combining evidence from several documents?
  • You upgrade the embedding model: how do you re-embed $10^7$ documents without downtime or a retrieval-quality dip during the migration?
  • How do you enforce per-user document permissions in retrieval without destroying latency or leaking excluded content into answers?
  • If you had to cut serving cost by roughly 5x with only marginal quality loss, which levers would you pull first, and why?
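
The task also asks how quality would be evaluated, and the evaluation hint suggests scoring retrieval on its own with recall@k and MRR over labeled (query, relevant-passage) pairs. A minimal sketch of those two metrics is below. The function names, passage IDs, and the `(retrieved, relevant)` input shape are hypothetical, and it assumes every labeled query has at least one relevant passage.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant passages found in the top k results."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant passage")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    runs: List[Tuple[Sequence[str], Set[str]]], k: int
) -> Tuple[float, float]:
    """Return (mean recall@k, MRR) over (retrieved, relevant) pairs."""
    if not runs:
        raise ValueError("no labeled queries to evaluate")
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
    return mean_recall, mrr


if __name__ == "__main__":
    labeled_runs = [
        (["p1", "p9", "p4"], {"p4"}),        # relevant passage only at rank 3
        (["p2", "p3", "p8"], {"p2", "p7"}),  # one of two relevant passages at rank 1
    ]
    print(evaluate_retrieval(labeled_runs, k=2))
```

Tracking these numbers separately from the groundedness and answer-relevance scores is what makes it possible to tell whether a bad answer came from the retriever or from generation.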
