Design a low-latency RAG system

Q: Design a low-latency RAG system

This question evaluates ML system design skills, specifically the ability to architect retrieval-augmented generation pipelines under strict latency, cost, and multi-tenancy constraints. It tests how candidates reason about vector search, re-ranking, and LLM integration at production scale — a core competency for senior machine learning engineering roles.

Q: How do I approach ML System Design interview questions?

ML System Design questions require both an understanding of core concepts and practice. PracHub provides solutions with explanations to help you master ML System Design interviews.

Q: What difficulty level is this interview question?

This is a hard-difficulty ML System Design question, commonly asked during Technical Screen rounds at OpenAI.

Q: What role is this question designed for?

This question is commonly asked for Machine Learning Engineer candidates at OpenAI during technical interviews.

Question

Design a Low-Latency RAG System for Customer Support

Problem Statement

Design a production-grade retrieval-augmented generation (RAG) system that powers a customer-support assistant. The assistant answers user questions by retrieving relevant passages from a large corpus of support documents (help-center articles, API docs, PDFs, past tickets, internal runbooks) and grounding an LLM's response in that retrieved evidence.

The system operates under a strict latency target of p99 ≤ 1.5 s end-to-end and a real cost budget. It must serve multiple tenants (separate customer organizations) from a shared platform without leaking data across tenant boundaries, keep its knowledge current as documents change, and degrade gracefully when components fail or slow down.

You will be asked to specify the end-to-end architecture, the data-management and reliability story, an evaluation plan, a concrete scaling/capacity estimate, and finally how you would prioritize the material under interview time pressure.

Constraints & Assumptions

Latency: p99 ≤ 1.5 s end-to-end (retrieval + re-ranking + generation + safety), measured at the API.
Scale target (for the capacity section): 10 million source documents, 50 QPS steady-state.
Multi-tenancy: many customer organizations on one platform; strict per-tenant data isolation is mandatory.
Freshness: documents are added, edited, and deleted continuously; answers must reflect changes within a bounded staleness window.
Quality bar: answers must be grounded in retrieved evidence and cite sources; the assistant should abstain rather than hallucinate when evidence is insufficient.
Cost: the design should be cost-aware; you should be able to give a back-of-the-envelope (BOTE) monthly infra figure and a per-query cost.
Assume you may self-host open models or call a hosted LLM API; state your choice and its tradeoffs rather than assuming a specific vendor.

Clarifying Questions to Ask

A strong candidate scopes the problem before designing. Reasonable questions up front:

Read vs. write mix and corpus churn — how often do documents change, and is the 10M-doc corpus mostly static or rapidly updated? This drives the indexing and cache-invalidation design.
Query distribution — how repetitive are queries (do many users ask the same FAQ)? This determines how much caching can buy us.
Answer-length and interaction style — short factual answers vs. long multi-step troubleshooting, single-turn vs. multi-turn conversation. This sets the generation-token budget, the dominant latency term.
Acceptable staleness — is minutes-of-lag after an edit acceptable, or must updates be near-instant? This trades off cost against freshness.
Quality/safety priorities — what is the relative cost of a wrong/hallucinated answer vs. an "I don't know" abstention or a human handoff? This calibrates guardrail strictness and fallback behavior.
What "tenant isolation" must guarantee — logical (filter-based) isolation vs. physical (separate indices / keys), and any compliance requirements (encryption, audit, residency).

Part 1 — End-to-End Architecture

Specify the full request and ingestion path. At minimum cover: document ingestion, chunking strategy, embedding model choice, index type (dense vector, lexical/BM25, or hybrid), caching layers, re-ranking, prompt orchestration / context assembly, and safety/guardrail components. Justify each choice against the p99 ≤ 1.5 s budget — i.e., propose an explicit per-stage latency budget that sums to under 1.5 s.
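One way to make the per-stage budget checkable is to write it down and sum it. The sketch below is illustrative only: the stage names and every millisecond figure are assumptions chosen for the example, not measurements or a recommended split. Treating stage budgets as additive is also a simplification, so the real end-to-end tail still has to be measured.

```python
# Hypothetical per-stage p99 budget in milliseconds.
# Every number here is an illustrative assumption, not a measurement.
BUDGET_MS = {
    "auth_and_cache_lookup": 20,
    "query_embedding": 30,
    "hybrid_retrieval": 80,
    "rerank": 120,
    "context_assembly": 20,
    "llm_generation": 1000,  # expected to be the dominant term
    "safety_checks": 80,
    "network_overhead": 50,
}

TARGET_P99_MS = 1500  # from the problem statement


def budget_headroom_ms(budget, target_ms=TARGET_P99_MS):
    """Return the unallocated headroom; raise if the stages exceed the target."""
    total = sum(budget.values())
    if total > target_ms:
        raise ValueError(f"budget {total} ms exceeds target {target_ms} ms")
    return target_ms - total


if __name__ == "__main__":
    print("allocated:", sum(BUDGET_MS.values()), "ms")
    print("headroom:", budget_headroom_ms(BUDGET_MS), "ms")
```

Leaving some headroom unallocated gives the design room for retries and tail variance; how much to reserve is a judgment call to state out loud.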

What This Part Should Cover

A coherent end-to-end diagram/flow with a per-stage latency budget summing to < 1.5 s, and a defensible retrieval design (hybrid vs. dense-only) with reasoning.
Chunking and embedding choices (chunk size/overlap, model dimensionality) tied to recall and index-size tradeoffs.
Caching topology across layers and how each cache is keyed and invalidated.
Grounding and safety: how the prompt enforces citation/abstention, and where guardrails (ACL filtering, PII, policy, groundedness) sit in the path.
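If the candidate picks hybrid retrieval, the lexical and dense result lists have to be merged before re-ranking. A minimal sketch of one common option, reciprocal rank fusion (RRF), follows. The function name, the `k=60` smoothing constant, and the document IDs are assumptions for illustration; the question itself does not prescribe a fusion method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]   # hypothetical lexical results
    dense_hits = ["b", "d", "a"]  # hypothetical vector results
    print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF only needs ranks, so it avoids calibrating BM25 scores against cosine similarities. That is a plausible reason to prefer it in a latency-bound pipeline, though weighted score fusion is an equally defensible answer.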

Part 2 — Data Management & Reliability

Describe how the system handles document updates and deletions, enforces multi-tenant data isolation, and survives failure modes (empty retrieval, component timeouts, stale cache, LLM/serving failure, load spikes). For each failure mode, give the fallback/degradation behavior.
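The failure modes listed above can be pinned to explicit fallbacks so that nothing is left to ad-hoc handling. The mapping below is a sketch under assumptions: the enum members mirror the failure modes named in the question, while the fallback action names are hypothetical labels, not part of any real framework.

```python
from enum import Enum


class Failure(Enum):
    EMPTY_RETRIEVAL = "empty_retrieval"
    COMPONENT_TIMEOUT = "component_timeout"
    STALE_CACHE = "stale_cache"
    LLM_FAILURE = "llm_failure"
    OVERLOAD = "overload"


# Hypothetical fallback actions; each one fails safe rather than fabricating.
FALLBACKS = {
    Failure.EMPTY_RETRIEVAL: "abstain_and_offer_handoff",
    Failure.COMPONENT_TIMEOUT: "skip_rerank_use_fused_order",
    Failure.STALE_CACHE: "bypass_cache_and_recompute",
    Failure.LLM_FAILURE: "return_top_passages_with_links",
    Failure.OVERLOAD: "shed_load_and_serve_cached_only",
}

DEFAULT_FALLBACK = "abstain_and_offer_handoff"


def choose_fallback(failure):
    """Return the configured fallback, defaulting to abstention."""
    return FALLBACKS.get(failure, DEFAULT_FALLBACK)


if __name__ == "__main__":
    for failure in Failure:
        print(f"{failure.value:>18} -> {choose_fallback(failure)}")
```

Defaulting to abstention means an unanticipated failure still produces a safe response instead of an ungrounded answer.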

What This Part Should Cover

A concrete update/delete path (versioning, tombstones, compaction, eventual-consistency SLA) and matching cache invalidation.
A multi-tenancy decision with an explicit isolation guarantee (filtering + encryption/keys + tenant-scoped caches).
A failure-mode table: each of empty retrieval, timeout, stale cache, LLM failure, and overload mapped to a defined fallback.
Evidence the design fails safe (abstain/handoff) rather than fabricating answers.
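Tenant-scoped caching and invalidation can be combined in the cache key itself. The sketch below assumes a per-tenant `corpus_version` counter that is bumped on every document edit or delete, so older entries become unreachable without an explicit purge. The function name, key prefix, and normalization rule are all hypothetical.

```python
import hashlib


def answer_cache_key(tenant_id, query, corpus_version):
    """Build a cache key that cannot collide across tenants or corpus versions."""
    # Light normalization so trivially different phrasings share an entry.
    normalized = " ".join(query.lower().split())
    # Join with a separator that cannot appear in normal text, so distinct
    # (tenant, version, query) triples never hash the same payload.
    payload = "\x1f".join([tenant_id, str(corpus_version), normalized])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"rag:answer:{tenant_id}:{digest}"


if __name__ == "__main__":
    print(answer_cache_key("acme", "How do I reset my password?", 7))
```

A coarse per-tenant version invalidates more than strictly necessary after each edit. A finer scheme keyed on the versions of the cited documents would raise the hit rate at the cost of extra bookkeeping.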

Part 3 — Evaluation (Offline & Online)

Define the metrics (retrieval quality, answer accuracy, hallucination/groundedness rate, latency, cost, safety, and a business metric), an offline experiment plan with concrete ablations (e.g., BM25-only vs. dense-only vs. hybrid+re-ranker), and an online A/B test outline including how you would size it.

What This Part Should Cover

A golden/eval set plan (size, where labels come from) and metrics split across retrieval, answer quality, latency, cost, and safety.
A clear set of offline ablations that isolates the value of hybrid retrieval and of the re-ranker.
An online A/B design: unit of randomization, sticky assignment, ramp, primary metric + guardrail metrics, and a sample-size estimate with stated assumptions.
How hallucination/groundedness is measured at scale (e.g., judge model calibrated against human audits) — not just retrieval recall.
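For the sample-size estimate, a standard two-proportion normal approximation is usually enough at interview depth. The sketch below assumes the primary metric is a binary per-user outcome (for example, whether the conversation was contained without human handoff). The 60% baseline and 2-point lift in the usage example are placeholders, not data.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_baseline, abs_lift, alpha=0.05, power=0.8):
    """Approximate users per arm for a two-sided test on two proportions."""
    p1 = p_baseline
    p2 = p_baseline + abs_lift
    if abs_lift == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("need a non-zero lift and rates strictly inside (0, 1)")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / abs_lift ** 2)


if __name__ == "__main__":
    # Placeholder assumptions: 60% baseline rate, 2-point absolute lift.
    print(sample_size_per_arm(0.60, 0.02), "users per arm")
```

Halving the detectable lift roughly quadruples the required sample, which is why the minimum effect size should be agreed on before sizing the test.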

Part 4 — Scaling Plan & Capacity Estimates (10M docs, 50 QPS)

Produce a back-of-the-envelope capacity plan: index sizing (memory/disk for the vector and lexical indices), throughput (can the design serve 50 QPS within budget), and cost (monthly infra order-of-magnitude and per-query cost). State your assumptions (chunks per doc, embedding dimension, tokens per query) explicitly so the numbers are checkable.

What This Part Should Cover

An explicit assumption chain (chunks/doc, embedding dimension $d$, tokens) and a vector-index RAM estimate derived from it, plus a lexical-index size estimate.
A throughput argument that 50 QPS fits the latency budget (sharding/replicas, ANN params), not just a hand-wave.
A cost BOTE: monthly infra order-of-magnitude and a per-query cost, with the LLM-token term identified as dominant.
Awareness of cost/quality knobs (vector quantization, output-token caps, adaptive k, cache hit-rate).
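The assumption chain can be turned into checkable arithmetic. In the sketch below, the document count and QPS come from the problem statement, and 2 chunks per document is chosen to match the 20M-chunk index mentioned in the follow-up questions. The embedding dimension, token counts, and per-token prices are placeholder assumptions, not quotes from any vendor.

```python
DOCS = 10_000_000    # from the problem statement
QPS = 50             # from the problem statement
CHUNKS_PER_DOC = 2   # assumption, consistent with a 20M-chunk index
DIM = 768            # assumption: embedding dimension d


def vector_bytes(num_vectors, dim, bytes_per_component):
    """Raw vector storage only; ANN graph or index overhead is extra."""
    return num_vectors * dim * bytes_per_component


def monthly_queries(qps, days=30):
    return qps * 86_400 * days


def llm_cost_per_query(prompt_tokens, output_tokens,
                       usd_per_1k_prompt, usd_per_1k_output):
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + output_tokens / 1000 * usd_per_1k_output)


chunks = DOCS * CHUNKS_PER_DOC
raw_gb = vector_bytes(chunks, DIM, 4) / 1e9   # float32 components
int8_gb = vector_bytes(chunks, DIM, 1) / 1e9  # int8 scalar quantization
queries = monthly_queries(QPS)
# Placeholder token counts and prices, for illustration only.
cost = llm_cost_per_query(2000, 200, usd_per_1k_prompt=0.001,
                          usd_per_1k_output=0.002)

if __name__ == "__main__":
    print(f"chunks: {chunks:,}")
    print(f"float32 vectors: {raw_gb:.2f} GB, int8: {int8_gb:.2f} GB")
    print(f"queries/month: {queries:,}")
    print(f"LLM cost/query: ${cost:.4f}, per month: ${cost * queries:,.0f}")
```

With these placeholders the float32 vectors need about 61 GB before index overhead, and quantizing to int8 cuts that by 4x. Swapping in real prices and token counts shows quickly whether the LLM-token term dominates the monthly bill.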

Part 5 — Prioritization Under Time Pressure

You have only 15 minutes to present a coherent plan. State what you would cover first (the load-bearing decisions) and what you would explicitly de-scope or mention only briefly. The point is to demonstrate judgment about what matters most for this problem.

What This Part Should Cover

A ranked "cover-first" list anchored to the problem's hard constraints (latency, isolation, grounding, reliability).
An explicit de-scope list, with brief justification for why each is lower priority here.
Evidence the candidate can sequence a narrative under time pressure rather than dumping every component uniformly.

What a Strong Answer Covers

Across all parts, the interviewer is looking for cross-cutting strengths beyond the per-part rubrics:

Constraint-driven design — every component choice is justified against the p99 ≤ 1.5 s budget and the cost target, not adopted by default.
Coherence end-to-end — ingestion, retrieval, generation, evaluation, and scaling tell one consistent story (e.g., the chunking and embedding choices in Part 1 reappear in the Part 4 sizing math).
Failure-aware, fail-safe thinking — the system degrades gracefully and abstains rather than hallucinating; isolation is airtight including in caches.
Quantitative fluency — comfortable producing checkable BOTE numbers and a sample-size estimate, with assumptions stated.
Judgment and communication — clear prioritization and the ability to defend tradeoffs (hybrid vs. dense, shared vs. per-tenant index, self-host vs. API).

Follow-up Questions

Your p99 is at 1.4 s and a traffic spike pushes the re-ranker over budget. Walk through your adaptive controller's decisions in order, and the quality cost of each.
A tenant reports the assistant cited a document they deleted yesterday. Trace every place that stale content could have survived (index, caches, in-flight requests) and how your design prevents each.
You want to swap the embedding model for a better one. Describe a zero-downtime migration over a 20M-chunk index, and how you'd validate it before cutover.
Offline Recall@k improves but online containment rate does not. Give two plausible explanations and how you'd distinguish them.
