Design a low-latency RAG system

This is an ML System Design interview question from OpenAI for Machine Learning Engineer roles.

Question

System Design: Production-Grade RAG for Customer Support (p99 ≤ 1.5 s)

Goal

Design a production-ready retrieval-augmented generation (RAG) system for a customer-support assistant with strict latency (target p99 ≤ 1.5 s) and cost constraints.

Requirements

End-to-end architecture:
- Document ingestion
- Chunking strategy
- Embedding model choice
- Index type (vector, BM25, or hybrid)
- Caching layers
- Re-ranking
- Prompt orchestration
- Safety/guardrail components

Data management and reliability:
- Handling document updates and deletions
- Multi-tenant data isolation
- Failure modes (empty retrieval, timeouts, stale cache) and fallbacks

Evaluation:
- Offline and online metrics (answer accuracy, hallucination rate, latency, cost)
- Experiment plan (e.g., BM25 + cross-encoder re-ranker vs dense-only)
- A/B test outline

Scaling plan and capacity estimates for 10 million documents and 50 QPS:
- Index sizing, throughput, and cost

Prioritization under time pressure:
- What to cover vs. de-scope first in a 15-minute interview presentation
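
The capacity-estimate requirement above comes down to back-of-envelope arithmetic, which can be sketched as follows. Only the 10 million documents, 50 QPS, and 1.5 s p99 figures come from the question. The chunks-per-document count, embedding dimension, and float32 storage are hypothetical values picked for illustration, and the names `SizingAssumptions`, `raw_vector_bytes`, and `max_in_flight_requests` are made up for this sketch. It counts raw embedding bytes only, so any real index would add overhead on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingAssumptions:
    # Given in the question statement.
    num_documents: int = 10_000_000
    peak_qps: float = 50.0
    p99_latency_s: float = 1.5
    # ASSUMPTIONS for illustration only; not stated in the question.
    chunks_per_document: int = 5
    embedding_dim: int = 768
    bytes_per_dimension: int = 4  # float32


def total_chunks(a: SizingAssumptions) -> int:
    """Number of vectors the index has to hold."""
    return a.num_documents * a.chunks_per_document


def raw_vector_bytes(a: SizingAssumptions) -> int:
    """Bytes for the raw embeddings alone (no index overhead, no metadata)."""
    return total_chunks(a) * a.embedding_dim * a.bytes_per_dimension


def max_in_flight_requests(a: SizingAssumptions) -> float:
    """Little's law: concurrency = arrival rate x time in system.

    Using the p99 latency instead of the mean gives a conservative figure.
    """
    return a.peak_qps * a.p99_latency_s


if __name__ == "__main__":
    a = SizingAssumptions()
    print(f"chunks:             {total_chunks(a):,}")
    print(f"raw vectors:        {raw_vector_bytes(a) / 1e9:.1f} GB")
    print(f"in-flight requests: {max_in_flight_requests(a):.0f}")
```

Under these assumed values the raw vectors come to about 153.6 GB and the system has to sustain roughly 75 concurrent requests. Changing `chunks_per_document`, `embedding_dim`, or `bytes_per_dimension` (for example, to model quantized vectors) scales the memory figure linearly, which makes the sensitivity of the estimate easy to discuss in an interview.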
