System Design: Production-Grade RAG for Customer Support (p99 ≤ 1.5 s)
Goal
Design a production-ready retrieval-augmented generation (RAG) system for a customer-support assistant that must meet a strict latency target (p99 ≤ 1.5 s) and operate under cost constraints.
Requirements
- End-to-end architecture:
  - Document ingestion
  - Chunking strategy
  - Embedding model choice
  - Index type (vector, BM25, or hybrid)
  - Caching layers
  - Re-ranking
  - Prompt orchestration
  - Safety/guardrail components
- Data management and reliability:
  - Handling document updates and deletions
  - Multi-tenant data isolation
  - Failure modes (empty retrieval, timeouts, stale cache) and fallbacks
- Evaluation:
  - Offline and online metrics (answer accuracy, hallucination rate, latency, cost)
  - Experiment plan (e.g., BM25 + cross-encoder re-ranker vs. dense-only)
  - A/B test outline
- Scaling plan and capacity estimates for 10 million documents and 50 QPS:
  - Index sizing, throughput, and cost
- Prioritization under time pressure:
  - What to cover vs. de-scope first in a 15-minute interview presentation
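
Worked sketch for the capacity estimate

The scaling requirement (10 million documents, 50 QPS, p99 ≤ 1.5 s) lends itself to a quick back-of-the-envelope calculation. The sketch below is one possible way to do that arithmetic, not a prescribed answer. Only the document count, the QPS, and the latency target come from the prompt. The chunks-per-document figure, the embedding dimension, float32 storage, and the index overhead factor are all hypothetical assumptions, and the function names are illustrative.

```python
# Back-of-the-envelope sizing for a dense vector index.
# From the prompt: 10 million documents, 50 QPS, p99 <= 1.5 s.
# ASSUMPTIONS (not from the prompt): 5 chunks per document, 768-dim
# embeddings, float32 values, and a 1.5x overhead factor for index
# structures and metadata.

NUM_DOCS = 10_000_000
TARGET_QPS = 50
P99_LATENCY_S = 1.5

ASSUMED_CHUNKS_PER_DOC = 5
ASSUMED_EMBEDDING_DIM = 768
ASSUMED_BYTES_PER_VALUE = 4      # float32
ASSUMED_INDEX_OVERHEAD = 1.5     # graph links, IDs, metadata


def estimate_index(num_docs, chunks_per_doc, dim,
                   bytes_per_value=4, overhead=1.0):
    """Return vector count and approximate memory footprint."""
    num_vectors = num_docs * chunks_per_doc
    raw_bytes = num_vectors * dim * bytes_per_value
    return {
        "num_vectors": num_vectors,
        "raw_bytes": raw_bytes,
        "raw_gib": raw_bytes / 2**30,
        "with_overhead_gib": raw_bytes * overhead / 2**30,
    }


def required_concurrency(qps, latency_s):
    """Little's law: in-flight requests = arrival rate * time in system.

    Using the p99 latency here gives a conservative upper bound,
    since the mean time in system is normally lower than the p99.
    """
    return qps * latency_s


estimate = estimate_index(
    NUM_DOCS,
    ASSUMED_CHUNKS_PER_DOC,
    ASSUMED_EMBEDDING_DIM,
    ASSUMED_BYTES_PER_VALUE,
    ASSUMED_INDEX_OVERHEAD,
)
in_flight = required_concurrency(TARGET_QPS, P99_LATENCY_S)

print(f"vectors:            {estimate['num_vectors']:,}")
print(f"raw embeddings:     {estimate['raw_gib']:.1f} GiB")
print(f"with overhead:      {estimate['with_overhead_gib']:.1f} GiB")
print(f"in-flight requests: {in_flight:.0f} (upper bound)")
```

Under these assumptions the index holds 50 million vectors, and at most about 75 requests are in flight at once. Changing the assumed chunk count, dimension, or value width (for example, quantized storage) scales the memory estimate linearly, which makes those parameters natural levers to discuss alongside cost.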