
Design a RAG system with evaluation

Last updated: Apr 18, 2026

Quick Overview

This question evaluates expertise in designing Retrieval-Augmented Generation (RAG) systems, covering end-to-end architecture, document ingestion and preprocessing, embedding and indexing strategies, retrieval and reranking, prompt/context assembly, safety/fallbacks, and per-component evaluation.

Company: OpenAI

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: medium

Interview Round: Onsite

## Scenario

You are asked to design a Retrieval-Augmented Generation (RAG) system that answers user questions using a private corpus (e.g., internal docs, PDFs, knowledge base articles). The interviewer wants you to walk through **each component** and explain **how you would evaluate each step**.

## Requirements

- Support natural-language Q&A over private documents.
- Handle frequent document updates (new/changed docs).
- Provide citations or traceability to sources.
- Low latency for interactive use.
- Reduce hallucinations and ensure answers are grounded in retrieved context.

## What to cover

1. End-to-end architecture and data flow.
2. Document ingestion and preprocessing (parsing, cleaning, chunking).
3. Embedding strategy and indexing (vector DB / hybrid search).
4. Retrieval (query understanding, top-k, filters) and optional reranking.
5. Prompting/context assembly and generation.
6. Safety/guardrails and fallback behavior when retrieval is weak.
7. Evaluation plan for:
   - ingestion/chunking quality
   - retrieval quality
   - reranking quality (if used)
   - generation quality and grounding
   - end-to-end user success
8. Online monitoring and continuous improvement loop.

Solution

### 1) Clarify scope and assumptions

Start by asking:

- Corpus size (number of docs/pages), formats (PDF/HTML/Markdown), update frequency.
- Typical queries (fact lookup vs multi-hop reasoning vs summarization).
- Latency budget (e.g., p95 < 2s) and cost constraints.
- Compliance needs (PII, access control), multi-tenant needs.
- Output requirements: citations, quotes, structured JSON, etc.

Assume: internal docs, need citations, moderate scale (10^5–10^7 chunks), interactive latency.

---

### 2) High-level architecture (components)

**Offline / batch (or streaming) pipeline**

1. **Ingestion**: fetch docs from sources (S3/Drive/Confluence/Git).
2. **Parsing & normalization**: extract text, preserve structure (headings, tables if possible).
3. **Chunking**: split text into retrievable units with metadata.
4. **Embedding**: compute vector embeddings for chunks (and optionally for titles/headers).
5. **Indexing**: store vectors + metadata in a vector store; optionally build lexical index (BM25) too.

**Online query pipeline**

1. **Auth + policy**: apply access control filters.
2. **Query understanding**: rewrite/expand query; detect intent; extract filters.
3. **Retrieval**: vector search (and/or hybrid BM25+vector) → top-k candidates.
4. **Reranking (optional)**: cross-encoder reranker or LLM-based rerank.
5. **Context assembly**: dedupe, compress, select passages; attach citations.
6. **Generation**: grounded answer with instructions to cite and abstain if unsupported.
7. **Post-processing**: safety filters, formatting, source list, confidence.
8. **Logging**: store query, retrieved docs, model output, latency, user feedback.

---

### 3) Ingestion, parsing, chunking

**Parsing**

- Use format-specific parsers; preserve page numbers, headings, and source URLs.
- Extract tables carefully (either linearize or store as structured text).

**Chunking strategy**

- Common baseline: 200–400 tokens with 10–20% overlap.
- Prefer *semantic* chunking: split by headings/sections; keep coherent units.
- Add metadata: doc_id, section title, timestamp, ACL tags, source link.

**Pitfalls**

- Too small chunks → lose context; too large → fewer retrieved units fit into prompt.
- PDF extraction noise; duplicated boilerplate; headers/footers.

---

### 4) Embeddings and indexing

**Embedding choices**

- Use strong text embeddings; consider domain adaptation if jargon-heavy.
- Store multiple representations if needed: chunk text embedding + title embedding.

**Index**

- Vector index (HNSW/IVF) + metadata filtering (tenant, ACL, doc type, time).
- Consider **hybrid search**: BM25 for exact matches + vector for semantics.

**Freshness**

- Incremental updates: re-embed changed chunks; tombstone deleted docs.
- Keep embedding/model versioning for reproducibility.

---

### 5) Retrieval and reranking

**Retrieval**

- Top-k retrieval (e.g., k=20–100) with filters.
- Query rewriting: expand acronyms, convert question to search query.
- Multi-query retrieval: generate 3–5 query variants and merge results.

**Reranking (optional but common)**

- Cross-encoder reranker on query–chunk pairs to improve precision.
- LLM reranking for small candidate sets when budget allows.

**Context assembly**

- Deduplicate near-identical chunks.
- Prefer diverse sources if answering broad questions.
- Use *context compression*: summarize or extract only relevant sentences to fit token budget.

---

### 6) Generation and grounding

**Prompting**

- System instruction: answer only using provided context; cite sources; say “I don’t know” if missing.
- Provide a clear citation format (e.g., [DocTitle §Section](URL)).

**Hallucination mitigation**

- Refusal/abstention policy based on retrieval confidence (e.g., if top score < threshold).
- Ask clarifying questions when query is underspecified.

**Guardrails**

- PII redaction, policy filters, safe completion templates.
- Enforce access control at retrieval time and again before final answer.
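
The abstention policy, citation format, and second access-control check from section 6 can be sketched roughly as follows. This is a minimal illustration and not a reference implementation: the `Candidate` fields, the `assemble_context` helper, and the `min_score` / `max_chunks` values are all hypothetical, and a real threshold would need to be tuned on labeled data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A retrieved chunk. All field names here are hypothetical."""
    chunk_id: str
    title: str
    section: str
    url: str
    text: str
    score: float          # retrieval or rerank score, higher is better
    acl_tags: frozenset   # groups allowed to read this chunk


def assemble_context(candidates, user_groups, min_score=0.35, max_chunks=5):
    """Return (context_block, citations), or (None, []) to signal abstention.

    min_score and max_chunks are illustrative values, not tuned thresholds.
    """
    # Second ACL check, independent of any filter applied at retrieval time.
    allowed = [c for c in candidates if c.acl_tags & user_groups]
    ranked = sorted(allowed, key=lambda c: c.score, reverse=True)

    # Abstain when nothing is visible to the user or the best match is weak.
    if not ranked or ranked[0].score < min_score:
        return None, []

    kept = [c for c in ranked[:max_chunks] if c.score >= min_score]
    # Citation format follows the [DocTitle §Section](URL) convention.
    citations = [f"[{c.title} §{c.section}]({c.url})" for c in kept]
    context = "\n\n".join(
        f"Source {i}: {cite}\n{c.text}"
        for i, (c, cite) in enumerate(zip(kept, citations), start=1)
    )
    return context, citations
```

When the function returns `None`, the caller would skip generation and reply with an "I don't know" message or a clarifying question. Otherwise the context block goes into the prompt and the citation list is shown alongside the answer.
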

---

### 7) Evaluation: per-component and end-to-end

You want both **offline** (repeatable) and **online** (real user) evaluation.

#### A) Chunking / ingestion evaluation

Goal: chunks should be coherent and retrievable.

- Manual audits on a sampled set: coherence, duplication rate, metadata correctness.
- Automated checks: average chunk length, overlap %, parser error rates.
- Regression tests on known “hard” documents (tables, PDFs).

#### B) Retrieval evaluation (core of RAG)

Create a labeled dataset: (query, relevant chunks/docs).

- **Hit rate@k** (often loosely called Recall@k): fraction of queries where at least one relevant chunk is in top-k. Strict **Recall@k** is the fraction of a query's relevant chunks that appear in top-k, averaged over queries.
- **MRR / nDCG**: reward correct ranking order.
- Slice metrics by doc type, query type, tenant, freshness.

If you lack labels:

- Use weak supervision: click logs, human annotation on top results, or synthetic Q/A pairs generated from docs (with human spot checks).

#### C) Reranker evaluation

- Compare precision-focused metrics (e.g., nDCG@10) before vs after rerank.
- Measure latency/cost tradeoff.
- Error analysis: reranker bias toward longer passages, keyword overfitting.

#### D) Generation evaluation (with grounding)

You need to separate:

1. **Answer correctness** (does it answer the question?)
2. **Faithfulness/grounding** (is it supported by retrieved context?)
3. **Citation quality** (do citations actually back the claims?)

Methods:

- Human grading rubric (best early on): correctness, completeness, groundedness, readability.
- LLM-as-judge with guardrails + periodic human calibration.
- Automated checks:
  - “Attribution”: require each sentence to map to at least one cited chunk.
  - Contradiction detection between answer and context (imperfect but useful).

#### E) End-to-end evaluation

- Task success rate, user satisfaction, deflection rate (if used for support).
- Latency p50/p95, cost per query.
- Abstention rate: too high hurts usefulness; too low increases hallucinations.
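
The retrieval metrics from part B can be computed offline with a few lines of code. The sketch below assumes binary relevance labels and that each evaluation example is a pair of (ranked chunk IDs, set of relevant chunk IDs); the function names are illustrative. It separates hit rate@k (at least one relevant chunk in the top-k) from strict recall@k (the fraction of relevant chunks retrieved), since the two are easy to conflate.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top-k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


def evaluate(runs, k=10):
    """Average each metric over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return {}
    n = len(runs)
    return {
        f"hit_rate@{k}": sum(hit_rate_at_k(r, rel, k) for r, rel in runs) / n,
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }
```

Running `evaluate` on the same labeled set before and after a change (new chunking, different k, reranker on or off) gives a regression check, and running it per slice gives the breakdown by doc type or tenant.
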

---

### 8) Online monitoring and iteration

**Logging**

- Query text, rewritten query, retrieved doc IDs, scores, rerank scores, final prompt tokens, output.

**Online metrics**

- CTR on cited sources, thumbs up/down, “answer helpful”.
- Drift: embedding distribution shifts, retrieval score distribution, doc freshness.

**Feedback loop**

- Collect failure cases; add to eval set.
- Tune chunking, k, reranker thresholds.
- Add domain-specific synonyms/expansions or structured metadata filters.

---

### 9) Common edge cases to mention

- Conflicting sources: select most recent/authoritative; show multiple citations.
- Multi-hop questions: iterative retrieval (retrieve → draft → retrieve again).
- Access control: per-user ACL filtering is non-negotiable.
- Very long docs: hierarchical retrieval (doc-level → section-level → chunk-level).

This structure (pipeline + per-step evaluation + monitoring) directly answers “walk every component and how to evaluate each step.”
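
As one possible shape for the logging and feedback loop in section 8, the sketch below defines a per-query log record and a filter that picks out candidate failure cases for the offline eval set. The `QueryLog` fields, the `failure_cases` helper, and the `min_top_score` value are assumptions made for illustration. A production system would also need PII handling and a real storage backend.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QueryLog:
    """One record per answered query. Field names are hypothetical."""
    query: str
    rewritten_query: str
    retrieved_ids: list
    retrieval_scores: list
    rerank_scores: list
    prompt_tokens: int
    output: str
    latency_ms: float
    abstained: bool = False
    feedback: Optional[str] = None   # e.g. "up" or "down", filled in later
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line for an append-only log."""
        return json.dumps(asdict(self), ensure_ascii=False)


def failure_cases(logs, min_top_score=0.35):
    """Pick logs worth reviewing and adding to the offline eval set.

    A log qualifies if the user gave a thumbs down, the system abstained,
    or the best retrieval score fell below min_top_score (illustrative).
    """
    return [
        log for log in logs
        if log.feedback == "down"
        or log.abstained
        or not log.retrieval_scores
        or max(log.retrieval_scores) < min_top_score
    ]
```

The selected records still need human labeling before they become eval examples, since a thumbs down or a low score only flags a query as suspicious.
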

Jan 6, 2026, 12:00 AM