PracHub

Design an ML search system with RAG

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competence in ML system design for Retrieval-Augmented Generation, encompassing information retrieval, embedding-based ranking, prompt engineering, production model hosting, access control, monitoring, and cost/latency trade-offs.



Company: OpenAI

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen



Jul 15, 2025, 12:00 AM

System Design: ML-Powered Enterprise Search with RAG

Design an ML-powered enterprise search system using Retrieval-Augmented Generation (RAG) under the following context and constraints.

Context and Constraints

  • Corpus: 5M documents (avg 2 KB each) sourced from PDFs, web pages, and support tickets.
  • Freshness: Updates must be searchable within 5 minutes end-to-end.
  • Traffic: 300 QPS, multi-tenant with per-document ACLs (users/groups/roles).
  • SLOs: p95 latency ≤ 1.2 s end-to-end; budget ≤ $0.002 per query.

Assume primarily textual content (no image-heavy documents), standard enterprise authentication (OIDC/SAML), and typical query lengths (short questions or keywords). Where a detail is not stated, make minimal, reasonable assumptions to complete the design.
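The stated constraints imply a few aggregate figures that are useful before starting the design. The following is a minimal sketch of that arithmetic, using only the numbers given above. Two points are assumptions rather than part of the question: 1 KB is taken as 1,000 bytes, and 300 QPS is treated as sustained for a full day, which gives an upper bound on volume and spend.

```python
# Back-of-envelope figures derived from the stated constraints.
DOCS = 5_000_000           # corpus size in documents (stated)
AVG_DOC_BYTES = 2_000      # avg 2 KB per document; 1 KB = 1,000 bytes is an assumption
QPS = 300                  # stated traffic
BUDGET_PER_QUERY = 0.002   # USD per query, stated ceiling
SECONDS_PER_DAY = 86_400

# Raw text volume of the corpus, before chunking, embeddings, or index overhead.
raw_corpus_gb = DOCS * AVG_DOC_BYTES / 1e9

# Upper bound on daily volume, assuming 300 QPS is sustained around the clock.
queries_per_day = QPS * SECONDS_PER_DAY

# Spend ceilings implied by the per-query budget.
max_spend_per_second = QPS * BUDGET_PER_QUERY
max_spend_per_day = queries_per_day * BUDGET_PER_QUERY

print(f"raw corpus text:      {raw_corpus_gb:.0f} GB")
print(f"queries per day:      {queries_per_day:,}")
print(f"spend ceiling per s:  ${max_spend_per_second:.2f}")
print(f"spend ceiling per day: ${max_spend_per_day:,.0f}")
```

Under these figures the raw text is about 10 GB, and the whole pipeline (retrieval, reranking, and generation) has to fit within roughly $0.60 per second of traffic at 300 QPS.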

Sub-Questions

(a) Ingestion and chunking: Describe parsing, deduplication, metadata extraction, embedding generation, chunk-size strategy, versioning, and incremental updates.

(b) Indexing and retrieval: Propose a hybrid sparse+vector approach (BM25 + ANN), metadata filters, tenant isolation, query understanding/reformulation, top-k selection, and cross-encoder reranking.

(c) Generation: Outline prompt design, grounding with citations, constrained decoding, tool usage, streaming responses, and multilingual handling.

(d) Guardrails and safety: Describe methods for hallucination reduction, citation enforcement, out-of-policy refusal, PII/security controls, and ACL-aware retrieval.

(e) Evaluation and monitoring: Define offline metrics (e.g., NDCG@k, recall@k, answer faithfulness), online A/B tests, user feedback loops, and drift/latency/cost monitoring.

(f) Architecture and scaling: Cover service decomposition, model hosting/batching, caching, vector store selection, backpressure, failover, and disaster recovery.

(g) Cost and latency calculations: Derive per-stage latency/cost, capacity plan for embeddings, ANN index size, and compute requirements. Justify model choices under the constraints.
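As one illustration of the capacity planning asked for in (g), the sketch below estimates the chunk count and ANN index size for the stated corpus. Everything except the document count and average document size is a hypothetical placeholder, not something the question specifies: the chunk size, overlap, embedding dimension, float32 storage, and the 1.5x index overhead factor are illustrative assumptions, and the function name `estimate_vector_index` is made up for this example. It also treats every document as average-sized, so it is a rough estimate only.

```python
import math


def estimate_vector_index(num_docs, avg_doc_bytes, chunk_bytes, overlap_bytes,
                          dim, bytes_per_dim, overhead_factor):
    """Return (num_chunks, index_bytes) for sliding-window chunking.

    Assumes every document has the average size, so this is a rough estimate.
    """
    stride = chunk_bytes - overlap_bytes
    if stride <= 0:
        raise ValueError("overlap_bytes must be smaller than chunk_bytes")
    # A window of chunk_bytes advancing by `stride` covers a document of
    # length L in ceil((L - overlap) / stride) chunks, with a minimum of one.
    chunks_per_doc = max(1, math.ceil((avg_doc_bytes - overlap_bytes) / stride))
    num_chunks = num_docs * chunks_per_doc
    raw_vector_bytes = num_chunks * dim * bytes_per_dim
    # overhead_factor stands in for graph links, IDs, and metadata (assumed).
    return num_chunks, raw_vector_bytes * overhead_factor


num_chunks, index_bytes = estimate_vector_index(
    num_docs=5_000_000,    # stated in the question
    avg_doc_bytes=2_000,   # stated (2 KB), taking 1 KB = 1,000 bytes
    chunk_bytes=1_000,     # assumption
    overlap_bytes=200,     # assumption
    dim=768,               # assumption: embedding dimension
    bytes_per_dim=4,       # assumption: float32 vectors
    overhead_factor=1.5,   # assumption: index overhead multiplier
)
print(f"chunks: {num_chunks:,}")
print(f"estimated index size: {index_bytes / 1e9:.1f} GB")
```

With these placeholder values the estimate comes to 15 million chunks and roughly 69 GB. Changing `bytes_per_dim` to 1 (for example, int8 quantization) shrinks the vector payload to a quarter of that, which is the kind of trade-off the model-choice justification in (g) is asking about.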
