System Design: ML-Powered Enterprise Search with RAG
Design an ML-powered enterprise search system using Retrieval-Augmented Generation (RAG) under the following context and constraints.
Context and Constraints
- Corpus: 5M documents (avg 2 KB each) sourced from PDFs, web pages, and support tickets.
- Freshness: Updates must be searchable within 5 minutes end-to-end.
- Traffic: 300 QPS, multi-tenant with per-document ACLs (users/groups/roles).
- SLOs: p95 latency ≤ 1.2 s end-to-end; budget ≤ $0.002 per query.
Assume textual content (no heavy images), standard enterprise auth (OIDC/SAML), and typical query lengths (short questions/keywords). If not stated, make minimal, reasonable assumptions to complete the design.
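Before diving into the sub-questions, it helps to turn these constraints into rough numbers. The sketch below derives corpus size, vector-store footprint, and the daily cost ceiling directly from the stated figures; the one-chunk-per-document split, 768-dim embeddings, and float32 storage are illustrative assumptions, not givens.

```python
# Back-of-envelope sizing from the stated constraints.
# Assumptions (not in the prompt): ~1 chunk per 2 KB document,
# 768-dim float32 embeddings.

NUM_DOCS = 5_000_000
AVG_DOC_BYTES = 2 * 1024       # 2 KB average document (given)
CHUNKS_PER_DOC = 1             # a ~500-token doc fits one chunk (assumption)
EMBED_DIM = 768                # assumed embedding dimensionality
BYTES_PER_FLOAT = 4            # float32

raw_corpus_gb = NUM_DOCS * AVG_DOC_BYTES / 1e9
num_chunks = NUM_DOCS * CHUNKS_PER_DOC
vector_gb = num_chunks * EMBED_DIM * BYTES_PER_FLOAT / 1e9

QPS = 300                      # given
BUDGET_PER_QUERY = 0.002       # given
daily_cost_ceiling = QPS * BUDGET_PER_QUERY * 86_400

print(f"raw corpus:   {raw_corpus_gb:.1f} GB")
print(f"vector store: {vector_gb:.1f} GB ({num_chunks:,} x {EMBED_DIM}-dim float32)")
print(f"daily budget: ${daily_cost_ceiling:,.0f} at sustained full load")
```

At these sizes the full vector index fits comfortably in RAM on a small number of nodes, which shapes many of the answers below.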
Sub-Questions
(a) Ingestion and chunking: Describe parsing, deduplication, metadata extraction, embedding generation, chunk-size strategy, versioning, and incremental updates.
(b) Indexing and retrieval: Propose a hybrid sparse+vector approach (BM25 + ANN), metadata filters, tenant isolation, query understanding/reformulation, top-k selection, and cross-encoder reranking.
(c) Generation: Outline prompt design, grounding with citations, constrained decoding, tool usage, streaming responses, and multilingual handling.
(d) Guardrails and safety: Methods for hallucination reduction, citation enforcement, out-of-policy refusal, PII/security controls, and ACL-aware retrieval.
(e) Evaluation and monitoring: Offline metrics (e.g., NDCG@k, recall@k, answer faithfulness), online A/B tests, user feedback loops, and drift/latency/cost monitoring.
(f) Architecture and scaling: Service decomposition, model hosting/batching, caching, vector store selection, backpressure, failover, and disaster recovery.
(g) Cost and latency calculations: Derive per-stage latency and cost budgets, a capacity plan for embedding generation, the ANN index size, and compute requirements. Justify model choices under the constraints.
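As a worked sketch of the kind of derivation (g) asks for, the snippet below allocates the 1.2 s p95 budget across stages and checks a per-query LLM cost against the $0.002 ceiling. The stage split, token counts, and per-token prices are hypothetical assumptions chosen to illustrate the arithmetic.

```python
# Illustrative per-stage budget that respects the SLOs; the stage split,
# token counts, and per-token prices are assumptions, not givens.

P95_BUDGET_MS = 1200  # given SLO
stage_latency_ms = {
    "query understanding + query embedding": 30,
    "hybrid retrieval (BM25 + ANN)": 100,
    "cross-encoder rerank (top ~50)": 150,
    "LLM generation (streamed)": 800,
    "network / auth / overhead": 120,
}
total_ms = sum(stage_latency_ms.values())
assert total_ms <= P95_BUDGET_MS  # budget must close end-to-end

# Cost per query, with hypothetical prices for a small hosted model:
# $0.50 per 1M input tokens, $1.50 per 1M output tokens.
INPUT_PRICE = 0.50 / 1e6
OUTPUT_PRICE = 1.50 / 1e6
context_tokens, output_tokens = 3_000, 300  # assumed prompt + answer sizes

llm_cost = context_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"p95 budget used: {total_ms} ms of {P95_BUDGET_MS} ms")
print(f"LLM cost/query:  ${llm_cost:.5f} (ceiling $0.00200)")
```

Under these assumptions generation dominates both latency and cost, so the $0.002 budget effectively forces a small or heavily batched model and a tight context window.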