Design an ML search system

This is an ML System Design interview question from OpenAI for Machine Learning Engineer roles.

Question

Design an ML‑Powered Enterprise Document Search System

Context

You are designing a multi‑tenant enterprise search system that indexes documents from many sources (e.g., Google Drive, SharePoint, Confluence, Slack, GitHub) and serves low‑latency, high‑relevance results to end users. The system must enforce strict access controls, protect sensitive data, and scale cost‑effectively.

Requirements

Specify and justify:

Functional
- Supported content types (PDF, Office docs, emails, images with text, code, web pages).
- Query types (keyword, natural language questions, filters/facets, multilingual).
- Multi‑tenant isolation and per‑document access control lists (ACLs).
- Personalization (user/org context, recency, popularity).
Non‑functional
- Latency SLA (e.g., P50, P95, P99 for query and for reranking).
- Freshness SLA (e.g., soft near‑real‑time vs daily; per source).
- Availability and durability targets.
- Scalability targets (docs, QPS, tenants).
- Cost constraints and levers (compute, storage, egress).
Security, Privacy, Compliance
- Data privacy/PII strategy, encryption in transit/at rest, key management.
- SOC2/GDPR considerations (data residency, retention, deletion).
- Tenant isolation (logical and optionally physical).

Indexing Pipeline

Describe the end‑to‑end ingestion and indexing flow:

Connectors and Sync
- Source connectors (APIs/webhooks/CDC), change detection, incremental sync, backfill.
- Failure handling, retries, rate limiting, state checkpoints.
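
The retry and rate-limit handling above can be sketched as exponential backoff with jitter. This is a minimal illustration, not a prescribed connector design: `RateLimitedError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error a connector raises when the source API throttles it."""


def call_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying throttled calls with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the sync scheduler.
            # "Full jitter": wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. A real connector would also persist a checkpoint (for example, the last processed change token) after each successful page so a restart resumes instead of re-crawling.
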
Parsing and Enrichment
- Document parsing (PDF/Office), OCR for images/scans, HTML cleanup, code parsing.
- Language detection, text normalization/tokenization, boilerplate removal.
- Metadata extraction (author, created/modified, teams, labels), ACL extraction.
- PII/sensitive detection and optional redaction/quarantine.
- Deduplication and near‑duplicate clustering.
- Chunking strategy (e.g., 512–1,000 tokens with overlap) and parent mapping.
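
As one possible reading of the chunking item above, here is a sketch of fixed-size chunking with overlap that keeps a pointer back to the parent document. The `Chunk` type, the function name, and the default overlap of 64 tokens are assumptions for illustration; tokenization is left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    parent_id: str      # ID of the source document (the parent mapping)
    index: int          # Position of this chunk within the parent
    start: int          # Token offset of the chunk's first token in the parent
    tokens: List[str]


def chunk_tokens(
    parent_id: str, tokens: List[str], size: int = 512, overlap: int = 64
) -> List[Chunk]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(tokens), step):
        chunks.append(Chunk(parent_id, len(chunks), start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # The last window already reaches the end of the document.
    return chunks
```

Storing `parent_id` and `start` lets the ranking layer group chunk hits back into documents and build snippets from the right offsets.
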
Feature Generation
- Lexical features (BM25/TF‑IDF, field boosts, bigrams).
- Semantic features (embeddings for chunks and titles; multilingual models).
- Auxiliary features (recency, popularity, click priors, authority).
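
For the lexical features above, a small BM25 scorer may help make the formula concrete. This sketch uses the non-negative IDF variant (the form used by Lucene) and the common defaults `k1 = 1.2`, `b = 0.75`; documents are assumed to be pre-tokenized, and the function name is illustrative.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Return one BM25 score per document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a real deployment the inverted index computes this score; the sketch only shows what the lexical feature represents when it is fed to a downstream ranker.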

Storage and Retrieval

Storage
- Inverted index (e.g., Lucene/Elasticsearch) for lexical retrieval.
- Vector index (e.g., HNSW) for approximate nearest neighbor (ANN) semantic retrieval.
- Sharding/partitioning across tenants; metadata and ACL storage.
Hybrid Retrieval Strategy
- Candidate generation from both lexical and vector stores with ACL‑aware filtering.
- Score fusion (e.g., weighted blend or Reciprocal Rank Fusion).
- Candidate sizes and time budgets.
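
Reciprocal Rank Fusion, mentioned above, can be sketched in a few lines. Each document's fused score is the sum of `1 / (k + rank)` over the ranked lists it appears in; `k = 60` is a commonly used constant, and the function name and tie-breaking by ID are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]   # e.g., BM25 order
semantic = ["b", "d", "a"]  # e.g., ANN order
fused = reciprocal_rank_fusion([lexical, semantic])
```

Because RRF uses only ranks, it avoids having to calibrate BM25 scores against vector similarities, which is the main difficulty with a weighted blend.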

Ranking and Query Understanding

Ranking Stack
- Options: learning‑to‑rank (e.g., LambdaMART/XGBoost) and/or neural cross‑encoder reranker.
- Feature set and latency budget by stage.
Query Understanding
- Spelling correction, synonyms/abbreviations, stemming/lemmatization.
- Embedding of queries, intent detection, facet suggestion.
- Multilingual handling (cross‑lingual embeddings, optional translation).

Personalization and Feedback

- How to incorporate user/org context, click/dwell signals, recency and popularity.
- Feedback loops: counterfactual learning, bandits, auto‑synonym mining, human‑in‑the‑loop.

Evaluation and Monitoring

Offline Evaluation
- Metrics: NDCG@k, Recall@k, MRR; dataset curation; train/val/test splits.
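
The offline metrics listed above can be computed per query roughly as follows. This is a sketch under stated assumptions: DCG uses linear gain (some definitions use `2**rel - 1` instead), relevance labels are graded floats keyed by document ID, and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    gains = [relevance.get(d, 0.0) for d in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

MRR is the mean of `reciprocal_rank` over the query set; NDCG@k and Recall@k are likewise averaged across queries.
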
Online Evaluation
- A/B testing, interleaving, guardrails, success KPIs.
Monitoring
- Latency, errors, zero‑result rate, freshness lag, ANN recall probes, data/feature drift.
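
An ANN recall probe, as listed above, compares what the approximate index returned with an exact brute-force search over a sample. The sketch below assumes dot-product similarity and small in-memory vectors; `exact_top_k` and `ann_recall` are hypothetical helper names, and the ANN result is supplied by whatever index is under test.

```python
from typing import Dict, List, Sequence


def exact_top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute-force top-k by dot product; the ground truth for the probe."""
    def dot(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    return sorted(vectors, key=lambda i: (-dot(query, vectors[i]), i))[:k]


def ann_recall(ann_ids: List[str], exact_ids: List[str]) -> float:
    """Fraction of the exact top-k that the ANN index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
truth = exact_top_k([1.0, 0.0], vectors, k=2)
probe = ann_recall(["a", "c"], truth)  # "a", "c" stands in for an ANN response
```

Running such a probe on a small sample of live queries and alerting when recall drops is one way to catch index degradation after rebuilds or parameter changes.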

Guardrails

- Sensitive content filters, PII handling, leakage prevention in snippets and counts.
- Safety, abuse and enumeration defenses; rate limiting.
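
For the rate-limiting item above, a token bucket is one common choice. The sketch below is an assumption about how a per-user limiter could look, not a required design; the clock is passed in explicitly so the behavior is deterministic in tests.

```python
class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per (tenant, user) pair, with a higher `cost` for expensive queries, also slows down enumeration attempts that probe for document existence.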

Scalability and Operations

- Capacity planning; horizontal scaling; multi‑region DR.
- Caching layers (query/results, embeddings), invalidation.
- Result deduplication and grouping.
- Multilingual support details (tokenization, CJK, right‑to‑left).

Access Control Enforcement

- Exact plan for enforcing ACLs at query time across lexical and vector retrieval without leaking unauthorized documents.
- Side‑channel protections (counts, snippets, suggestions).
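
One way to picture query-time enforcement is a single visibility predicate applied to candidates from both retrieval paths before anything user-visible (ranking, counts, snippets) is derived. The sketch below is illustrative only: `Candidate`, `authorized_top_k`, and the principal strings are hypothetical, and a production system would push the same filter into the lexical and vector indexes rather than filter in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    tenant_id: str
    allowed_principals: FrozenSet[str]  # Users and groups on the document's ACL
    score: float


def authorized_top_k(
    candidates: Iterable[Candidate],
    tenant_id: str,
    principals: FrozenSet[str],
    k: int,
) -> Tuple[List[str], int]:
    """Return the top-k visible doc IDs and a count computed only from visible docs."""
    visible = [
        c for c in candidates
        if c.tenant_id == tenant_id and c.allowed_principals & principals
    ]
    visible.sort(key=lambda c: (-c.score, c.doc_id))
    # The count comes from the filtered list, so it cannot reveal hidden documents.
    return [c.doc_id for c in visible[:k]], len(visible)


pool = [
    Candidate("a", "t1", frozenset({"group:eng"}), 0.90),
    Candidate("b", "t1", frozenset({"user:bob"}), 0.95),
    Candidate("c", "t2", frozenset({"group:eng"}), 0.99),
]
ids, total = authorized_top_k(pool, "t1", frozenset({"user:alice", "group:eng"}), k=10)
```

Here the higher-scoring documents `b` (wrong principal) and `c` (wrong tenant) never reach the result list or the count, which is the property the side-channel item asks for.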