Design an ML‑Powered Enterprise Document Search System
Context
You are designing a multi‑tenant enterprise search system that indexes documents from many sources (e.g., Google Drive, SharePoint, Confluence, Slack, GitHub) and serves low‑latency, high‑relevance results to end users. The system must enforce strict access controls, protect sensitive data, and scale cost‑effectively.
Requirements
Specify and justify:
- Functional
  - Supported content types (PDF, Office docs, emails, images with text, code, web pages).
  - Query types (keyword, natural language questions, filters/facets, multilingual).
  - Multi‑tenant isolation and per‑document access control lists (ACLs).
  - Personalization (user/org context, recency, popularity).
- Non‑functional
  - Latency SLA (e.g., P50, P95, P99 for query and for reranking).
  - Freshness SLA (e.g., soft near‑real‑time vs daily; per source).
  - Availability and durability targets.
  - Scalability targets (docs, QPS, tenants).
  - Cost constraints and levers (compute, storage, egress).
- Security, Privacy, Compliance
  - Data privacy/PII strategy, encryption in transit/at rest, key management.
  - SOC2/GDPR considerations (data residency, retention, deletion).
  - Tenant isolation (logical and optionally physical).
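The latency SLA above is expressed in percentiles. As a minimal sketch of how measured query latencies could be checked against such targets, the following uses the nearest-rank percentile definition. The function names and the target values in the usage note are illustrative assumptions, not figures from this prompt.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(samples_ms, targets_ms):
    """Map each percentile to (observed latency, whether it meets its target)."""
    report = {}
    for p, target in targets_ms.items():
        observed = percentile(samples_ms, p)
        report[p] = (observed, observed <= target)
    return report
```

For example, `check_sla(latencies, {50: 150, 95: 400, 99: 800})` would report each percentile against hypothetical targets in milliseconds. A real deployment would typically compute these from histograms in the metrics system rather than from raw samples.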
Indexing Pipeline
Describe the end‑to‑end ingestion and indexing flow:
- Connectors and Sync
  - Source connectors (APIs/webhooks/CDC), change detection, incremental sync, backfill.
  - Failure handling, retries, rate limiting, state checkpoints.
- Parsing and Enrichment
  - Document parsing (PDF/Office), OCR for images/scans, HTML cleanup, code parsing.
  - Language detection, text normalization/tokenization, boilerplate removal.
  - Metadata extraction (author, created/modified, teams, labels), ACL extraction.
  - PII/sensitive detection and optional redaction/quarantine.
  - Deduplication and near‑duplicate clustering.
  - Chunking strategy (e.g., 512–1,000 tokens with overlap) and parent mapping.
- Feature Generation
  - Lexical features (BM25/TF‑IDF, field boosts, bigrams).
  - Semantic features (embeddings for chunks and titles; multilingual models).
  - Auxiliary features (recency, popularity, click priors, authority).
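The chunking step with overlap and parent mapping listed above can be sketched as a sliding window over tokens. This is a minimal illustration, not a prescribed design: the `Chunk` type, the `chunk_tokens` helper, and the 64-token overlap are assumptions, and the synthetic token list stands in for the output of a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    parent_id: str   # ID of the source document, used to map hits back to the parent
    index: int       # position of the chunk within the parent
    start: int       # token offset where the chunk begins
    tokens: tuple    # the chunk's tokens


def chunk_tokens(parent_id, tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        chunks.append(Chunk(parent_id, index, start, tuple(tokens[start:start + size])))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for the output of a real tokenizer.
doc_tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens("doc-1", doc_tokens, size=512, overlap=64)
```

Because every chunk carries `parent_id` and `start`, a chunk-level hit can be grouped under its parent document and used to locate the snippet position.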
Storage and Retrieval
- Storage
  - Inverted index (e.g., Lucene/ES) for lexical retrieval.
  - Vector store (e.g., HNSW) for ANN semantic retrieval.
  - Sharding/partitioning across tenants; metadata and ACL storage.
- Hybrid Retrieval Strategy
  - Candidate generation from both lexical and vector stores with ACL‑aware filtering.
  - Score fusion (e.g., weighted blend or Reciprocal Rank Fusion).
  - Candidate sizes and time budgets.
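Reciprocal Rank Fusion, named above as one fusion option, combines ranked lists using only rank positions, so lexical and vector scores never need to be calibrated against each other. A minimal sketch follows; the function name, the document IDs, and the two input lists are illustrative, and `k=60` is a commonly used constant rather than a requirement.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


lexical_hits = ["d1", "d2", "d3"]   # hypothetical BM25 order
vector_hits = ["d3", "d1", "d4"]    # hypothetical ANN order
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever. The sketch assumes both input lists have already been ACL-filtered.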
Ranking and Query Understanding
- Ranking Stack
  - Options: learning‑to‑rank (e.g., LambdaMART/XGBoost) and/or neural cross‑encoder reranker.
  - Feature set and latency budget by stage.
- Query Understanding
  - Spelling correction, synonyms/abbreviations, stemming/lemmatization.
  - Embedding of queries, intent detection, facet suggestion.
  - Multilingual handling (cross‑lingual embeddings, optional translation).
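As one small piece of the query understanding stage above, normalization and abbreviation expansion can be sketched as follows. The `ABBREVIATIONS` table and both function names are hypothetical; in a real system the table would be curated per tenant or mined from logs, and the expansions would be added to the lexical query as optional alternatives.

```python
import re
import unicodedata

# Hypothetical per-tenant abbreviation table, for illustration only.
ABBREVIATIONS = {
    "k8s": ["kubernetes"],
    "pto": ["paid time off", "vacation"],
}


def normalize(query):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", query).lower()
    return re.sub(r"\s+", " ", text).strip()


def expand_query(query, table=ABBREVIATIONS):
    """Return (term, expansions) pairs; expansions are alternatives to OR with the term."""
    terms = normalize(query).split(" ")
    return [(term, table.get(term, [])) for term in terms if term]


expanded = expand_query("  K8s   PTO policy ")
```

Keeping the original term next to its expansions lets the ranker weight an exact match above an expanded one.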
Personalization and Feedback
- How to incorporate user/org context, click/dwell signals, recency and popularity.
- Feedback loops: counterfactual learning, bandits, auto‑synonym mining, human‑in‑the‑loop.
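One simple way to fold recency and popularity into a ranking score is to blend bounded priors with the base relevance. The sketch below is an assumption-laden illustration: the half-life, the saturation point, and the weights are placeholders that would normally be tuned or learned, and the function names are hypothetical.

```python
import math


def recency_weight(age_days, half_life_days=30.0):
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def popularity_weight(clicks, saturation=100.0):
    """Log-scaled click count, capped at 1.0 so heavily clicked documents cannot dominate."""
    return min(math.log1p(clicks) / math.log1p(saturation), 1.0)


def personalized_score(relevance, age_days, clicks, w_recency=0.2, w_popularity=0.1):
    """Blend base relevance (assumed to lie in [0, 1]) with recency and popularity priors."""
    w_relevance = 1.0 - w_recency - w_popularity
    return (w_relevance * relevance
            + w_recency * recency_weight(age_days)
            + w_popularity * popularity_weight(clicks))
```

In a learning-to-rank setup these two weights would more likely enter as features than as a fixed blend. Click counts should be aggregated only over users allowed to see the document, so that popularity does not leak its existence.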
Evaluation and Monitoring
- Offline Evaluation
  - Metrics: NDCG@k, Recall@k, MRR; dataset curation; train/val/test splits.
- Online Evaluation
  - A/B testing, interleaving, guardrails, success KPIs.
- Monitoring
  - Latency, errors, zero‑result rate, freshness lag, ANN recall probes, data/feature drift.
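The offline metrics named above can be computed with a few short functions. This sketch uses linear gain for NDCG (some formulations use 2^rel - 1 instead); the function names and input shapes are assumptions for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, graded, k):
    """NDCG@k; `graded` maps doc ID -> graded relevance (missing IDs count as 0)."""
    gains = [graded.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = dcg_at_k(sorted(graded.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of the relevant set that appears in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over (ranked_ids, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

In a permissioned corpus, relevance judgments are only meaningful relative to what the judging user can access, so the evaluation set should record the user context alongside each query.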
Guardrails
- Sensitive content filters, PII handling, leakage prevention in snippets and counts.
- Safety, abuse and enumeration defenses; rate limiting.
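Rate limiting, listed above as an enumeration defense, is often implemented with a token bucket. The following is a minimal single-process sketch; the class, the `allow_query` helper, the per-(tenant, user) keying, and the default rates are all assumptions, and a production limiter would usually keep its state in a shared store.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested deterministically
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per (tenant, user) pair; the key shape is an assumption of this sketch.
buckets = {}


def allow_query(tenant_id, user_id, rate=5.0, capacity=10):
    key = (tenant_id, user_id)
    if key not in buckets:
        buckets[key] = TokenBucket(rate, capacity)
    return buckets[key].allow()
```

Charging a higher `cost` for expensive queries, such as those that trigger reranking, is one way to tie the limiter to actual load.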
Scalability and Operations
- Capacity planning; horizontal scaling; multi‑region DR.
- Caching layers (query/results, embeddings), invalidation.
- Result deduplication and grouping.
- Multilingual support details (tokenization, CJK, right‑to‑left).
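Capacity planning for the vector index is mostly arithmetic on vector count, dimensionality, and replication. The sketch below shows the shape of that estimate. Every input is an illustrative assumption (100M documents, 5 chunks each, 768 dimensions, float32, 32 graph links per vector, 2 replicas, 64 GiB usable per node), not a figure from this prompt.

```python
def vector_index_bytes(num_docs, chunks_per_doc, dims,
                       bytes_per_dim=4, graph_links=32, replicas=2):
    """Rough memory estimate for an HNSW-style index holding one vector per chunk."""
    vectors = num_docs * chunks_per_doc
    raw = vectors * dims * bytes_per_dim      # float32 embeddings
    graph = vectors * graph_links * 4         # about 4 bytes per neighbor ID (assumption)
    return (raw + graph) * replicas


def shards_needed(total_bytes, usable_bytes_per_node):
    """Ceiling division: the minimum number of nodes whose memory covers the index."""
    return -(-total_bytes // usable_bytes_per_node)


# Illustrative inputs only.
total = vector_index_bytes(100_000_000, 5, 768)
nodes = shards_needed(total, 64 * 1024**3)
```

Under these assumptions the index needs about 3.2 TB of memory across 47 nodes, which makes the cost levers visible: quantizing vectors or lowering dimensionality shrinks the dominant `raw` term.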
Access Control Enforcement
- Exact plan for enforcing ACLs at query time across lexical and vector retrieval without leaking unauthorized documents.
- Side‑channel protections (counts, snippets, suggestions).
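One way to reason about the enforcement order is: authorize first, then rank, count, and build snippets from authorized hits only. The sketch below illustrates that ordering on an in-memory candidate list. The `Hit` type, the principal naming (`u:` and `g:` prefixes), and the function names are hypothetical, and a real system would also push the same filter into the lexical and vector queries so that unauthorized documents do not crowd out the top-k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    tenant_id: str
    allowed: frozenset   # principals (user and group IDs) that may read the document
    score: float
    text: str


def authorized(hit, tenant_id, principals):
    """Visible only within its own tenant and to at least one of the caller's principals."""
    return hit.tenant_id == tenant_id and not hit.allowed.isdisjoint(principals)


def secure_results(candidates, tenant_id, principals, top_n=10, snippet_chars=80):
    """Filter first, then rank, count, and build snippets from authorized hits only."""
    visible = [h for h in candidates if authorized(h, tenant_id, principals)]
    visible.sort(key=lambda h: h.score, reverse=True)
    page = visible[:top_n]
    return {
        # The count comes from the visible set, so it reveals nothing about hidden documents.
        "total": len(visible),
        "results": [{"doc_id": h.doc_id, "snippet": h.text[:snippet_chars]} for h in page],
    }
```

The same rule applies to suggestions and facets: anything derived from document content should be computed after the authorization filter, never before it.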