ML Search, Embeddings, And Vector Retrieval
Asked of: ML Engineer
Last updated

What's being tested
Candidates must demonstrate end-to-end Retrieval-Augmented Generation (RAG) system design skills: how to convert raw documents into retrievable vectors, choose and operate an efficient vector index, integrate retrieval with LLM prompting, and build evaluation/monitoring that ensures production robustness. Interviewers probe tradeoffs between recall/latency/cost, how to evaluate retrievers and rerankers, and operational concerns an ML Engineer owns (model serving, offline/online parity, drift detection, embedding refresh). The focus is technical and implementation-oriented, not product roadmaps.
Core knowledge
- RAG pillars: ingestion & chunking, embedding encoding, vector indexing (ANN), retrieval/fusion, LLM prompting/generation, and monitoring. Each pillar has measurable SLOs (recall@k, p95 latency, cost per query).
- Embedding model choices: off-the-shelf (e.g., sentence-transformers/SBERT) or proprietary fine-tuned encoders. Dimensionality tradeoff: higher D increases representational power, but vector storage and distance-computation cost grow roughly as O(D·N).
- Similarity math: use cosine similarity for length-invariant semantics (normalize vectors) or dot product when the model was trained for inner-product scoring. Relationship: cosine(u,v) = u·v / (||u|| ||v||), so the two coincide on unit-normalized vectors.
- ANN families: HNSW (low-latency, high-memory; parameters M, efConstruction, efSearch), IVF+PQ (inverted file + product quantization: high compression, scales to billions of vectors), Flat (exact, expensive). Know Faiss, Annoy, Hnswlib, Milvus, Pinecone.
- Index sizing: for N ≲ 10M, HNSW in RAM is practical; for N ≫ 100M, use IVF+PQ or sharded HNSW with disk-backed clusters and quantization to reduce memory by 4–16x.
- Hybrid retrieval: combine lexical (BM25) and semantic (vector) scores; use a linear combination or a learned ranker. Lexical handles exact-match and rare tokens; vector handles semantic paraphrase.
- Two-stage retrieval: bi-encoder (fast) for recall@k, then cross-encoder reranker (expensive, slow) for precision. Typical pattern: retrieve ~100 candidates with the bi-encoder, then rerank them with the cross-encoder and keep the top 10.
- Chunking & context: tune chunk size to the model's context window; use overlap (10–20%) to avoid losing content at chunk boundaries. For long docs, create both chunk-level and doc-level embeddings for hierarchical retrieval.
- Embedding update strategy: never mix vectors produced by different embedding-model versions in one index; version embeddings with model ID and timestamp; schedule re-embedding based on document churn and model upgrades (a staleness budget).
- Training retrievers: use a contrastive loss with in-batch negatives and hard-negative mining (e.g., using BM25 or the current retriever). Evaluate with Recall@k, MRR, NDCG, and offline-to-online correlation checks.
- Inference serving: batch embedding requests to amortize GPU cost; keep the embedding model on GPU and serve nearest-neighbor queries from a CPU-based vector index; measure end-to-end p95 and cold-start latencies.
- Monitoring & drift: track embedding distribution shifts (e.g., mean cosine to centroid), recall degradation on seeded queries, and model input distribution shift. Automate alerts and canary re-embedding runs.
- Cost/latency tradeoffs: a cross-encoder improves quality but multiplies latency and compute; caching top-k retrievals and answers for repeated queries reduces cost.
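To make the similarity math and the hybrid-retrieval bullet concrete, here is a minimal pure-Python sketch. It assumes per-document BM25 and vector scores are already available as dictionaries; the function names, the min-max scaling step, and the `alpha` weight are illustrative choices, not something the guide prescribes. A learned ranker would replace `hybrid_rank` in a real system.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale v to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """cosine(u, v) = u.v / (||u|| ||v||)."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    if denom == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot(u, v) / denom


def min_max(scores):
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable.

    Assumption: min-max scaling is one simple option among several.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(lexical, semantic, alpha=0.5):
    """Linear fusion: alpha * semantic + (1 - alpha) * lexical.

    Documents missing from one retriever contribute 0 for that side.
    """
    lex, sem = min_max(lexical), min_max(semantic)
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for four documents from the two retrievers.
bm25_scores = {"a": 12.0, "b": 3.0, "c": 0.5}
vector_scores = {"b": 0.91, "c": 0.88, "d": 0.40}
ranking = hybrid_rank(bm25_scores, vector_scores, alpha=0.5)
```

Raising `alpha` favors paraphrase matches from the vector side; lowering it favors exact-match and rare-token hits from BM25.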
Worked example — Design a production RAG system
First 30 seconds, clarifying questions: ask for expected query volume (QPS), dataset size (number of documents, average document length), latency SLO (p95), and security/tenant-isolation constraints. Frame the answer around three pillars: document ingestion & chunking, retrieval stack (embeddings + ANN + hybrid scoring), and generation & serving (LLM prompt design, reranking, caching, evaluation). For ingestion, state a chunk size (e.g., 1–2k tokens with 20% overlap) and extract metadata for filtering. For retrieval, propose a bi-encoder for online vector lookup using Faiss HNSW for N up to ~10M, plus a lightweight BM25 lexical retriever for hybrid scoring; run a cross-encoder reranker as a second stage for the top-10 candidates. Highlight a concrete tradeoff: pick HNSW for low-latency interactive UX but accept ~2–4x memory overhead; for 100M+ docs prefer IVF+PQ to keep RAM bounded at the cost of slightly worse recall. Close with next steps: prototype p95/p99 latency experiments, A/B test reranker vs. no reranker, and implement the embedding-refresh policy and monitoring hooks.
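The ingestion step above (fixed-size chunks with proportional overlap) can be sketched as follows. This is an illustrative helper, not a reference implementation: `chunk_tokens` is a hypothetical name, and the input is assumed to be an already tokenized list, whereas a real pipeline would use the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks that overlap by overlap_ratio.

    Each chunk records its start/end offsets so that adjacent chunks can be
    identified again at retrieval time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # always >= 1 given the checks above

    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append({"start": start, "end": end, "tokens": tokens[start:end]})
        if end == len(tokens):
            break  # last chunk reached; avoid emitting a redundant tail chunk
        start += stride
    return chunks


# Toy example: whitespace "tokens" stand in for real tokenizer output.
text = "one two three four five six seven eight nine ten eleven twelve"
for chunk in chunk_tokens(text.split(), chunk_size=5, overlap_ratio=0.2):
    print(chunk["start"], chunk["end"], " ".join(chunk["tokens"]))
```

Storing offsets alongside each chunk is a design choice that keeps later steps (metadata filtering, merging neighbors from the same document) straightforward.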
A second angle — Design LLM search handling long token inputs
This constraint shifts the emphasis to context management and hierarchical retrieval. Use chunk-level embeddings for granular retrieval, then perform a second-stage grouping: fetch the top chunks and merge adjacent chunks from the same document to preserve continuity. If the final LLM has a limited context window, use a condense-then-augment pattern: ask a small LLM to summarize the relevant chunks into a compressed context, then query the large LLM; with a long-context LLM this condensation step is optional. Also weigh retrieval-then-synthesis against retrieval-then-streaming generation depending on latency needs. Key design decision: favor retrieval depth (more chunks) when accuracy is critical, but compress aggressively when latency or token cost dominates.
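The second-stage grouping described above (merging adjacent chunks from the same document) might look like the sketch below. The hit format `(doc_id, chunk_index, score, text)` and the choice to score a merged span by its best chunk are assumptions made for illustration. If chunks overlap, the duplicated text would still need trimming before prompting.

```python
from collections import defaultdict


def merge_adjacent_chunks(hits):
    """Group retrieved chunks into contiguous spans per document.

    hits: iterable of (doc_id, chunk_index, score, text) tuples.
    Returns spans sorted by their best chunk score, highest first.
    """
    by_doc = defaultdict(list)
    for doc_id, chunk_index, score, text in hits:
        by_doc[doc_id].append((chunk_index, score, text))

    spans = []
    for doc_id, items in by_doc.items():
        items.sort(key=lambda item: item[0])  # order chunks by position
        current = None
        for chunk_index, score, text in items:
            if current is not None and chunk_index == current["end"]:
                # Duplicate hit for the same chunk: keep the better score.
                current["score"] = max(current["score"], score)
            elif current is not None and chunk_index == current["end"] + 1:
                # Directly follows the open span: extend it.
                current["end"] = chunk_index
                current["score"] = max(current["score"], score)
                current["texts"].append(text)
            else:
                if current is not None:
                    spans.append(current)
                current = {
                    "doc_id": doc_id,
                    "start": chunk_index,
                    "end": chunk_index,
                    "score": score,
                    "texts": [text],
                }
        if current is not None:
            spans.append(current)

    spans.sort(key=lambda span: span["score"], reverse=True)
    return spans
```

Returning the texts as a list rather than a joined string leaves the caller free to deduplicate overlap or to pass each span to a condensation step.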
Common pitfalls
Pitfall: Evaluating only Recall@k or MRR offline and assuming production quality; offline metrics can misalign with downstream generation quality, so always measure end-to-end user-facing metrics (answer accuracy, hallucination rate) and latency.
Pitfall: Treating embedding updates as free — regenerating embeddings for millions of docs is nontrivial; failing to version embeddings or coordinate rollouts causes inconsistent retrieval behavior across users.
Pitfall: Over-emphasizing a single model without hybrid fallbacks — relying solely on semantic vectors can fail on exact-match, numeric, or fresh factual queries; combine lexical, vector, and metadata filters and surface provenance.
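For reference, the offline metrics named in the first pitfall can be computed with a few lines of Python. This is a minimal sketch assuming each query comes with a ranked list of document IDs and a set of labeled relevant IDs; the function names and input layout are illustrative. These numbers are only the offline half of the picture and should be tracked alongside end-to-end answer quality.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """Average Recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("need at least one query to evaluate")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two hypothetical labeled queries.
runs = [
    (["d1", "d2", "d3"], {"d2"}),
    (["d4", "d5", "d6"], {"d6", "d9"}),
]
metrics = evaluate(runs, k=2)
```

Running the same function on a fixed set of seeded queries after each index or model change gives a simple regression check for retrieval quality.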
Connections
Interviewers may pivot to adjacent MLE topics: fine-tuning/retraining pipelines for embedding models (data labeling, loss functions, and deployment), or model-serving SLOs & autoscaling for embedding and reranker services. They might also explore A/B testing and offline→online evaluation parity for ranking systems.
Further reading
- Dense Passage Retrieval (DPR) paper: contrastive training and retrieval setup foundations.
- Faiss library: engineering reference for IVF/PQ and HNSW implementations and tradeoffs.
Practice questions
- Design and optimize a RAG system · OpenAI · Machine Learning Engineer · Onsite · hard
- Design a search query autocomplete system · OpenAI · Machine Learning Engineer · Onsite · hard
- Design an image/video near-duplicate detection system · OpenAI · Machine Learning Engineer · Onsite · hard
- Design an ML search system · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design a production RAG system · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design enterprise RAG search system · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design an enterprise RAG system · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design an ML search system with RAG · OpenAI · Machine Learning Engineer · Technical Screen · hard
- Design LLM search handling long token inputs · OpenAI · Machine Learning Engineer · Onsite · hard