ML Search, Embeddings, And Vector Retrieval

What's being tested

Candidates must demonstrate end-to-end Retrieval-Augmented Generation (RAG) system design skills: how to convert raw documents into retrievable vectors, choose and operate an efficient vector index, integrate retrieval with LLM prompting, and build evaluation and monitoring that keep the system robust in production. Interviewers probe tradeoffs among recall, latency, and cost, how to evaluate retrievers and rerankers, and the operational concerns an ML Engineer owns (model serving, offline/online parity, drift detection, embedding refresh). The focus is technical and implementation-oriented, not product roadmaps.

Core knowledge

RAG pillars: ingestion & chunking, embedding encoding, vector indexing (ANN), retrieval/fusion, LLM prompting/generation, and monitoring. Each pillar has measurable SLOs (recall@k, p95 latency, cost per query).
Embedding model choices: off-the-shelf (e.g., sentence-transformers/SBERT) or proprietary fine-tuned encoders. Dimensionality tradeoff: higher D increases representational power, but vector storage grows roughly as O(D·N) and every distance computation costs O(D).
Similarity math: use cosine similarity for length-invariant semantics (normalize vectors) or dot product when the model was trained with an inner-product objective. Relationship: cosine(u,v) = u·v / (||u|| ||v||), so on L2-normalized vectors the dot product equals cosine similarity.
ANN families: HNSW (low-latency, high-memory, parameters M, efConstruction, efSearch), IVF+PQ (inverted-file + product quantization: high compression, efficient for billions), Flat (exact, expensive). Know Faiss, Annoy, Hnswlib, Milvus, Pinecone.
Index sizing: for N ≲ 10M, HNSW in RAM is practical; for N≫100M, use IVF+PQ or sharded HNSW with disk-backed clusters and quantization to reduce memory by 4–16x.
Hybrid retrieval: combine lexical (BM25) and semantic (vector) scores; use linear combination or learned ranker. Lexical handles exact-match and rare tokens; vector handles semantic paraphrase.
Two-stage retrieval: bi-encoder (fast) for recall@k, then cross-encoder reranker (expensive, slow) for precision. Typical pattern: retrieve ~100 candidates with the bi-encoder, rerank them with the cross-encoder, and keep the top 10.
Chunking & context: chunk size tuned to model context window; use overlap (10–20%) to avoid boundary loss. For long docs, create both chunk-level and doc-level embeddings for hierarchical retrieval.
Embedding update strategy: never mix vectors produced by different encoder versions in one index. Version embeddings with model id and timestamp, and schedule re-embedding based on document churn and model upgrades (a staleness budget).
Training retrievers: use a contrastive loss with in-batch negatives plus hard-negative mining (e.g., using BM25 or the current retriever). Evaluate with Recall@k, MRR, NDCG, and offline-to-online correlation checks.
Inference serving: keep the embedding model on GPU and batch incoming requests to amortize GPU cost; serve nearest-neighbor queries from a CPU-based vector index; measure end-to-end p95 and cold-start latencies.
Monitoring & drift: track embedding distribution shifts (mean cosine to centroid), recall degradation on seeded queries, and model input distribution shift. Automate alerts and canary re-embedding runs.
Cost/latency tradeoffs: cross-encoder improves quality but multiplies latency and compute; caching top-k retrievals and answers for repeated queries reduces cost.
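
The similarity relationship and the O(D·N) storage estimate above can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python; the function names, the 768-dimension figure, and the float32 (4 bytes per component) assumption are illustrative choices, not values from this guide.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector has no direction."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(u, v):
    """cosine(u, v) = u.v / (||u|| ||v||)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def flat_index_bytes(num_vectors, dim, bytes_per_component=4):
    """Raw storage for uncompressed vectors (assumes float32 by default).

    Excludes graph links, ids, and any other index overhead.
    """
    return num_vectors * dim * bytes_per_component

u, v = [3.0, 4.0], [4.0, 3.0]
# After L2 normalization the dot product and cosine similarity agree.
same = math.isclose(cosine(u, v), dot(l2_normalize(u), l2_normalize(v)))

# Hypothetical corpus: 10M vectors at 768 dimensions, float32.
raw_gb = flat_index_bytes(10_000_000, 768) / 1e9  # about 30.7 GB before overhead
```

Running the sizing function against a candidate corpus size is a quick way to decide whether an in-RAM index is even on the table before discussing quantization.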

Worked example — Design a production RAG system

First 30s clarifying Qs: ask for expected query volume (QPS), dataset size (number of documents, average document length), latency SLO (p95), and security/tenant-isolation constraints. Frame the answer around three pillars: document ingestion & chunking, retrieval stack (embeddings + ANN + hybrid scoring), and generation & serving (LLM prompt design, reranking, caching, evaluation). For ingestion, state a chunk size (e.g., 1–2k tokens with 20% overlap) and extract metadata for filtering. For retrieval, propose a bi-encoder for online vector lookup using a Faiss HNSW index for N up to ~10M, plus a lightweight BM25 lexical pre-filter; run a cross-encoder reranker as a second stage to select the top 10 candidates. Highlight a concrete tradeoff: pick HNSW for low-latency interactive UX but accept ~2–4x memory overhead; for 100M+ docs prefer IVF+PQ to keep RAM bounded at the cost of somewhat lower recall. Close with next steps: prototype and measure p95/p99 latency, A/B test reranker vs. no reranker, and implement an embedding-refresh policy and monitoring hooks.
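
The chunking step in the ingestion pillar can be sketched as a sliding window over a token list. This is a minimal illustration, assuming tokens are already produced by some tokenizer; `chunk_tokens` and its parameters are hypothetical names, and a real pipeline would likely also respect sentence or section boundaries.

```python
def chunk_tokens(tokens, chunk_size, overlap_ratio=0.2):
    """Split a token list into windows that share `overlap_ratio` of their length.

    The last chunk may be shorter than `chunk_size`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks

# Toy document of 25 "tokens", windows of 10 with 20% (2-token) overlap.
chunks = chunk_tokens(list(range(25)), chunk_size=10, overlap_ratio=0.2)
```

With these toy numbers the stride is 8 tokens, so consecutive chunks share their boundary tokens and a sentence that straddles a boundary appears whole in at least one chunk, provided it is shorter than the overlap.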

A second angle — Design LLM search handling long token inputs

This constraint shifts emphasis to context management and hierarchical retrieval. Use chunk-level embeddings for granular retrieval, then perform a second-stage grouping: fetch top chunks, merge adjacent chunks from same doc to provide continuity, and optionally run an on-the-fly condensation step (LLM-generated summary of retrieved chunks) before invoking the final long-context LLM. If the LLM has limited context, use a condense-then-augment pattern: ask a small LLM to summarize relevant chunks into a compressed context window, then query the large LLM. Also consider retrieval-then-synthesis vs retrieval-then-streaming generation depending on latency needs. Key design decision: favor retrieval depth (more chunks) when accuracy is critical, but compress aggressively when latency or token-cost dominates.
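
The "merge adjacent chunks from the same doc" step can be sketched as grouping hits by document and joining runs of consecutive chunk indices. This is a minimal sketch under stated assumptions: hits arrive as hypothetical `(doc_id, chunk_index, text)` tuples, and chunk texts are joined naively, so any overlap between neighbouring chunks is not deduplicated here.

```python
from collections import defaultdict

def merge_adjacent_chunks(hits):
    """Merge retrieved chunks with consecutive indices within the same document.

    Returns (doc_id, first_index, last_index, merged_text) tuples.
    """
    by_doc = defaultdict(dict)
    for doc_id, chunk_idx, text in hits:
        by_doc[doc_id][chunk_idx] = text  # also drops duplicate hits

    merged = []
    for doc_id, chunks in by_doc.items():
        run = []  # current run of consecutive chunk indices
        for idx in sorted(chunks):
            if run and idx != run[-1] + 1:
                merged.append((doc_id, run[0], run[-1],
                               " ".join(chunks[i] for i in run)))
                run = []
            run.append(idx)
        merged.append((doc_id, run[0], run[-1],
                       " ".join(chunks[i] for i in run)))
    return merged

hits = [("a", 3, "C"), ("b", 0, "X"), ("a", 2, "B"), ("a", 7, "G")]
spans = merge_adjacent_chunks(hits)
```

In this toy input, chunks 2 and 3 of document "a" become one span while chunk 7 stays separate, which keeps the prompt from presenting neighbouring passages as unrelated fragments.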

Common pitfalls

Pitfall: Evaluating only Recall@k or MRR offline and assuming production quality — offline metrics can misalign with downstream generation quality; always measure end-to-end user-facing metrics (answer accuracy, hallucination rate) and latency.
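
The offline metrics named in this pitfall are cheap to compute, which is part of why they get over-trusted. A minimal sketch of Recall@k and MRR, assuming each query has a ranked list of retrieved ids and a set of labeled relevant ids (function names are illustrative):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
r2 = recall_at_k(ranked, relevant, k=2)
mrr = mean_reciprocal_rank([(ranked, relevant), (["x", "y"], {"z"})])
```

Neither number says anything about whether the generated answer was correct, so they belong alongside end-to-end answer-quality and latency measurements rather than in place of them.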

Pitfall: Treating embedding updates as free — regenerating embeddings for millions of docs is nontrivial; failing to version embeddings or coordinate rollouts causes inconsistent retrieval behavior across users.
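
One way to make versioning concrete is to store the encoder id and embedding time with every vector and to flag anything that is off-version or past the staleness budget. This is a minimal sketch; `EmbeddingRecord`, its fields, and `stale_records` are hypothetical names, and the actual vector payload is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingRecord:
    doc_id: str
    model_id: str       # encoder version that produced the vector
    embedded_at: float  # unix timestamp in seconds

def stale_records(records, active_model_id, now, max_age_seconds):
    """Return records that need re-embedding.

    A record is stale if it came from a different encoder version or if it
    is older than the staleness budget.
    """
    return [
        r for r in records
        if r.model_id != active_model_id
        or now - r.embedded_at > max_age_seconds
    ]

records = [
    EmbeddingRecord("a", "enc-v2", 950.0),
    EmbeddingRecord("b", "enc-v1", 990.0),  # old encoder version
    EmbeddingRecord("c", "enc-v2", 800.0),  # too old
]
todo = stale_records(records, active_model_id="enc-v2", now=1000.0,
                     max_age_seconds=100.0)
```

The same `model_id` check can gate queries, so a query embedded with one encoder version is never scored against vectors from another.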

Pitfall: Over-emphasizing a single model without hybrid fallbacks — relying solely on semantic vectors can fail on exact-match, numeric, or fresh factual queries; combine lexical, vector, and metadata filters and surface provenance.
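
A simple hybrid fallback is a linear combination of lexical and vector scores. The sketch below assumes per-query score dictionaries keyed by document id and uses min-max normalization so the two score scales are comparable; the weight `alpha` and the function names are illustrative, and a learned ranker or rank-based fusion are common alternatives.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(bm25_scores, vector_scores, alpha=0.5):
    """Fuse lexical and semantic scores: alpha * vector + (1 - alpha) * BM25.

    Documents missing from one retriever contribute 0.0 for that component.
    Ties are broken by document id so the ordering is deterministic.
    """
    lex = min_max(bm25_scores)
    sem = min_max(vector_scores)
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

bm25 = {"a": 10.0, "b": 2.0}  # "a" is an exact-match hit
vec = {"b": 0.9, "c": 0.5}    # "b" is a paraphrase hit
ranking = hybrid_rank(bm25, vec, alpha=0.5)
```

With equal weights the exact-match document "a" stays competitive even though the vector retriever never returned it, which is the failure mode a vector-only stack would miss.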

Connections

Interviewers may pivot to adjacent MLE topics: fine-tuning/retraining pipelines for embedding models (data labeling, loss functions, and deployment), or model-serving SLOs & autoscaling for embedding and reranker services. They might also explore A/B testing and offline→online evaluation parity for ranking systems.
