Vector Embeddings and ANN Indexing

What's being tested

Interviewers are checking that you can design, build, and operate a production-grade embedding retrieval pipeline: how embeddings are trained, indexed, and served for low-latency nearest-neighbor lookups while balancing recall, throughput, and cost. At Netflix, that means showing you understand tradeoffs between modeling choices (losses, normalization), index algorithms (HNSW, IVF-PQ), and operational constraints (latency budgets, memory, update frequency, monitoring). Expect to justify choices quantitatively and demonstrate how to validate retrieval quality end-to-end.

Core knowledge

Vector similarity measures: inner product (dot) and cosine similarity ( $\cos(u,v) = \frac{u \cdot v}{(||u|| \, ||v||)}$ ); with normalized vectors cosine reduces to dot product. Choose metric aligned with model objective (e.g., softmax/contrastive).
Embedding training losses: contrastive / InfoNCE, triplet loss, and softmax cross-entropy with sampled negatives; temperature τ in InfoNCE controls concentration; poor negative sampling collapses embeddings.
Approximate nearest neighbor (ANN) families: HNSW (graph-based, low-latency, high-RAM), IVF (inverted file / clustering) often combined with PQ/OPQ (product quantization / optimized PQ) for compression; Annoy (forest of trees) for read-heavy, static corpora.
FAISS and other tools: FAISS supports IVF-PQ, HNSW, GPU acceleration; Annoy for memory-mapped trees; NMSLIB offers HNSW variants — know limits and typical memory tradeoffs.
Complexity & scale: HNSW search is roughly sublinear with graph hops; IVF reduces candidate set to ~ $N/k$ per probe; PQ compresses vector storage (e.g., 64 bytes per vector vs 512 bytes raw). For $N > \sim 10M$ , prefer IVF-PQ or sharded HNSW for memory scaling.
Accuracy metrics: Recall@k, MRR (mean reciprocal rank), and nDCG for ranked quality; also track latency p99, throughput (QPS), and cost-per-query. Measure recall as ground-truth exact-NN vs ANN output.
Index lifecycle: full reindexing vs incremental updates: HNSW supports online insert/delete (costly), IVF typically requires re-clustering for optimal performance. Plan for rebuild windows and stale-embedding strategies.
Hybrid retrieval patterns: dense vector recall followed by lightweight lexical/product-anchored filters and a heavier reranker (cross-encoder) to improve precision; budget compute by only reranking top-K.
Serving architecture: co-locate index shards with GPUs/CPUs; use async background reindex jobs; cache hot items; use batching for throughput while respecting latency SLOs.
Monitoring & drift detection: track query embedding distribution shifts, retrieval recall degradation, and embedding cosine-norm changes; set synthetic workloads to monitor p99 latency and Recall@k over time.
Compression & quantization tradeoffs: PQ/OPQ reduce memory but introduce quantization error; tune number of sub-quantizers and centroid counts to reach desired recall vs size.
Negative/label quality & offline evaluation: use session logs and human judgments for ground-truth; beware label sparsity and position bias when computing offline metrics.

Worked example — "Build a low-latency ANN-based recommendation retrieval system"

First 30s framing questions: clarify corpus size (N), target latency SLO (e.g., p99 < 50ms), update frequency (real-time vs nightly), hardware constraints (GPU/CPU, memory per node), and target quality metric (Recall@50 or MRR). A strong answer splits into three pillars: (1) embedding model (loss, dimensionality, normalization, online vs offline training cadence), (2) index choice & topology (choose HNSW for sub-50ms p99 at N ~10M with enough RAM, or IVF-PQ for N ≫ 100M to save memory), and (3) serving + validation (sharding, batching, top-K rerank with a cross-encoder). Explicit tradeoff: using HNSW yields better recall and latency but uses ~3–5x more RAM than IVF-PQ; if memory limited, choose IVF with PQ and increase probe count to tune recall. Operationally plan for rolling index rebuilds and ephemeral serving nodes to avoid downtime. Close: "if I had more time, I'd design an offline recall-precision grid sweep (torch/FAISS) to pick index parameters, run A/B with holdout traffic, and implement continuous drift alerting."

A second angle — "Evaluate and monitor embedding drift and ANN quality online"

Frame as measurement + remediation problem. Build synthetic probe set of representative queries and compute baseline Recall@k and latency; run nightly automated index validation against these probes. Monitor distributional statistics: vector norms, mean pairwise distances, and embedding PCA variance; sudden shifts indicate model-change or upstream data drift. For production, maintain an online shadow index (new embeddings) and compare top-K overlap vs current index; if overlap falls below threshold trigger rollback or retrain. Also instrument downstream business metrics (watch percent-click-through on retrieved candidates) and correlate with embedding changes to catch silent degradations.

Common pitfalls

Pitfall: Confusing inner product with cosine without normalizing vectors — this leads to selecting the wrong index/metric and degraded recall; always align model objective and similarity metric.

Pitfall: Designing purely for average latency and ignoring p99 — a solution that looks fast on mean may cause unacceptable tail latency under production traffic spikes.

Pitfall: Overfocusing on index microbenchmarks (synthetic recall) without validating against business labels and reranker performance — high ANN recall doesn't guarantee better final ranked quality.

Connections

Possible interviewer pivots include candidate generation and reranking architecture, online feature stores and feature freshness, and A/B testing of retrieval changes to measure downstream impact on engagement and retention.