Multimodal Embedding Storage And Pipelines

What's being tested

These prompts probe a candidate's ability to design a scalable, production-grade multimodal embedding pipeline and storage service: ingestion, modality-specific preprocessing, efficient inference (GPU/CPU batching), storage schema and retrieval/indexing, and operational tradeoffs (latency, cost, correctness). Interviewers are looking for clear system decomposition, measurable scaling/latency reasoning, idempotency and failure-handling patterns, and concrete choices for indexing and storage that match throughput/accuracy requirements. Expect to justify tradeoffs (exact vs ANN, sharding, quantization) and call out backward-compatibility and metadata/versioning practices.

Core knowledge

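Embedding dimensionality and storage cost: storage ≈ N × D × B bytes, where N = number of vectors, D = dimensions, B = bytes per component (float32 = 4). For 100M vectors at D = 768, float32 storage ≈ 100M × 768 × 4 bytes ≈ 307 GB, before any index overhead. Scalar quantization to int8 cuts this by 4×; product quantization (PQ) can compress considerably further, at some cost in recall.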
Index types — exact vs ANN: exact (brute-force, BLAS/dot optimized) guarantees correctness but is O(N) per query; ANN (e.g., HNSW, IVF+PQ) trades recall for latency, yielding sub-ms to low-ms queries at large scale.
ANN algorithms and tradeoffs: HNSW gives high recall and supports dynamic inserts but has a larger in-memory index; IVF+PQ reduces memory through quantization but needs a training step for its centroids and codebooks, should be retrained or rebuilt when the data distribution shifts, and exposes recall/latency tuning knobs (e.g., how many lists are probed).
Sharding and routing: shard by logical key (tenant_id, modality) or by vector hash; a routing layer (consistent hashing or a metadata service) must map queries to shards; queries that span shards fan out to each relevant shard, and an aggregator merges the per-shard top-k results.
Inference throughput and batching: GPU throughput rises with batch size until compute saturates or memory/latency limits are hit; choose the batch size from a measured throughput-vs-latency curve against the latency SLO. Use an async dynamic batcher and GPU pooling for cost efficiency; fall back to CPU for small requests.
Preprocessing per modality: text → tokenization, truncation/chunking; image → resizing, cropping, color-normalization, optional region proposals; video → keyframe extraction, frame sampling, transcript extraction via OCR/ASR. Preprocessing determinism matters for idempotency.
Storage schema: minimal row: {file_id, chunk_index, modality, vector(bytes/quantized), model_version, metadata(json), created_at}. Index separately by model_version to allow multi-version coexistence and reindexing.
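Idempotency & dedup: derive an idempotency_key from a content hash (e.g., SHA-256 of the normalized file/chunk) to avoid double-processing and to detect partial failures; scope the key by model_version so that re-embedding under a new model is not deduplicated away; store operation state and last-updated timestamps so retries can resume safely.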
Consistency, backfills and migrations: keep version tags; support online re-embedding by writing new model_version vectors and exposing a mapping instead of in-place overwrite; allow read-path selection between versions.
Privacy & access control: encrypt vectors at rest (KMS) and enforce per-tenant ACLs; redact PII in transcripts during preprocessing (what to redact is a DS decision, but the pipeline can enforce it).
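Latency math and SLOs: for a request rate of R requests/sec, an average of V_avg vectors (chunks) per request, and a sustained per-GPU throughput of T_g vectors/sec at the operating batch size, GPU nodes required ≈ ceil(R × V_avg / T_g). The average batch size B_avg enters through T_g and through queueing latency, not as a multiplier on load. Monitor p99 latency and provision headroom.
Monitoring & correctness: track embedding_dim, vector-norm distribution, nearest-neighbor recall over a golden set, and ingestion lag; maintain drift alarms that fire when distance distributions diverge.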
Retrieval stack integration: expose gRPC/HTTP API with typed schema and optional similarity metrics (cosine, dot, L2) and filtering metadata predicates for hybrid retrieval.
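
The storage and GPU-sizing formulas above can be sketched as a small calculator. This is a back-of-envelope sketch, not a sizing tool: it reads the load as requests/sec times vectors per request, and the function names, the 30% headroom default, and the example figures (500 requests/sec, 8 chunks per request, 2,000 vectors/sec per GPU) are all hypothetical.

```python
import math


def raw_storage_bytes(n_vectors: int, dims: int, bytes_per_component: int = 4) -> int:
    """Uncompressed footprint: N x D x B bytes (float32 means B = 4)."""
    return n_vectors * dims * bytes_per_component


def gpu_nodes_needed(requests_per_sec: float, vectors_per_request: float,
                     gpu_vectors_per_sec: float, headroom: float = 0.3) -> int:
    """GPU count for the embedding load, planning each GPU below its peak.

    headroom=0.3 means each GPU is planned at 70% of its measured throughput.
    """
    demand = requests_per_sec * vectors_per_request    # vectors/sec to embed
    usable = gpu_vectors_per_sec * (1.0 - headroom)    # planned per-GPU capacity
    return math.ceil(demand / usable)


float32_bytes = raw_storage_bytes(100_000_000, 768)        # 4 bytes per component
int8_bytes = raw_storage_bytes(100_000_000, 768, 1)        # scalar-quantized to int8
print(f"float32: {float32_bytes / 1e9:.1f} GB, int8: {int8_bytes / 1e9:.1f} GB")

# Hypothetical load: 500 requests/sec, 8 chunks each, 2,000 vectors/sec per GPU.
print("GPU nodes:", gpu_nodes_needed(500, 8, 2000))
```

In an interview, the value of this arithmetic is less the exact numbers than showing which inputs (N, D, request rate, measured per-GPU throughput) you would ask for before committing to a design.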

Worked example — Design file-embedding storage system

First 30s: clarify the expected SLOs (ingest throughput, query p99), the maximum dataset size (N), the supported modalities, and whether near-real-time updates are required. Skeleton pillars: (1) Ingestion & preprocessing: validate, chunk (content-hash ID), and apply modality-specific transforms (OCR, transcript, frame extraction); (2) Inference: a dynamic batching service with a GPU pool, model_version tagging, and CPU fallback; (3) Storage & indexing: store raw vectors plus a quantized index per shard, with metadata and idempotency keys; (4) Serving: an ANN search layer with sharding and an aggregator. Key tradeoff: HNSW for low-latency, high-recall search at the cost of memory vs. IVF+PQ for memory-sensitive deployments; call out reindexing cost and live updates. Close with practical extensions: a re-embedding/backfill process, model A/B versioning, and golden-set recall monitoring. State what you would explore with more time: exact cluster sizing, benchmark numbers (GPU ops/sec), and retention/eviction policies.
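
The serving pillar's aggregator step can be illustrated with a minimal sketch: each shard returns its own top-k hits and the aggregator merges them into a global top-k. The `Hit` tuple layout, the `merge_shard_hits` name, and the shard results are hypothetical, and scores are assumed to be similarities where higher is better.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (similarity score, vector id); higher score is better


def merge_shard_hits(per_shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard result lists into one global top-k, best score first."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Hypothetical responses from three shards after a fan-out query.
shard_a = [(0.91, "a:17"), (0.80, "a:3")]
shard_b = [(0.95, "b:42"), (0.62, "b:8")]
shard_c = [(0.88, "c:5")]

top3 = merge_shard_hits([shard_a, shard_b, shard_c], k=3)
print(top3)
```

For the merged list to match what a single global index would return, each shard has to return at least k candidates of its own; with ANN shards, the result is still only as good as each shard's recall.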

A second angle — Design a multimodal embedding service

This framing emphasizes multi-tenant isolation, privacy, and operational idempotency. The same decomposition applies, but add: per-tenant quotas, soft/hard isolation (logical namespaces vs dedicated shards), encryption keys per tenant, and billing metrics (vectors stored, queries). For inference, provide tenant-aware batching constraints (no cross-tenant data mixing on single GPU if isolation required). For indexing, allow tenant-level indexes or a shared global index with tenant filters; shared indexes save memory but complicate privacy/PII removal and reindexing. Here, the interviewer will probe how you balance efficiency (shared resources) vs isolation (compliance), and how you expose controls (model_version pinned per tenant, rolling updates).
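
One way to picture the tenant-aware batching constraint is a batcher that never places two tenants' requests in the same batch. The sketch below assumes strict isolation is required; `EmbedRequest`, `tenant_batches`, and the sample requests are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple


class EmbedRequest(NamedTuple):
    tenant_id: str
    payload: str


def tenant_batches(requests: List[EmbedRequest],
                   max_batch: int) -> Iterator[List[EmbedRequest]]:
    """Yield batches holding a single tenant's requests, at most max_batch each."""
    by_tenant: Dict[str, List[EmbedRequest]] = defaultdict(list)
    for req in requests:
        by_tenant[req.tenant_id].append(req)
    for tenant_requests in by_tenant.values():
        for start in range(0, len(tenant_requests), max_batch):
            yield tenant_requests[start:start + max_batch]


pending = [
    EmbedRequest("t1", "doc a"),
    EmbedRequest("t2", "doc b"),
    EmbedRequest("t1", "doc c"),
    EmbedRequest("t1", "doc d"),
]
batches = list(tenant_batches(pending, max_batch=2))
for batch in batches:
    print(batch[0].tenant_id, [r.payload for r in batch])
```

The cost of this isolation is visible in the output: low-volume tenants produce small batches, which lowers GPU utilization compared with mixing tenants freely.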

Common pitfalls

Pitfall: Designing only for small N. Many candidates propose brute-force in-memory search without scaling math; show capacity planning with N, D, and expected QPS and pick ANN/sharding accordingly.
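
A rough way to show why brute force stops working: an exact scan has to read every vector, so a single unbatched query on one node takes at least (bytes scanned) / (memory bandwidth). The sketch below assumes a hypothetical 100 GB/s of usable memory bandwidth and ignores compute, caching, and query batching, so treat its output as an order-of-magnitude floor rather than a benchmark.

```python
def exact_scan_seconds(n_vectors: int, dims: int, bytes_per_component: int = 4,
                       mem_bandwidth_bytes_per_sec: float = 100e9) -> float:
    """Lower bound on one exact query: time to stream all vectors from memory."""
    bytes_scanned = n_vectors * dims * bytes_per_component
    return bytes_scanned / mem_bandwidth_bytes_per_sec


for n in (1_000_000, 100_000_000):
    ms = exact_scan_seconds(n, 768) * 1000
    print(f"N={n:>11,}: about {ms:,.0f} ms per query (one node, unbatched)")
```

Under these assumptions the scan time grows linearly with N, which is the point to make before moving to ANN indexes or sharding.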

Pitfall: Ignoring idempotency and partial failures. A tempting one-shot ingestion pipeline fails on retries and duplicates; use content-hash idempotency keys and operation-state to make retries safe.
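
A minimal sketch of retry-safe ingestion, assuming an in-memory dictionary stands in for a durable operation-state table. `idempotency_key`, `IngestLog`, and the state names are hypothetical; folding model_version into the key is an assumption of this sketch, made so that re-embedding under a new model is not skipped as a duplicate.

```python
import hashlib
from typing import Callable, Dict


def idempotency_key(content: bytes, model_version: str) -> str:
    """SHA-256 over model_version and the (already normalized) content bytes."""
    digest = hashlib.sha256()
    digest.update(model_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so version and content cannot blur together
    digest.update(content)
    return digest.hexdigest()


class IngestLog:
    """In-memory stand-in for a durable operation-state table."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def process(self, content: bytes, model_version: str,
                embed: Callable[[bytes], None]) -> bool:
        """Run embed once per key; return False when the work was already done."""
        key = idempotency_key(content, model_version)
        if self._state.get(key) == "done":
            return False                    # a retry of finished work is a no-op
        self._state[key] = "in_progress"
        embed(content)                      # if this raises, state stays in_progress
        self._state[key] = "done"
        return True


calls = []
log = IngestLog()
first = log.process(b"hello world", "model-v1", calls.append)
retry = log.process(b"hello world", "model-v1", calls.append)
new_model = log.process(b"hello world", "model-v2", calls.append)
print(first, retry, new_model, len(calls))
```

A chunk left in the in_progress state marks a partial failure: a retry re-runs it, which is safe as long as the downstream vector write is an upsert keyed on the same idempotency key.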

Pitfall: Treating vectors as opaque blobs. Forgetting metadata (model_version, modality, chunk_index) makes reindexing and debugging impossible; surface these fields in the schema from the start.
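
The minimal row from the storage schema can be written down as a typed record, which makes the metadata fields hard to forget. This is a sketch under assumptions: the `EmbeddingRow` class, the little-endian float32 packing, and the example values are illustrative, not a prescribed storage format.

```python
import struct
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class EmbeddingRow:
    file_id: str
    chunk_index: int
    modality: str          # e.g. "text", "image", "video"
    vector: bytes          # packed float32 components (assumed encoding)
    model_version: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def dims(self) -> int:
        """Number of float32 components stored in the vector blob."""
        return len(self.vector) // 4


def pack_float32(values: List[float]) -> bytes:
    """Serialize floats as little-endian float32, 4 bytes per component."""
    return struct.pack(f"<{len(values)}f", *values)


row = EmbeddingRow(
    file_id="file-1",
    chunk_index=0,
    modality="text",
    vector=pack_float32([0.1, 0.2, 0.3]),
    model_version="model-v1",
    metadata={"language": "en"},
)
print(row.file_id, row.chunk_index, row.model_version, row.dims)
```

Keeping model_version and chunk_index as first-class fields, rather than burying them in the metadata JSON, is what lets a reindex job select exactly the rows it needs.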

Connections

Interviewers commonly pivot to adjacent topics: feature stores & serving (how embeddings integrate into feature pipelines) and online retrieval systems (hybrid index+filtering, rerankers). They might also ask about cost optimization (quantization, precomputation) or operational topics like canary deployments and migration strategies.
