System Design: Multimodal Embedding Pipeline for Documents, Images, and Videos
You are designing a production service that computes embeddings for user‑uploaded files across modalities—documents, images, and videos—and persists results for search and analytics.
Assume:
- Each file is at most x MB (a configurable limit enforced at the API).
- Processing is asynchronous with eventual consistency (embeddings become available after ingestion completes).
- Multi‑tenant, privacy‑sensitive environment.
Requirements
Ingestion API:
- Endpoints to submit files or URLs, specify modality and options, and poll status.
- Idempotency, retries, and concurrency controls.
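One way to reason about the idempotency and polling requirements is the minimal in-memory sketch below. It is an illustration under stated assumptions, not a prescribed design: `IngestionRegistry`, `Job`, the `idempotency_key` parameter, and the status values are all hypothetical names, and a real service would back this with a durable store.

```python
import threading
import uuid
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    tenant_id: str
    modality: str
    source: str               # file reference or URL
    status: str = "queued"    # assumed lifecycle: queued -> processing -> done | failed


class IngestionRegistry:
    """In-memory stand-in for the job table behind the API (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}     # job_id -> Job
        self._by_key = {}   # (tenant_id, idempotency_key) -> job_id

    def submit(self, tenant_id, idempotency_key, modality, source):
        if modality not in {"document", "image", "video"}:
            raise ValueError(f"unsupported modality: {modality}")
        key = (tenant_id, idempotency_key)
        with self._lock:
            # A retried request with the same key returns the original job.
            if key in self._by_key:
                return self._jobs[self._by_key[key]]
            job = Job(uuid.uuid4().hex, tenant_id, modality, source)
            self._jobs[job.job_id] = job
            self._by_key[key] = job.job_id
            return job

    def status(self, tenant_id, job_id):
        with self._lock:
            job = self._jobs.get(job_id)
        # Tenants may only poll their own jobs.
        if job is None or job.tenant_id != tenant_id:
            raise KeyError(job_id)
        return job.status
```

Scoping the idempotency key by tenant keeps two tenants that happen to reuse the same key from colliding, and the lock makes concurrent retries of the same request resolve to a single job.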
Validation and Preprocessing:
- Documents: text extraction, language detection, tokenization, chunking with overlap, boilerplate removal.
- Images: format normalization, orientation, resizing/cropping, optional multi‑crop/tiling.
- Videos: frame sampling or clip extraction, optional ASR for audio track, keyframe/shot detection.
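Two of the preprocessing steps above, chunking with overlap and uniform frame sampling, can be sketched in a few lines. The function names and parameters (`chunk_size`, `overlap`, `every_s`) are assumptions chosen for illustration. Real pipelines would typically chunk on model tokens and might prefer shot-aware sampling.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def sample_timestamps(duration_s, every_s):
    """Uniform sampling: one timestamp every `every_s` seconds, starting at 0."""
    if duration_s < 0 or every_s <= 0:
        raise ValueError("need duration_s >= 0 and every_s > 0")
    stamps = []
    i = 0
    while i * every_s < duration_s:
        stamps.append(i * every_s)
        i += 1
    return stamps
```

For example, `chunk_with_overlap(list(range(10)), 4, 1)` yields three windows in which each consecutive pair shares one token.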
Model Choices per Modality:
- Text/document embedding model.
- Image embedding model.
- Video embedding model or aggregation of frame embeddings; optional fusion with ASR text.
- Consider a single cross‑modal space vs. per‑modality spaces.
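If a video is represented by aggregating frame embeddings, one simple baseline is mean pooling followed by L2 normalization, with an optional weighted blend with an ASR text embedding. The sketch below assumes the frame and text vectors already share one cross-modal space. With per-modality spaces the blend would not be meaningful. The function names and the `text_weight` default are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame vectors into one video vector, then normalize."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    if any(len(e) != dim for e in frame_embeddings):
        raise ValueError("all frame embeddings must have the same dimension")
    n = len(frame_embeddings)
    pooled = [sum(e[i] for e in frame_embeddings) / n for i in range(dim)]
    return l2_normalize(pooled)


def fuse(video_vec, text_vec, text_weight=0.3):
    """Weighted blend of a video vector and an ASR text vector (same space assumed)."""
    if len(video_vec) != len(text_vec):
        raise ValueError("vectors must share a dimension")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    mixed = [(1 - text_weight) * v + text_weight * t
             for v, t in zip(video_vec, text_vec)]
    return l2_normalize(mixed)
```

Mean pooling discards temporal order, so a candidate design might contrast it with a dedicated video model when ordering matters for the target queries.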
Throughput Engineering:
- Batching, GPU/accelerator scheduling, and backpressure.
- Worker pools for CPU preprocessing vs. GPU inference.
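Batching and backpressure between the CPU preprocessing pool and the GPU inference pool can be illustrated with a bounded queue from the Python standard library. This is a sketch under assumptions: `collect_batch`, `try_enqueue`, and the `max_batch` and `max_wait_s` parameters are hypothetical, and a production system would more likely use a message broker or an inference server with dynamic batching.

```python
import queue
import time


def try_enqueue(q, item):
    """Non-blocking put; returning False signals backpressure to the producer."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def collect_batch(q, max_batch, max_wait_s):
    """Block for the first item, then gather more until the batch is full
    or max_wait_s has elapsed since the first item arrived."""
    batch = [q.get()]  # blocks until at least one item is available
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The queue's `maxsize` bounds memory and lets the API shed or delay load when inference falls behind, while `max_wait_s` trades a little latency for larger accelerator batches.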
Storage Design:
- How to store embeddings and metadata (vector store vs. relational/columnar DB).
- Schema, versioning, and multi‑tenancy.
- Support similarity search and metadata filters.
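To make the schema, versioning, and filtering questions concrete, the sketch below uses SQLite with a brute-force cosine scan. It is a toy under stated assumptions: the table and column names are hypothetical, and a real deployment would replace the linear scan with an approximate nearest-neighbor index in a vector store.

```python
import math
import sqlite3
import struct

# Hypothetical schema: one row per (tenant, item, model version).
SCHEMA = """
CREATE TABLE embeddings (
    tenant_id     TEXT NOT NULL,
    item_id       TEXT NOT NULL,
    modality      TEXT NOT NULL,
    model_version TEXT NOT NULL,
    vector        BLOB NOT NULL,
    PRIMARY KEY (tenant_id, item_id, model_version)
);
"""


def pack(vec):
    """Serialize a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(conn, tenant_id, modality, model_version, query, k=5):
    """Apply metadata filters in SQL, then rank the survivors by cosine similarity."""
    rows = conn.execute(
        "SELECT item_id, vector FROM embeddings "
        "WHERE tenant_id = ? AND modality = ? AND model_version = ?",
        (tenant_id, modality, model_version),
    )
    scored = [(cosine(query, unpack(blob)), item_id) for item_id, blob in rows]
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]
```

Making `tenant_id` a mandatory filter keeps one tenant's vectors out of another's results, and including `model_version` in the key lets vectors from an old and a new model coexist during a backfill.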
Data Hygiene and Robustness:
- Deduplicate near‑identical content (files, chunks, frames).
- Retries, idempotency, and exactly‑once/write‑once semantics.
- Backfills when models are updated (index rebuilds, blue/green swaps).
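Deduplication and write-once semantics can be sketched with a content-addressed key plus a shingle-based Jaccard check for near-duplicates in text. Everything here is an illustrative assumption: the key layout, the 0.8 threshold, and the `WriteOnceStore` class are hypothetical. Images and frames would normally use perceptual hashes or embedding distance instead.

```python
import hashlib


def content_key(tenant_id, data, model_version):
    """Deterministic key: the same bytes under the same model map to the same row."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{tenant_id}:{model_version}:{digest}"


def shingles(text, n=3):
    """Set of lowercase word n-grams used to compare two texts."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


class WriteOnceStore:
    """In-memory stand-in for a store with insert-if-absent semantics."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        """Return True if written, False if the key already existed."""
        if key in self._rows:
            return False
        self._rows[key] = value
        return True
```

Because the key includes the model version, a retried write is a no-op, while a backfill under a new model writes to fresh keys that can be swapped in blue/green style once complete.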
Operations:
- Monitoring, alerting, and tracing.
- Quality evaluation (offline metrics, canaries, A/B tests).
- Cost controls (batching, quantization, index compression, TTLs).
- Privacy/security (encryption, access control, retention policies).
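As one example of the cost controls listed above, symmetric int8 scalar quantization stores each embedding value in one byte instead of four, at the price of a bounded rounding error. The sketch below is a simplified illustration with hypothetical function names. Production indexes often use product quantization or library-provided codecs instead.

```python
def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    if not vec:
        raise ValueError("empty vector")
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def storage_bytes(num_vectors, dim, bytes_per_value):
    """Raw vector payload size, ignoring index and metadata overhead."""
    return num_vectors * dim * bytes_per_value
```

Under these assumptions, one million 768-dimensional vectors take 3,072,000,000 bytes as float32 and 768,000,000 bytes as int8 (plus one scale per vector), and the per-value reconstruction error is at most half the scale. Whether that error is acceptable is a question for the quality evaluation step.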