System Design: Multimodal Embedding Service for User Uploads
Context
You are designing a backend service that, for each user-uploaded asset, generates vector embeddings and stores them for search and retrieval.
- Supported asset types: documents (PDF/DOCX/TXT), images (PNG/JPEG), and videos (MP4).
- Maximum upload size per asset: x MB (configurable; assume x is a limit enforced by the API).
- The system must cover: ingestion, preprocessing, model selection, scalability, storage schema, and retrieval.
- Assume multi-tenant usage and that asynchronous processing is acceptable (embeddings ready within seconds to minutes).
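To make the type and size constraints concrete, here is a minimal sketch of the check an upload API might run before accepting an asset. The names `SUPPORTED_TYPES`, `UploadCheck`, and `check_upload` are hypothetical, and accepting the `jpg` extension as JPEG is an assumption. A real service would likely also inspect file contents (for example, magic bytes) instead of trusting the extension alone.

```python
from dataclasses import dataclass
from typing import Optional

# Extension-to-modality map based on the supported asset types listed above.
# Treating "jpg" as JPEG is an assumption, not something the prompt states.
SUPPORTED_TYPES = {
    "pdf": "document", "docx": "document", "txt": "document",
    "png": "image", "jpeg": "image", "jpg": "image",
    "mp4": "video",
}


@dataclass(frozen=True)
class UploadCheck:
    accepted: bool
    modality: Optional[str]
    reason: str


def check_upload(filename: str, size_bytes: int, max_mb: int) -> UploadCheck:
    """Accept an upload only if its type is supported and it fits the x MB limit."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    modality = SUPPORTED_TYPES.get(ext)
    if modality is None:
        return UploadCheck(False, None, f"unsupported type: {ext or 'none'}")
    if size_bytes > max_mb * 1024 * 1024:
        return UploadCheck(False, modality, "exceeds size limit")
    return UploadCheck(True, modality, "ok")
```

The returned modality could then select which preprocessing path (document, image, or video) the asynchronous pipeline uses.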
Requirements
- Ingestion: Validate and accept uploads, handle security checks, and persist raw assets.
- Preprocessing: Extract text from documents, handle OCR for scanned pages, handle frames/transcripts for videos, and normalize images/videos.
- Model Selection: Choose models for text, image, and video embeddings (and speech-to-text where needed); justify trade-offs.
- Scalability: Design for horizontal scale, efficient GPU/CPU utilization, and cost control; ensure observability and fault tolerance.
- Storage Schema: Define where raw assets, metadata, and vectors live, including partitioning and indexing choices.
- Retrieval: Support text, image, or video queries that retrieve relevant documents/images/videos; include cross-modal search.
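As one way to picture the retrieval requirement, the sketch below assumes that every modality is embedded into a single shared vector space, so a query of any type can be ranked against assets of any type. That shared space is a design assumption, not something the prompt mandates. `VectorRecord`, `cosine`, and `search` are hypothetical names, and the brute-force scan stands in for whatever approximate nearest-neighbor index a real design would choose.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class VectorRecord:
    tenant_id: str
    asset_id: str
    modality: str  # "document", "image", or "video"
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    records: Sequence[VectorRecord],
    tenant_id: str,
    query_vector: Sequence[float],
    top_k: int = 5,
) -> List[Tuple[str, str, float]]:
    """Rank one tenant's assets of any modality against the query vector."""
    scored = [
        (cosine(query_vector, r.vector), r)
        for r in records
        if r.tenant_id == tenant_id  # tenant isolation applied before ranking
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(r.asset_id, r.modality, score) for score, r in scored[:top_k]]
```

Filtering by tenant before scoring keeps results isolated per tenant; restricting results to one modality would be one more predicate in the same filter.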