Design a Text-to-Video and Video-to-Video Search System
Context
You are tasked with designing an end-to-end multimodal retrieval system that supports both:

- Text-to-video search (the user enters a text query; the system returns the most relevant video segments)
- Video-to-video search (the user provides a reference video; the system returns visually/semantically similar segments)

Assume a catalog that grows to billions of video clips/segments, with global users and strict latency, cost, and safety requirements.
Requirements
- Model architecture (the dual-encoder and re-ranking sketches after this list illustrate these points)
  - Encoders for text and video frames/clips (e.g., dual encoders)
  - Temporal aggregation for video (frame-to-clip, clip-to-video)
  - Shared embedding space, similarity function, and scoring
  - Pretraining/fine-tuning objectives (contrastive, multi-task) and negative-sampling strategies
  - Handling long videos (clip sampling, segment-level indexing)
  - Optional cross-encoder re-ranking of the top-K candidates
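
As a concrete starting point, here is a minimal PyTorch sketch of a dual-encoder setup with mean-pooled temporal aggregation and a symmetric InfoNCE contrastive loss over in-batch negatives. The encoders are simple projection layers standing in for real pretrained text/vision backbones; the names (`TextEncoder`, `VideoEncoder`, `info_nce_loss`) and dimensions are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """Maps pooled text-backbone features into the shared embedding space."""

    def __init__(self, text_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)

    def forward(self, text_feats):                     # (B, text_dim)
        return F.normalize(self.proj(text_feats), dim=-1)


class VideoEncoder(nn.Module):
    """Per-frame features -> projection -> mean pooling over time (frame-to-clip)."""

    def __init__(self, frame_dim=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frame_feats):                    # (B, T, frame_dim)
        clip = self.proj(frame_feats).mean(dim=1)      # simple temporal aggregation
        return F.normalize(clip, dim=-1)


def info_nce_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss; the other pairs in the batch act as negatives."""
    logits = text_emb @ video_emb.t() / temperature    # (B, B) cosine similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    text_enc, video_enc = TextEncoder(), VideoEncoder()
    text_feats = torch.randn(8, 768)        # e.g., pooled outputs of a text backbone
    frame_feats = torch.randn(8, 16, 1024)  # 16 sampled frames per clip
    loss = info_nce_loss(text_enc(text_feats), video_enc(frame_feats))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

As the catalog grows, hard-negative mining (e.g., visually similar clips from other videos) would typically augment or replace the plain in-batch negatives.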
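For the optional re-ranking stage, one possible shape is a cross-encoder that jointly scores each (query, candidate segment) pair returned by the ANN stage. The sketch below uses a toy MLP over concatenated query/segment features purely to show the data flow; a production re-ranker would be a pretrained multimodal model, and `rerank` is a hypothetical helper.

```python
import torch
import torch.nn as nn


class CrossEncoder(nn.Module):
    """Toy joint scorer over (query text features, candidate segment frames)."""

    def __init__(self, text_dim=768, frame_dim=1024, hidden=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(text_dim + frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text_feats, frame_feats):        # (B, text_dim), (B, T, frame_dim)
        joint = torch.cat([text_feats, frame_feats.mean(dim=1)], dim=-1)
        return self.scorer(joint).squeeze(-1)          # one relevance score per pair


def rerank(cross_encoder, query_feats, candidates, top_n=10):
    """candidates: list of (segment_id, frame_feats); query_feats: (1, text_dim)."""
    with torch.no_grad():
        frame_batch = torch.stack([f for _, f in candidates])   # (K, T, frame_dim)
        text_batch = query_feats.expand(len(candidates), -1)    # repeat the query K times
        scores = cross_encoder(text_batch, frame_batch)
    order = torch.argsort(scores, descending=True)[:top_n]
    return [(candidates[i][0], scores[i].item()) for i in order.tolist()]


if __name__ == "__main__":
    ce = CrossEncoder()
    query = torch.randn(1, 768)
    candidates = [(i, torch.randn(16, 1024)) for i in range(50)]  # top-50 from the ANN stage
    print(rerank(ce, query, candidates, top_n=5))
```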
- Infrastructure (see the indexing and query-routing sketches after this list)
  - Offline/online feature extraction pipelines
  - Ingestion and metadata enrichment (e.g., ASR, OCR, tags)
  - Indexing in a vector store (HNSW/IVF, PQ/OPQ, sharding, replication)
  - Storage layout for thumbnails and short preview clips
  - Query routing and aggregation across shards
  - Latency and cost targets; caching strategies
  - Scalability to billions of clips/segments
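
On the indexing side, a compressed IVF-PQ index is one common way to keep billions of segment embeddings searchable within memory and latency budgets. The snippet below is a small FAISS sketch on random data; the corpus size and the `nlist`, `m`, and `nprobe` values are toy settings, and a real deployment would tune them per shard and keep an external mapping from row ids to (video_id, start, end).

```python
import numpy as np
import faiss

d = 256                                      # embedding dimension of the shared space
n_segments = 50_000                          # toy corpus; one row per clip/segment
nlist, m, nbits = 256, 16, 8                 # IVF cells; PQ: 16 sub-vectors x 8 bits

xb = np.random.rand(n_segments, d).astype("float32")
faiss.normalize_L2(xb)                       # cosine similarity == inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                              # learn coarse centroids + PQ codebooks
index.add(xb)                                # row ids map to segment metadata externally
index.nprobe = 8                             # IVF cells scanned per query (recall/latency knob)

xq = np.random.rand(1, d).astype("float32")  # a text or reference-video embedding
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 50)           # top-50 candidate segments for re-ranking
print(ids[0][:10], scores[0][:10])
```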
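For query routing, each query typically fans out to every shard and the per-shard top-k lists are merged into a single global top-k. The sketch below uses a hypothetical `InMemoryShard` class as a stand-in for a remote ANN shard behind an RPC.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class InMemoryShard:
    """Toy stand-in for a remote ANN shard; holds a slice of the segment corpus."""

    def __init__(self, ids, embeddings):
        self.ids = ids                        # segment ids owned by this shard
        self.emb = embeddings                 # (n, d) embedding matrix

    def search(self, query_emb, k):
        scores = self.emb @ query_emb         # brute-force scoring for the sketch
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.ids[i]) for i in top]


def fan_out_search(shards, query_emb, k=50):
    """Scatter the query to every shard in parallel, then merge local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: s.search(query_emb, k), shards)
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shards = [InMemoryShard(list(range(i * 1000, (i + 1) * 1000)),
                            rng.standard_normal((1000, 256)))
              for i in range(4)]
    print(fan_out_search(shards, rng.standard_normal(256), k=5))
```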
- Monitoring, safety, and evaluation (see the offline-metrics sketch after this list)
  - System and model monitoring (SLOs, drift, recall)
  - Abuse/content safety (NSFW, spam, PII)
  - Evaluation via offline metrics and online A/B tests
  - Relevance feedback and incremental model updates
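
For the offline side of evaluation, retrieval quality is commonly tracked with Recall@K and MRR over a labeled set of (query, relevant segments) pairs. The helpers below are a minimal sketch; `retrieve_fn` is a hypothetical callable that wraps the full retrieval stack and returns ranked segment ids.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant segments that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)


def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result; 0 if none is retrieved."""
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(labeled_queries, retrieve_fn, k=10):
    """labeled_queries: iterable of (query, relevant_ids); retrieve_fn(query) -> ranked ids."""
    recalls, mrrs = [], []
    for query, relevant_ids in labeled_queries:
        ranked = retrieve_fn(query)
        recalls.append(recall_at_k(ranked, relevant_ids, k))
        mrrs.append(mean_reciprocal_rank(ranked, set(relevant_ids)))
    n = max(len(recalls), 1)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(mrrs) / n}
```

The same labeled set can then serve as a fixed regression suite before each incremental model update, with online A/B tests as the final gate.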