Design a Text-to-Video and Video-to-Video Search System
Context
You are tasked with designing an end-to-end multimodal retrieval system that supports both:

- Text-to-video search (the user enters a text query; the system returns the most relevant video segments)
- Video-to-video search (the user provides a reference video; the system returns visually/semantically similar segments)

Assume a catalog that grows to billions of video clips/segments, with global users and strict latency, cost, and safety requirements.
Requirements
- Model architecture (the dual-encoder and re-ranking sketches after this list illustrate these points)
  - Encoders for text and video frames/clips (e.g., dual encoders)
  - Temporal aggregation for video (frame-to-clip, clip-to-video)
  - Shared embedding space, similarity function, and scoring
  - Pretraining/fine-tuning objectives (contrastive, multi-task) and negative-sampling strategies
  - Handling long videos (clip sampling, segment-level indexing)
  - Optional cross-encoder re-ranking of the top-K candidates
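
As a concrete starting point, here is a minimal PyTorch sketch of a dual-encoder setup with mean-pooled temporal aggregation and a symmetric InfoNCE contrastive loss over in-batch negatives. The encoders are simple projection layers standing in for real pretrained text/vision backbones; the names (`TextEncoder`, `VideoEncoder`, `info_nce_loss`) and dimensions are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextEncoder(nn.Module):
    """Maps pooled text-backbone features into the shared embedding space."""

    def __init__(self, text_dim=768, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)

    def forward(self, text_feats):                     # (B, text_dim)
        return F.normalize(self.proj(text_feats), dim=-1)


class VideoEncoder(nn.Module):
    """Per-frame features -> projection -> mean pooling over time (frame-to-clip)."""

    def __init__(self, frame_dim=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, frame_feats):                    # (B, T, frame_dim)
        clip = self.proj(frame_feats).mean(dim=1)      # simple temporal aggregation
        return F.normalize(clip, dim=-1)


def info_nce_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss; the other pairs in the batch act as negatives."""
    logits = text_emb @ video_emb.t() / temperature    # (B, B) cosine similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    text_enc, video_enc = TextEncoder(), VideoEncoder()
    text_feats = torch.randn(8, 768)        # e.g., pooled outputs of a text backbone
    frame_feats = torch.randn(8, 16, 1024)  # 16 sampled frames per clip
    loss = info_nce_loss(text_enc(text_feats), video_enc(frame_feats))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

As the catalog grows, hard-negative mining (e.g., visually similar clips from other videos) would typically augment or replace the plain in-batch negatives.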
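For the optional re-ranking stage, one possible shape is a cross-encoder that jointly scores each (query, candidate segment) pair returned by the ANN stage. The sketch below uses a toy MLP over concatenated query/segment features purely to show the data flow; a production re-ranker would be a pretrained multimodal model, and `rerank` is a hypothetical helper.

```python
import torch
import torch.nn as nn


class CrossEncoder(nn.Module):
    """Toy joint scorer over (query text features, candidate segment frames)."""

    def __init__(self, text_dim=768, frame_dim=1024, hidden=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(text_dim + frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text_feats, frame_feats):        # (B, text_dim), (B, T, frame_dim)
        joint = torch.cat([text_feats, frame_feats.mean(dim=1)], dim=-1)
        return self.scorer(joint).squeeze(-1)          # one relevance score per pair


def rerank(cross_encoder, query_feats, candidates, top_n=10):
    """candidates: list of (segment_id, frame_feats); query_feats: (1, text_dim)."""
    with torch.no_grad():
        frame_batch = torch.stack([f for _, f in candidates])   # (K, T, frame_dim)
        text_batch = query_feats.expand(len(candidates), -1)    # repeat the query K times
        scores = cross_encoder(text_batch, frame_batch)
    order = torch.argsort(scores, descending=True)[:top_n]
    return [(candidates[i][0], scores[i].item()) for i in order.tolist()]


if __name__ == "__main__":
    ce = CrossEncoder()
    query = torch.randn(1, 768)
    candidates = [(i, torch.randn(16, 1024)) for i in range(50)]  # top-50 from the ANN stage
    print(rerank(ce, query, candidates, top_n=5))
```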
- Infrastructure (see the indexing and query-routing sketches after this list)
  - Offline/online feature extraction pipelines
  - Ingestion and metadata enrichment (e.g., ASR, OCR, tags)
  - Indexing in a vector store (HNSW/IVF, PQ/OPQ, sharding, replication)
  - Storage layout for thumbnails and short preview clips
  - Query routing and aggregation across shards
  - Latency and cost targets; caching strategies
  - Scalability to billions of clips/segments
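
On the indexing side, a compressed IVF-PQ index is one common way to keep billions of segment embeddings searchable within memory and latency budgets. The snippet below is a small FAISS sketch on random data; the corpus size and the `nlist`, `m`, and `nprobe` values are toy settings, and a real deployment would tune them per shard and keep an external mapping from row ids to (video_id, start, end).

```python
import numpy as np
import faiss

d = 256                                      # embedding dimension of the shared space
n_segments = 50_000                          # toy corpus; one row per clip/segment
nlist, m, nbits = 256, 16, 8                 # IVF cells; PQ: 16 sub-vectors x 8 bits

xb = np.random.rand(n_segments, d).astype("float32")
faiss.normalize_L2(xb)                       # cosine similarity == inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                              # learn coarse centroids + PQ codebooks
index.add(xb)                                # row ids map to segment metadata externally
index.nprobe = 8                             # IVF cells scanned per query (recall/latency knob)

xq = np.random.rand(1, d).astype("float32")  # a text or reference-video embedding
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 50)           # top-50 candidate segments for re-ranking
print(ids[0][:10], scores[0][:10])
```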
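For query routing, each query typically fans out to every shard and the per-shard top-k lists are merged into a single global top-k. The sketch below uses a hypothetical `InMemoryShard` class as a stand-in for a remote ANN shard behind an RPC.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

import numpy as np


class InMemoryShard:
    """Toy stand-in for a remote ANN shard; holds a slice of the segment corpus."""

    def __init__(self, ids, embeddings):
        self.ids = ids                        # segment ids owned by this shard
        self.emb = embeddings                 # (n, d) embedding matrix

    def search(self, query_emb, k):
        scores = self.emb @ query_emb         # brute-force scoring for the sketch
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.ids[i]) for i in top]


def fan_out_search(shards, query_emb, k=50):
    """Scatter the query to every shard in parallel, then merge local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: s.search(query_emb, k), shards)
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged, key=lambda hit: hit[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shards = [InMemoryShard(list(range(i * 1000, (i + 1) * 1000)),
                            rng.standard_normal((1000, 256)))
              for i in range(4)]
    print(fan_out_search(shards, rng.standard_normal(256), k=5))
```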
- Monitoring, safety, and evaluation (see the offline-metrics sketch after this list)
  - System and model monitoring (SLOs, drift, recall)
  - Abuse/content safety (NSFW, spam, PII)
  - Evaluation via offline metrics and online A/B tests
  - Relevance feedback and incremental model updates
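
For the offline side of evaluation, retrieval quality is commonly tracked with Recall@K and MRR over a labeled set of (query, relevant segments) pairs. The helpers below are a minimal sketch; `retrieve_fn` is a hypothetical callable that wraps the full retrieval stack and returns ranked segment ids.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant segments that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)


def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result; 0 if none is retrieved."""
    for rank, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(labeled_queries, retrieve_fn, k=10):
    """labeled_queries: iterable of (query, relevant_ids); retrieve_fn(query) -> ranked ids."""
    recalls, mrrs = [], []
    for query, relevant_ids in labeled_queries:
        ranked = retrieve_fn(query)
        recalls.append(recall_at_k(ranked, relevant_ids, k))
        mrrs.append(mean_reciprocal_rank(ranked, set(relevant_ids)))
    n = max(len(recalls), 1)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(mrrs) / n}
```

The same labeled set can then serve as a fixed regression suite before each incremental model update, with online A/B tests as the final gate.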