Design a scalable video search system
Company: Runway
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: hard
Interview Round: Technical Screen
Design an end-to-end video search system that supports text-to-video and video-to-video retrieval. Specify the model architecture (e.g., dual encoders for video frames/clips and text, temporal aggregation strategies, pretraining/fine-tuning objectives, negative sampling, handling long videos via clip sampling), the shared embedding space, and relevance scoring. For infrastructure, describe the offline/online feature extraction pipeline, ingestion, metadata enrichment, indexing in a vector store (e.g., HNSW/IVF, sharding and replication), storage layout for thumbnails/clips, query routing, latency/cost targets, caching, and scalability for billions of clips. Include monitoring, abuse/content safety, and evaluation via offline metrics and online A/B tests; discuss relevance feedback and incremental model updates.
Quick Answer: This question evaluates expertise in multimodal retrieval and large-scale ML system design, covering model and embedding architectures, temporal aggregation, vector indexing, and the operational pipelines behind text-to-video and video-to-video search.
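To make the shared embedding space and relevance scoring concrete, here is a minimal sketch, not a reference design. It assumes per-frame embeddings have already been produced by some video encoder and that the query vector comes from a text or video encoder trained into the same space. The function names (`mean_pool`, `search`), the mean-pooling choice for temporal aggregation, and the exhaustive scan are illustrative assumptions; at billions of clips the scan would be replaced by an approximate index such as HNSW or IVF, sharded and replicated.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def mean_pool(frame_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Temporal aggregation (assumed strategy): average the per-frame
    embeddings of a clip, then normalize the result."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    n = len(frame_embeddings)
    pooled = [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]
    return l2_normalize(pooled)


def search(query: Sequence[float], index: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Exhaustive top-k by cosine similarity. The query may come from the text
    encoder (text-to-video) or the video encoder (video-to-video), as long as
    both map into the same space. Index vectors are assumed unit-normalized."""
    q = l2_normalize(query)
    scored = [
        (clip_id, sum(a * b for a, b in zip(q, emb)))
        for clip_id, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: two clips with two 2-dimensional frame embeddings each.
index = {
    "clip_a": mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    "clip_b": mean_pool([[0.0, 1.0], [0.1, 0.9]]),
}
top = search([1.0, 0.1], index, k=1)
```

A long video would typically be split into several sampled clips, each indexed as its own vector, with per-clip scores merged back to the video level at query time. Mean pooling is the simplest aggregation to reason about; attention-based or transformer pooling is a common alternative when temporal order matters.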