This question evaluates expertise in ML system design for multimodal large models, covering deployment under compute and GPU memory constraints as well as large-scale retrieval and processing of video captions and embeddings.
You are deploying a multimodal large model that generates captions for videos.
Prompt: Describe how you would design the end-to-end system (modeling + serving) so this capability runs reliably under constrained compute and VRAM.
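A strong answer usually covers admission control on the serving side: with a fixed VRAM budget, requests must be micro-batched so activation memory never exceeds what is left after model weights. The sketch below is a hypothetical illustration (the budget numbers, `mb_per_frame` cost model, and `CaptionRequest` type are assumptions, not part of the question) of greedy batch planning under such a budget:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CaptionRequest:
    video_id: str
    n_frames: int  # frames sampled from the video for captioning

def plan_batches(requests: List[CaptionRequest],
                 vram_budget_mb: int = 16000,  # assumed total GPU memory
                 model_mb: int = 9000,         # assumed resident weight footprint
                 mb_per_frame: int = 50        # assumed activation cost per frame
                 ) -> List[List[CaptionRequest]]:
    """Greedily pack requests into micro-batches, flushing a batch
    whenever the estimated activation memory would exceed the VRAM
    left over after model weights."""
    free_mb = vram_budget_mb - model_mb
    batches: List[List[CaptionRequest]] = []
    current: List[CaptionRequest] = []
    used = 0
    for req in requests:
        cost = req.n_frames * mb_per_frame
        if current and used + cost > free_mb:
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += cost
    if current:
        batches.append(current)
    return batches
```

In a real deployment the per-frame cost would be profiled rather than assumed, and the same budget reasoning extends to quantizing weights (to shrink `model_mb`) and capping sampled frames per video.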
Assume you already have the video captions and embeddings produced by the system above.
A brand advertiser wants to quickly find videos relevant to a query (text and/or example creative) and then apply a watermark to matched videos.
Prompt: How would you design and optimize the retrieval + processing pipeline to make this search and watermarking fast at scale? Include indexing, filtering/ranking, caching, and system trade-offs.
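A common shape for the answer is filter-then-rank: apply cheap metadata predicates first (language, region, safety flags) to shrink the candidate set, then rank survivors by embedding similarity before queueing matches for watermarking. The sketch below is a minimal in-memory illustration of that ordering using brute-force cosine similarity; at real scale the ranking step would be served by an ANN index (e.g., FAISS or ScaNN) instead. The `doc_meta` schema and `allowed_langs` filter are assumptions for the example:

```python
from typing import Dict, List, Set, Tuple
import numpy as np

def filter_then_rank(query_vec: np.ndarray,
                     doc_vecs: np.ndarray,
                     doc_meta: List[Dict],
                     allowed_langs: Set[str],
                     k: int = 5) -> List[Tuple[int, float]]:
    """Return up to k (doc_index, cosine_score) pairs, filtering by
    metadata before computing any similarity scores."""
    # 1) Cheap metadata pre-filter: avoids scoring ineligible videos.
    keep = np.array([m["lang"] in allowed_langs for m in doc_meta])
    idx = np.nonzero(keep)[0]
    if idx.size == 0:
        return []
    # 2) Cosine-similarity ranking on the surviving candidates only.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs[idx]
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return [(int(idx[i]), float(scores[i])) for i in order]
```

The trade-off to discuss is where the filter runs: pre-filtering inside the ANN index keeps recall high but complicates the index, while post-filtering is simpler but can starve results when the filter is selective; caching query embeddings and already-watermarked outputs avoids repeating both stages for popular queries.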