Scenario
You are deploying a multimodal large model that generates captions for videos.
Part A — Deployment under compute / VRAM constraints
- The model takes video (frames optional, audio optional) and outputs a text caption.
- You must meet latency/throughput goals while staying within tight compute and GPU memory (VRAM) limits.
Prompt: Describe how you would design the end-to-end system (modeling + serving) to reliably deploy this capability under constrained compute/VRAM.
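A useful way to ground Part A is back-of-envelope VRAM budgeting before picking a serving design. The sketch below uses entirely hypothetical figures (a 7B-parameter captioner with 4-bit weights, an fp16 KV cache across 32 layers of hidden size 4096, ~1K tokens per request); the numbers are illustrative, not from the scenario:

```python
def max_batch_size(vram_gb: float,
                   params_b: float = 7.0,
                   bits_per_weight: int = 4,
                   # k+v caches, 2 bytes each (fp16), 32 layers, hidden size 4096
                   kv_bytes_per_token: int = 2 * 2 * 32 * 4096,
                   tokens_per_request: int = 1024,
                   overhead_gb: float = 2.0) -> int:
    """Estimate how many concurrent requests fit after weights + fixed overhead."""
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    free_gb = vram_gb - weights_gb - overhead_gb
    per_request_gb = kv_bytes_per_token * tokens_per_request / 1e9
    return max(0, int(free_gb // per_request_gb))
```

For example, with these assumptions a 24 GB card holds ~3.5 GB of quantized weights, leaving room for roughly 34 concurrent requests' worth of KV cache; that estimate then drives batching, quantization, and frame-sampling choices.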
Part B — Fast retrieval for brand ads + watermarking
Assume you already have:
- A caption for each video (possibly multiple captions per video)
- An embedding vector per video (or per caption)
A brand advertiser wants to quickly find videos relevant to a query (text and/or example creative) and then apply a watermark to matched videos.
Prompt: How would you design and optimize the retrieval + processing pipeline to make this search and watermarking fast at scale? Include indexing, filtering/ranking, caching, and system trade-offs.
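For Part B, the core retrieval step can be sketched as filtered top-k similarity search. This is a minimal exact-scan version assuming L2-normalized per-video embeddings in a NumPy matrix and a metadata predicate for pre-filtering (e.g., brand-safety or language); at scale the scan would be replaced by an ANN index such as IVF or HNSW, but the interface stays the same:

```python
import numpy as np

def search(query_vec, embeddings, metadata, allow, k=5):
    """Return indices of the top-k videos by cosine similarity,
    restricted to rows whose metadata passes the `allow` predicate."""
    mask = np.array([allow(m) for m in metadata])
    candidates = np.where(mask)[0]
    # Dot product equals cosine similarity when embeddings are normalized.
    sims = embeddings[candidates] @ query_vec
    order = np.argsort(-sims)[:k]
    return candidates[order].tolist()
```

Matched indices then feed the watermarking stage, which can run asynchronously (e.g., a work queue keyed by video ID) so that slow video processing never blocks the search path.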