This question evaluates understanding of multi-modal representation learning and retrieval systems. It covers CLIP-style joint image–text encoders, contrastive loss families, the limitations of embedding-based retrieval, alternative retrieval paradigms, and issues such as popularity bias.
Answer the following ML questions in the context of multi-modal (text–video/image) retrieval: