Design Prompt: Multimodal Text–Image Retrieval and Classification
Context
You are building a production system that uses both text (titles/descriptions/queries) and images to support:
- Cross-modal retrieval (e.g., text-to-image and image-to-text search)
- Item classification (e.g., product category or attributes)

Assume you have paired (image, text) examples for training and must serve at scale with low latency.
Requirements
- Encoders
  - Specify the encoder architecture for each modality (text and image), the embedding dimensionality, and any projection layers (see the dual-encoder sketch after this list).
- Fusion strategy
  - Choose and justify early fusion, late fusion, or cross-attention (or a hybrid), and describe how features are combined (a gated late-fusion sketch follows the list).
- Training objectives
  - Include contrastive alignment for retrieval and a joint classification objective; detail the loss functions (see the joint-loss sketch below).
- Missing or noisy modalities
  - Describe how to train for, and handle, cases where a modality is missing or low quality at inference time (see the modality-dropout sketch below).
- Data alignment and augmentation
  - Explain how to align training data across modalities and how to augment it to improve robustness (an augmentation-pipeline sketch follows the list).
- Evaluation metrics
  - Define metrics for retrieval and classification, plus robustness and efficiency metrics (see the Recall@K sketch below).
- Scalability and serving latency
  - Discuss training/inference scale-out, approximate nearest-neighbor search, batching, quantization, and latency budgets (see the ANN index sketch below).
- Fine-tuning on new domains
  - Outline approaches for rapid domain adaptation with limited labels (a linear-probe sketch closes the list of sketches).
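Illustrative sketches
The sketches below are reference points for the expected depth of a design, not prescribed answers; every module name, dimensionality, and hyperparameter in them is an assumption. First, a minimal dual-encoder with per-modality projection heads into a shared embedding space, in PyTorch, with Identity stand-ins where pretrained backbones would go:

```python
# Minimal dual-encoder sketch (PyTorch). Backbones are Identity stand-ins for
# pretrained text/vision models; the 512-dim shared space is an assumption.
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, text_dim=768, image_dim=1024, embed_dim=512):
        super().__init__()
        self.text_backbone = nn.Identity()    # placeholder: pretrained text encoder
        self.image_backbone = nn.Identity()   # placeholder: pretrained vision encoder
        # Projection heads map each modality into the shared embedding space.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.image_proj = nn.Linear(image_dim, embed_dim)

    def encode_text(self, text_feats):
        z = self.text_proj(self.text_backbone(text_feats))
        return F.normalize(z, dim=-1)  # unit norm so dot product = cosine similarity

    def encode_image(self, image_feats):
        z = self.image_proj(self.image_backbone(image_feats))
        return F.normalize(z, dim=-1)
```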
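One hybrid fusion option among those the prompt names is gated late fusion, where a learned gate mixes the two unimodal embeddings; the single-layer gate here is an illustrative choice:

```python
# Gated late fusion: a learned sigmoid gate mixes the two modality embeddings
# per dimension. One option among early fusion, late fusion, and cross-attention.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, text_emb, image_emb):
        g = self.gate(torch.cat([text_emb, image_emb], dim=-1))
        return g * text_emb + (1.0 - g) * image_emb  # convex per-dimension mix
```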
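A common instantiation of the training objectives is a CLIP-style symmetric InfoNCE loss for cross-modal alignment plus a weighted cross-entropy classification term; the temperature and loss weight below are assumed hyperparameters:

```python
# Symmetric contrastive (InfoNCE) loss for retrieval alignment, plus a joint
# classification term. temperature and lambda_cls are assumed hyperparameters.
import torch
import torch.nn.functional as F

def joint_loss(text_emb, image_emb, class_logits, labels,
               temperature=0.07, lambda_cls=1.0):
    # Embeddings are assumed unit-normalized, so the matmul gives cosine similarity.
    logits = text_emb @ image_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +          # text -> image
                   F.cross_entropy(logits.t(), targets)) / 2   # image -> text
    classification = F.cross_entropy(class_logits, labels)
    return contrastive + lambda_cls * classification
```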
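For missing or noisy modalities, a standard training-time technique is modality dropout: randomly suppressing one modality per example so the model stays usable on single-modality inputs. The drop probability is an assumption:

```python
# Modality dropout: per-example, randomly zero out one modality during training
# so the model tolerates a missing modality at inference. p_drop is an assumption.
import torch

def modality_dropout(text_emb, image_emb, p_drop=0.15):
    b = text_emb.size(0)
    drop_text = torch.rand(b, device=text_emb.device) < p_drop
    # Never drop both modalities for the same example.
    drop_image = (torch.rand(b, device=image_emb.device) < p_drop) & ~drop_text
    text_emb = text_emb * (~drop_text).unsqueeze(-1)
    image_emb = image_emb * (~drop_image).unsqueeze(-1)
    return text_emb, image_emb
```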
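On the augmentation side, the image branch is often covered with standard stochastic transforms, while text-side augmentation (e.g., random token dropout) is handled separately; a torchvision pipeline with illustrative parameter values:

```python
# Image-side augmentation pipeline (torchvision); all parameter values here
# are illustrative starting points, not tuned settings.
from torchvision import transforms

image_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
```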
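For retrieval evaluation, Recall@K over held-out paired embeddings is the usual headline metric; a minimal text-to-image version, assuming row i of each matrix forms a ground-truth pair:

```python
# Recall@K for text-to-image retrieval. Row i of text_emb and image_emb are
# assumed to be a ground-truth pair, and embeddings are unit-normalized.
import torch

def recall_at_k(text_emb, image_emb, k=10):
    sims = text_emb @ image_emb.t()                    # (N, N) similarity matrix
    topk = sims.topk(k, dim=1).indices                 # top-k candidates per query
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```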
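For serving-scale retrieval, an inverted-file index with product quantization trades a small recall loss for large latency and memory wins; a sketch using FAISS (assuming it is available), with illustrative nlist/m/nbits/nprobe values:

```python
# Approximate nearest-neighbor serving sketch with FAISS IVF-PQ.
# nlist, m, nbits, and nprobe are illustrative; tune against the latency budget.
import faiss
import numpy as np

d = 512                                              # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # indexed image embeddings
faiss.normalize_L2(xb)                               # cosine via inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 lists, 64 x 8-bit codes
index.train(xb)
index.add(xb)
index.nprobe = 16                                    # query-time recall/latency knob

xq = np.random.rand(32, d).astype("float32")         # batch of query embeddings
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 10)                   # top-10 neighbors per query
```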
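Finally, for rapid domain adaptation with limited labels, the cheapest baseline is to freeze the encoders and train only a small head (a linear probe); adapters or LoRA layers are the natural next step up. The learning rate and class count below are assumptions:

```python
# Linear-probe adaptation: freeze the pretrained encoder and train only a
# small classification head on the new domain's limited labels.
import torch
import torch.nn as nn

def build_probe(encoder: nn.Module, embed_dim=512, num_classes=20):
    for p in encoder.parameters():
        p.requires_grad = False                      # keep pretrained weights fixed
    head = nn.Linear(embed_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer
```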