Prompt: Design a video vision-language model (VLM) from scratch
You are asked to design an end-to-end system for building and deploying a video vision-language model that can understand videos and answer questions or follow instructions (e.g., captioning, QA, retrieval, grounding).
Requirements
Cover the full lifecycle:
- Use cases & product requirements
  - What tasks (captioning, QA, retrieval, moderation, etc.)?
  - Latency / throughput targets and deployment setting.
- Data strategy
  - Data sources (paired video-text, ASR transcripts, synthetic labels).
  - Collection, labeling, deduplication, filtering, safety/compliance.
  - Train/val/test split to prevent leakage (see the split sketch below).
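To make the leakage point concrete, below is a minimal sketch of a grouped split: clips are assigned to train/val/test by hashing their source-video id, so clips cut from the same upload never straddle splits. The `Clip` fields, the 80/10/10 ratios, and the helper names are illustrative assumptions rather than a prescribed implementation.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    source_video_id: str   # clips cut from the same upload share this id
    uploader_id: str       # optional coarser grouping to block near-duplicate re-uploads

def split_bucket(group_key: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a group key to train/val/test.

    Hashing the *group* key (not the clip id) guarantees every clip from the
    same source video lands in the same split, which is the main leakage risk
    when one long video is chopped into many training clips.
    """
    h = int(hashlib.sha256(group_key.encode("utf-8")).hexdigest(), 16)
    u = (h % 10_000) / 10_000.0  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

def assign_splits(clips: list[Clip]) -> dict[str, str]:
    # Group by source video; switch to uploader_id for an even stricter split.
    return {c.clip_id: split_bucket(c.source_video_id) for c in clips}

if __name__ == "__main__":
    clips = [
        Clip("vidA_000", "vidA", "chan1"),
        Clip("vidA_001", "vidA", "chan1"),  # same source video -> same split as the line above
        Clip("vidB_000", "vidB", "chan2"),
    ]
    print(assign_splits(clips))
```

Grouping by uploader or channel instead of source video gives a stricter split when near-duplicate re-uploads are common.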
- Model architecture
  - Video encoder choices (frame sampling, temporal modeling).
  - Language model integration (projection, cross-attention, adapters); see the sampling-and-projection sketch below.
  - Handling long videos and variable FPS.
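As one concrete instance of the lightest-weight integration path (uniform frame sampling plus a learned projection into the LM's embedding space), here is a sketch assuming PyTorch and a frozen image encoder that emits 768-d per-frame features; the dimensions, module names, and token budget are illustrative assumptions.

```python
import torch
import torch.nn as nn

def sample_frame_indices(num_frames: int, num_samples: int = 8) -> torch.Tensor:
    """Uniformly sample frame indices so long and short clips yield the same
    token budget; real systems often add temporal jitter at train time."""
    idx = torch.linspace(0, max(num_frames - 1, 0), steps=num_samples)
    return idx.round().long()

class FrameToLMProjector(nn.Module):
    """Maps per-frame vision features into the LM's embedding space.

    Stands in for the 'projection' option: a small MLP on top of a frozen
    image/video encoder, producing a fixed number of visual tokens that are
    prepended to the text tokens.
    """
    def __init__(self, vision_dim: int = 768, lm_dim: int = 4096, tokens_per_frame: int = 1):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.lm_dim = lm_dim
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim * tokens_per_frame),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, vision_dim) from a frozen encoder.
        b, t, _ = frame_feats.shape
        tokens = self.proj(frame_feats)                      # (b, t, lm_dim * k)
        return tokens.view(b, t * self.tokens_per_frame, self.lm_dim)

if __name__ == "__main__":
    idx = sample_frame_indices(num_frames=300, num_samples=8)   # e.g. a 10 s clip at 30 fps
    fake_feats = torch.randn(2, len(idx), 768)                  # placeholder encoder output
    visual_tokens = FrameToLMProjector()(fake_feats)
    print(idx.tolist(), visual_tokens.shape)                    # -> (2, 8, 4096)
```

Cross-attention (Flamingo-style) or adapters can buy better temporal reasoning at the cost of more trainable parameters and engineering; with a small team, a frozen encoder plus a projector is typically the cheapest first baseline.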
- Training plan
  - Pretraining objectives, instruction tuning, alignment (see the contrastive-loss sketch below).
  - Distributed training setup and expected bottlenecks.
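One common pretraining objective is a CLIP-style symmetric contrastive (InfoNCE) loss over pooled video and text embeddings. A minimal sketch, assuming PyTorch, in-batch negatives, and an illustrative temperature of 0.07:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim); the i-th video is matched to the i-th text.
    Every other item in the batch serves as an in-batch negative.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video -> matching text
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> matching video
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    v, t = torch.randn(4, 512), torch.randn(4, 512)
    print(clip_style_contrastive_loss(v, t).item())
```

Captioning-style next-token prediction during pretraining, and later instruction tuning on (video, instruction, response) triples, would typically sit alongside this objective.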
- Evaluation
  - Offline metrics/benchmarks for each task (see the Recall@k sketch below).
  - Robustness tests (domain shift, adversarial prompts) and safety eval.
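For the retrieval task, Recall@k is a standard offline metric. A minimal sketch, assuming PyTorch tensors of paired embeddings where row i of each matrix is a ground-truth pair:

```python
import torch

def recall_at_k(video_emb: torch.Tensor, text_emb: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@k for a paired eval set.

    video_emb, text_emb: (N, dim); row i of each is a ground-truth pair.
    Returns the fraction of queries whose true video ranks in the top k.
    """
    v = torch.nn.functional.normalize(video_emb, dim=-1)
    t = torch.nn.functional.normalize(text_emb, dim=-1)
    sims = t @ v.T                                   # (N, N): each text query vs all videos
    ranks = sims.argsort(dim=-1, descending=True)    # video indices, best match first
    gt = torch.arange(sims.size(0)).unsqueeze(1)
    # Position of the ground-truth video in each query's ranking.
    positions = (ranks == gt).float().argmax(dim=-1)
    return {f"R@{k}": (positions < k).float().mean().item() for k in ks}

if __name__ == "__main__":
    v, t = torch.randn(100, 256), torch.randn(100, 256)
    print(recall_at_k(v, t))   # random embeddings -> roughly k/100 for each R@k
```

Captioning and QA need their own metrics (e.g., CIDEr for captions, exact-match or judged accuracy for QA) on held-out benchmarks.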
- Serving & iteration
  - Inference architecture (caching, batching, quantization); see the feature-cache sketch below.
  - Observability, A/B tests, data flywheel, and rollback strategy.
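Because the vision encoder usually dominates per-request compute and users often ask several questions about the same video, caching its output is a cheap win. Below is a minimal sketch of a content-addressed LRU cache for visual features; the `encode_fn` hook standing in for frame sampling plus the encoder is an assumption for illustration.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

class VisualFeatureCache:
    """LRU cache keyed by a hash of the video content.

    Caching the encoder output (not the LM's answer) avoids staleness concerns
    while skipping the most expensive stage on repeat requests.
    """
    def __init__(self, encode_fn: Callable[[bytes], object], max_items: int = 1024):
        self.encode_fn = encode_fn       # e.g. runs frame sampling + the vision encoder
        self.max_items = max_items
        self._store: OrderedDict[str, object] = OrderedDict()

    def get_features(self, video_bytes: bytes):
        key = hashlib.sha256(video_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)          # mark as recently used
            return self._store[key]
        feats = self.encode_fn(video_bytes)       # cache miss: pay the encoder cost once
        self._store[key] = feats
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)       # evict the least recently used entry
        return feats

if __name__ == "__main__":
    calls = 0
    def fake_encoder(b: bytes):
        global calls
        calls += 1
        return len(b)                              # placeholder "features"
    cache = VisualFeatureCache(fake_encoder, max_items=2)
    cache.get_features(b"video-1"); cache.get_features(b"video-1"); cache.get_features(b"video-2")
    print(calls)                                   # -> 2: the repeat request for video-1 hit the cache
```

Request batching for the LM and weight quantization would sit alongside this cache in the serving stack.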
Assume you have a small team and limited budget; justify trade-offs.