Design a video VLM end-to-end
Company: Microsoft
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
## Prompt: Design a video vision-language model (VLM) from scratch
You are asked to design an end-to-end system for building a **video vision-language model** that understands videos, answers questions, and follows instructions across tasks such as captioning, QA, retrieval, and grounding.
### Requirements
Cover the full lifecycle:
1. **Use cases & product requirements**
- Which tasks must the model support (captioning, QA, retrieval, moderation, etc.)?
- What are the latency and throughput targets, and what is the deployment setting?
2. **Data strategy**
- Data sources (paired video-text, ASR transcripts, synthetic labels).
- Collection, labeling, deduplication, filtering, safety/compliance.
- Train/val/test split to prevent leakage.
3. **Model architecture**
- Video encoder choices (frame sampling, temporal modeling).
- Language model integration (projection, cross-attention, adapters).
- Handling long videos and variable FPS.
4. **Training plan**
- Pretraining objectives, instruction tuning, alignment.
- Distributed training setup and expected bottlenecks.
5. **Evaluation**
- Offline metrics/benchmarks for each task.
- Robustness tests (domain shift, adversarial prompts) and safety eval.
6. **Serving & iteration**
- Inference architecture (caching, batching, quantization).
- Observability, A/B tests, data flywheel, and rollback strategy.
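
To make the leakage point in the data strategy item concrete, here is a minimal sketch of one common approach: assign splits by hashing the *source video* identifier, so that every clip cut from the same upload lands in the same split. The field names (`clip_id`, `source_id`) and the 5%/5% holdout sizes are illustrative assumptions, not part of the prompt.

```python
import hashlib

def assign_split(source_id: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Deterministically map a source video ID to train/val/test.

    Hashing the source ID (not the clip ID) keeps all clips from one
    video in the same split, which avoids near-duplicate leakage.
    """
    digest = hashlib.sha256(source_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Hypothetical clip records: two clips share the same source video.
clips = [
    {"clip_id": "a_000", "source_id": "video_a"},
    {"clip_id": "a_001", "source_id": "video_a"},
    {"clip_id": "b_000", "source_id": "video_b"},
]

splits = {c["clip_id"]: assign_split(c["source_id"]) for c in clips}
```

A hash-based split is stable across reruns and needs no stored assignment table, which suits a small team. It does not catch re-uploads of the same video under different IDs, so it would normally sit after the deduplication step.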
Assume you have a small team and limited budget; justify trade-offs.
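
One trade-off that a limited budget forces early is how many frames, and therefore how many visual tokens, the language model sees per video. The sketch below shows one simple policy under stated assumptions: sample at a target rate in seconds (so variable native FPS does not matter), then cap the frame count for long videos. The function names, the 1 FPS target, the 64-frame cap, the 196 tokens per frame, and the 4x pooling factor are all illustrative assumptions.

```python
from typing import List

def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 1.0,
                         max_frames: int = 64) -> List[int]:
    """Pick frame indices uniformly in time, capped at max_frames."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        return []
    duration_s = num_frames / native_fps
    wanted = max(1, int(duration_s * target_fps))
    count = min(wanted, max_frames, num_frames)
    # Split the video into `count` equal segments and take each segment's centre.
    step = num_frames / count
    return [min(num_frames - 1, int((i + 0.5) * step)) for i in range(count)]

def visual_token_count(n_frames: int, tokens_per_frame: int = 196,
                       pool: int = 4) -> int:
    """Tokens passed to the LLM after pooling each frame's patch tokens."""
    return n_frames * (tokens_per_frame // pool)

# A 10-second clip at 30 FPS versus a 1-hour video at 30 FPS.
short_idx = sample_frame_indices(num_frames=300, native_fps=30.0)
long_idx = sample_frame_indices(num_frames=108_000, native_fps=30.0)
long_tokens = visual_token_count(len(long_idx))
```

Under these assumptions the short clip keeps one frame per second, while the hour-long video is capped at 64 frames, so its sampling rate drops to roughly one frame per minute. That loss of temporal detail is the kind of trade-off the answer should justify, for example by comparing it against clip-level retrieval or hierarchical summarization for long videos.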
Quick Answer: This question evaluates a candidate's competency in end-to-end design of video vision-language models (VLMs), covering data strategy, model architecture, training objectives, evaluation metrics, and serving and deployment considerations.