You are asked to do a deep dive (“resume grilling”) on a Vision-Language Model (VLM) project listed on your resume.
Cover the following clearly and concretely:
- Problem & scope
  - What task(s) did the VLM solve (e.g., captioning, VQA, retrieval, grounding, OCR+reasoning)?
  - What was the success criterion (offline metrics and/or product metric)? (Worked metric sketch below.)
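For example, if the task is VQA, a common offline success criterion is the VQAv2-style soft accuracy, where a prediction is scored against the ten human answers as min(#matching annotators / 3, 1). A minimal sketch of the simplified form (no official answer normalization; function names are illustrative):

```python
def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 soft accuracy: min(#annotators agreeing / 3, 1.0).

    `human_answers` is the list of (typically 10) reference answers.
    The official metric also normalizes answers (lowercasing, stripping
    articles/punctuation) and averages over annotator subsets; omitted here.
    """
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)


def split_accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Mean soft accuracy over an eval split."""
    scores = [vqa_soft_accuracy(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / max(len(scores), 1)


# "2" is matched by 5 of 10 annotators, so it receives full credit.
print(vqa_soft_accuracy(
    "2", ["2", "2", "two", "2", "3", "2", "3", "two", "3", "2"]))  # 1.0
```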
- Model architecture
  - High-level structure (vision encoder, language model, fusion mechanism).
  - Where fusion happens (early/late; cross-attention; adapters; projection layers).
  - What was frozen vs trainable. (Illustrative sketch below.)
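To ground the fusion and frozen-vs-trainable questions, here is a minimal, hypothetical LLaVA-style arrangement: a frozen vision encoder, a frozen decoder LM, and a small trainable MLP projector that maps patch features into the LLM embedding space (fusion via projection layers). All module stand-ins and dimensions are illustrative, not from any particular project:

```python
import torch
import torch.nn as nn


class VisionToLLMProjector(nn.Module):
    """Two-layer MLP mapping vision patch features into the LLM's
    token-embedding space (a common "projection layer" fusion)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        # output:      (batch, num_patches, llm_dim), prepended to text embeddings
        return self.proj(patch_feats)


def freeze(module: nn.Module) -> None:
    """Mark every parameter of `module` as non-trainable."""
    for p in module.parameters():
        p.requires_grad_(False)


# Stand-ins for a real ViT encoder and decoder LM; only the projector trains.
vision_encoder = nn.Linear(768, 1024)
language_model = nn.Linear(4096, 32000)
projector = VisionToLLMProjector()

freeze(vision_encoder)
freeze(language_model)
print([name for name, p in projector.named_parameters() if p.requires_grad])
```

Training only the projector first keeps optimization cheap and stable; later stages typically unfreeze the LLM, fully or via adapters, which is exactly the trade-off the interviewer will push on.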
- Data & distribution
  - What datasets you used (public and/or internal).
  - Label types (pairs, dialogs, preferences, bboxes, masks).
  - Data distribution and known biases (domains, languages, image types, long-tail).
  - Train/val/test split strategy and leakage prevention. (Split sketch below.)
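One concrete leakage risk worth being ready to discuss: near-duplicate images (or frames/pages from the same source) landing on both sides of the split. A hedged sketch of a group-aware split keyed on a content hash; the grouping key and field names are illustrative:

```python
import hashlib
import random


def group_key(image_bytes: bytes) -> str:
    """Group by exact content hash; in practice a perceptual hash or a
    source-document/video ID also catches near-duplicates and crops."""
    return hashlib.sha256(image_bytes).hexdigest()


def split_by_group(examples: list[dict], test_frac: float = 0.05, seed: int = 0):
    """Assign whole groups (never individual examples) to train or test,
    so duplicates of the same image cannot straddle the split."""
    groups = sorted({ex["group"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [ex for ex in examples if ex["group"] not in test_groups]
    test = [ex for ex in examples if ex["group"] in test_groups]
    return train, test


# Usage (hypothetical): each example carries its group key up front.
# examples = [{"group": group_key(img_bytes), "image": ..., "text": ...}, ...]
```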
- Training recipe
  - Objective(s): contrastive, next-token prediction, instruction tuning, RLHF/DPO, multi-task. (Contrastive-loss sketch below.)
  - Pretraining vs finetuning stages.
  - Key hyperparameters and infrastructure (batching, mixed precision, sequence length, curriculum).
  - Evaluation: what benchmarks, ablations, and error analysis.
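If one of the objectives was contrastive pretraining, its core is a symmetric InfoNCE loss over in-batch image-text pairs (the CLIP formulation). A minimal sketch assuming pre-computed embeddings; the temperature and dimensions are illustrative:

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over in-batch pairs.

    img_emb, txt_emb: (batch, dim); row i of each forms a matched pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)


# Example with random embeddings:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In-batch negatives make this loss strongly batch-size dependent, which connects directly to the batching/infrastructure bullet above.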
- End-to-end vs modular
  - Was it trained end-to-end? If not, what parts were fixed and why?
  - Trade-offs: stability, compute, data needs, and ability to adapt. (Parameter-count sketch below.)
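A quick way to quantify the modular-vs-end-to-end trade-off is the trainable parameter count under each regime, since that drives optimizer-state and activation memory. A small helper sketch; the numbers in the comment are purely hypothetical:

```python
import torch.nn as nn


def param_counts(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Hypothetically, projector-only training might update ~20M of ~7B parameters,
# while end-to-end finetuning updates everything and needs several times the
# memory for Adam states and activations.
```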
- Reasoning time / latency
  - Where inference time is spent (vision encoder, KV-cache, decoding).
  - Throughput/latency numbers and how you measured them. (Measurement sketch below.)
  - Optimizations tried (quantization, speculative decoding, caching, batching).
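For the latency numbers, be ready to explain how you measured them: GPU kernels are asynchronous, so you must synchronize before reading the clock, and percentiles are more honest than a mean. A hedged measurement sketch; `run_once` stands in for whatever forward/generate call the serving path actually makes:

```python
import time

import torch


@torch.no_grad()
def measure_latency_ms(run_once, warmup: int = 5, iters: int = 50):
    """Return (p50, p95) latency in milliseconds for a callable that
    performs one full inference (e.g. encode image + decode N tokens)."""
    for _ in range(warmup):
        run_once()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for async CUDA work before timing
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    p50 = samples[len(samples) // 2]
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return p50, p95


# Usage (hypothetical):
# p50, p95 = measure_latency_ms(lambda: model.generate(**inputs, max_new_tokens=64))
```

Breaking the same measurement into vision-encoder time, prefill, and per-token decode usually answers the "where is time spent" bullet directly.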
- Limitations & improvements
  - Known failure modes (hallucination, OCR errors, spatial reasoning, counting, bias, adversarial images). (Hallucination-probe sketch below.)
  - Concrete proposals to improve (data, architecture, training, evaluation, serving).
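For the hallucination failure mode, a simple offline probe to mention is a CHAIR-style object hallucination rate: the fraction of objects mentioned in generated text that are not in the image's ground-truth object set. A deliberately simplified sketch (real implementations use synonym lists and lemmatization; the vocabulary here is illustrative):

```python
def hallucination_rate(caption: str,
                       gt_objects: set[str],
                       vocab: set[str]) -> float:
    """Fraction of mentioned vocabulary objects absent from the image
    (CHAIR-i style). Exact word match only, for brevity."""
    words = set(caption.lower().split())
    mentioned = words & vocab
    if not mentioned:
        return 0.0
    hallucinated = mentioned - gt_objects
    return len(hallucinated) / len(mentioned)


# "frisbee" is mentioned but not in the ground truth, so 1 of 3 objects is hallucinated.
vocab = {"dog", "frisbee", "grass", "cat", "ball"}
print(hallucination_rate("a dog and a frisbee on the grass",
                         {"dog", "grass"}, vocab))  # ~0.33
```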
Answer as if in an onsite: concise, technical, and with specific examples and numbers where possible.