How do you deploy multimodal models?
Company: ByteDance
Role: Data Scientist
Category: Machine Learning
Difficulty: hard
Interview Round: Onsite
Answer the following machine learning interview prompts for a new-grad role:
1. You need to deploy a multimodal model under strict GPU compute and VRAM constraints. How would you redesign the model and serving system to reduce memory, latency, and cost while preserving acceptable quality? Discuss tradeoffs among quantization, distillation, pruning, input compression, batching, caching, and architectural choices.
2. Suppose video captions and vector embeddings have already been precomputed and stored. How would you build a fast video retrieval system on top of these assets? Explain candidate generation, indexing, approximate nearest neighbor search, reranking, freshness, and the metrics you would use to evaluate both relevance and serving performance.
3. What is overfitting in deep learning, how would you detect it, and what are the main techniques to mitigate it?
4. Explain the principle of dropout. Why does the common implementation preserve the expected activation scale between training and inference?
5. Compare common normalization methods such as Batch Normalization, Layer Normalization, Group Normalization, and RMSNorm. When is each appropriate, and how is each handled at inference time?
6. How is reinforcement learning used in LLM post-training, especially in RLHF? Describe the overall training pipeline, the optimization objective, and major failure modes or tradeoffs.
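A strong answer to question 1 usually names a concrete compression technique. Below is a minimal sketch of symmetric per-tensor int8 weight quantization (function names are illustrative, not from any specific library): weights are stored as int8 plus a single floating-point scale, cutting memory 4x versus fp32 at the cost of bounded rounding error.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 weights plus one fp scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than fp32; the rounding error stays under one step.
print(w.nbytes // q.nbytes)                                     # 4
print(float(np.abs(dequantize(q, scale) - w).max()) <= scale)   # True
```

Production deployments would typically use per-channel scales and a calibration set rather than this per-tensor min/max rule, but the memory/accuracy tradeoff it demonstrates is the one the question asks about.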
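For question 2, the core serving primitive is nearest-neighbor search over the precomputed embeddings. A brute-force cosine-similarity sketch (all names illustrative; a real system would swap the matrix multiply for an ANN index such as HNSW or IVF) shows the candidate-generation-then-rerank shape:

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize once at build time so retrieval is a single matrix multiply."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the top-k most cosine-similar stored vectors."""
    q = query / np.linalg.norm(query)
    scores = index @ q
    top = np.argpartition(-scores, k - 1)[:k]   # O(n) unordered candidate generation
    return top[np.argsort(-scores[top])]        # exact rerank of only k candidates

rng = np.random.default_rng(1)
vecs = rng.normal(size=(1000, 64))
query = vecs[42] + 0.01 * rng.normal(size=64)   # near-duplicate of item 42
print(search(build_index(vecs), query, k=5)[0])  # 42
```

The same two-stage pattern carries over when the first stage is approximate: generate cheap candidates, then spend compute reranking only those, and evaluate with recall@k for relevance plus p99 latency and QPS for serving.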
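For question 3, overfitting is detected as a growing gap between training and validation loss, and early stopping is one standard mitigation. A minimal sketch of patience-based early stopping (the function name and patience rule are illustrative):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the best epoch, once validation loss has failed to improve
    for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch
    return best_epoch

# Training loss keeps falling, but validation loss turns upward after epoch 3:
print(early_stop_epoch([1.0, 0.7, 0.5, 0.45, 0.47, 0.52, 0.60]))  # 3
```

Other mitigations worth naming in an answer: regularization (weight decay, dropout), data augmentation, and reducing model capacity.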
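Question 4's scaling property is easiest to show in code. This is the standard "inverted dropout" formulation: kept units are scaled by 1/(1-p) during training so the expected activation already matches inference, where the layer becomes the identity (a minimal numpy sketch, not any framework's actual implementation):

```python
import numpy as np

def dropout(x: np.ndarray, p: float, training: bool, rng) -> np.ndarray:
    """Inverted dropout: zero each unit with probability p, scale survivors
    by 1/(1-p) so E[output] == x in training as well as at inference."""
    if not training:
        return x                      # identity: no rescaling needed at inference
    mask = rng.random(x.shape) >= p   # keep with probability 1 - p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1_000_000,))
train_out = dropout(x, p=0.3, training=True, rng=rng)
# E[mask] = 1 - p, so E[mask / (1 - p)] = 1: the training-time mean matches
# the inference output.
print(abs(train_out.mean() - 1.0) < 0.01)  # True
```

The older alternative scales activations by (1-p) at inference instead; inverted dropout is preferred because it keeps the inference path free of any dropout-related computation.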
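For question 5, the key contrast between LayerNorm and RMSNorm is whether the mean is subtracted at all. A minimal numpy sketch of both (learned affine parameters omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (per-sample stats)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Rescale by the root-mean-square only: no mean subtraction, so it is
    cheaper, and it is the variant used in many recent LLM blocks."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x).mean())  # ~0.0: mean is removed
print(rms_norm(x).mean())    # nonzero: only the scale changes
```

Because both compute statistics per sample rather than per batch, neither needs running statistics at inference, unlike BatchNorm, which must switch to stored running mean/variance.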
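For question 6, the optimization objective most RLHF pipelines build on is PPO's clipped surrogate loss. A toy sketch of just that objective (array inputs are illustrative stand-ins for per-token log-probabilities and advantages):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate: clip the policy ratio so a single update cannot
    move the policy far from the one that generated the samples."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.minimum(unclipped, clipped).mean()  # maximize, so negate

# A token whose ratio overshoots gets no credit beyond the 1 + eps boundary:
logp_old = np.array([0.0])
advantages = np.array([1.0])
print(ppo_clip_loss(np.array([1.0]), logp_old, advantages))  # -1.2 (clipped)
```

A full answer would place this inside the pipeline (SFT, then reward-model training, then RL against the reward model with a KL penalty to the reference policy) and name failure modes such as reward hacking and distribution collapse.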
Quick Answer: This question set evaluates a new-grad data scientist's ability to deploy and optimize multimodal models under GPU/VRAM constraints, design fast retrieval on top of precomputed captions and embeddings, detect and mitigate overfitting, reason about dropout and normalization methods, and explain how reinforcement learning is used in LLM post-training. It is commonly asked to probe system-level tradeoff reasoning, metric-driven evaluation of both relevance and serving performance, and practical command of model optimization, inference-time behavior, and end-to-end deployment pipelines.