Explain challenges in training multimodal LLMs
Company: Zillow
Role: Machine Learning Engineer
Category: Machine Learning
Difficulty: medium
Interview Round: Technical Screen
## Machine Learning discussion
Answer conceptually (no code). Assume you are training or adapting a **multimodal large model** (e.g., text + image, or text + audio).
1. **What is the biggest challenge** when training multimodal foundation models? Pick 1–2 top challenges and go deep.
2. Compare a **“reasoning-focused LLM”** vs a **standard instruction/chat LLM**:
- What is different in objectives/training data?
- What changes in inference (e.g., tool use, planning, test-time compute)?
- How do you evaluate reasoning quality and reliability?
Be ready to discuss practical trade-offs: data, alignment, evaluation, cost/latency, and safety.
Quick Answer: This question evaluates your understanding of how multimodal large models are trained and adapted, and your ability to reason comparatively about model objectives, data strategies, inference behavior, evaluation, alignment, cost, latency, and safety. It tests competencies in model design and systems-level trade-off analysis.