Explain parallelism and collectives in training
Company: Amazon
Role: Machine Learning Engineer
Category: ML System Design
Difficulty: medium
Interview Round: Onsite
### Parallelism strategies and communication in large-scale training
You are designing a distributed training setup for very large neural networks that cannot fit on a single device.
Answer the following:
1. Describe the main parallelism strategies used in large-scale training (for example, data parallelism, model/tensor parallelism, and pipeline parallelism). For each, explain how it works and its pros and cons.
2. What are communication collectives (such as all-reduce, all-gather, reduce-scatter, and broadcast), and how are they used in distributed training?
3. In tensor model parallelism, explain the idea of splitting linear layers into column-parallel and row-parallel parts. What is "alternating column and row parallelism" across layers, and why is it beneficial?
Quick Answer: Data parallelism replicates the full model on every device, gives each a different slice of the batch, and synchronizes gradients with an all-reduce after each backward pass; it scales throughput well but every device must hold the entire model. Tensor (model) parallelism splits individual weight matrices across devices, reducing per-device memory at the cost of frequent activation communication inside each layer. Pipeline parallelism assigns contiguous groups of layers to different devices and streams micro-batches through them, at the cost of pipeline bubbles and more complex scheduling. Communication collectives are the group primitives that implement this synchronization: all-reduce (combine values and leave the full result on every rank), all-gather (collect shards onto every rank), reduce-scatter (combine values, then leave each rank one shard of the result), and broadcast (copy one rank's tensor to all ranks). In tensor parallelism, splitting one linear layer column-wise and the next row-wise lets the intermediate activation stay sharded, so each pair of layers needs only a single all-reduce in the forward pass (and one in the backward pass); this alternation is why the pattern is standard. Minimal code sketches of each idea follow.
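The core of data parallelism fits in a few lines. Below is a minimal sketch of one data-parallel training step using `torch.distributed`, assuming the process group is already initialized (for example via `torchrun`) and that `model`, `optimizer`, `batch`, and `loss_fn` are defined elsewhere. Production frameworks such as PyTorch DDP bucket and overlap these all-reduces with the backward pass, but the underlying logic is the same.

```python
# Hypothetical sketch of one data-parallel training step.
# Assumes dist.init_process_group(...) has already been called.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, batch, loss_fn):
    """Each rank holds a full model replica and a different shard of the batch."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # gradients are local to this rank at this point

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across all replicas, then average, so every rank
            # applies the identical update and replicas stay in sync.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()
```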
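The four collectives named in the question map directly onto `torch.distributed` calls. The sketch below is illustrative only: it assumes an initialized process group with one CUDA device per rank, and the tensor contents are placeholders chosen to make each collective's effect easy to see.

```python
# Hypothetical demo of the four main collectives on small tensors.
import torch
import torch.distributed as dist

def demo_collectives():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # all-reduce: every rank ends up with the elementwise sum over all ranks
    # (used to synchronize gradients in data parallelism).
    x = torch.full((4,), float(rank), device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # all-gather: every rank ends up with every rank's shard
    # (used to reassemble sharded parameters or activations).
    local = torch.full((4,), float(rank), device=device)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # reduce-scatter: the combined result is split so each rank keeps one shard;
    # an all-reduce is equivalent to reduce-scatter followed by all-gather.
    shards = [torch.full((4,), float(rank), device=device) for _ in range(world_size)]
    out = torch.empty(4, device=device)
    dist.reduce_scatter(out, shards, op=dist.ReduceOp.SUM)

    # broadcast: copy rank 0's tensor to every other rank
    # (used to distribute initial weights so replicas start identical).
    w = torch.randn(4, device=device) if rank == 0 else torch.empty(4, device=device)
    dist.broadcast(w, src=0)
```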
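For the column-/row-parallel split, a Megatron-style MLP block is the standard illustration. The sketch below is a simplified, hypothetical version: the class names, shapes, and `tp_size` parameter are illustrative, bias terms and the backward-pass communication are omitted, dimensions are assumed divisible by `tp_size`, and the all-reduce is issued eagerly rather than through an autograd-aware wrapper as a real implementation would do.

```python
# Hypothetical Megatron-style tensor-parallel MLP: column-parallel up-projection
# followed by row-parallel down-projection, so the block needs one all-reduce.
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        # Each rank holds a slice of the weight's output (column) dimension.
        self.linear = nn.Linear(in_features, out_features // tp_size, bias=False)

    def forward(self, x):
        # Input is replicated on every rank; output is sharded along the hidden
        # dimension. No communication is needed here.
        return self.linear(x)

class RowParallelLinear(nn.Module):
    def __init__(self, in_features, out_features, tp_size):
        super().__init__()
        # Each rank holds a slice of the weight's input (row) dimension, which
        # matches the sharded activations from the column-parallel layer.
        self.linear = nn.Linear(in_features // tp_size, out_features, bias=False)

    def forward(self, x_shard):
        # Each rank computes a partial result; a single all-reduce sums the
        # partials so every rank ends up with the full output.
        partial = self.linear(x_shard)
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial

class TensorParallelMLP(nn.Module):
    """Alternating column -> row parallelism: one all-reduce per MLP block."""
    def __init__(self, hidden, ffn_hidden, tp_size):
        super().__init__()
        self.up = ColumnParallelLinear(hidden, ffn_hidden, tp_size)
        self.act = nn.GELU()
        self.down = RowParallelLinear(ffn_hidden, hidden, tp_size)

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```

Because the column-parallel output feeds the row-parallel input in its already-sharded form, the intermediate activation never has to be gathered; that is the benefit of alternating column and row parallelism across consecutive layers.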