This question evaluates a candidate's competency in designing scalable distributed training systems, covering parallelism strategies (data, model/tensor, pipeline), communication collectives (all-reduce, all-gather, reduce-scatter, broadcast), and tensor-level layer partitioning such as column- and row-parallel splits.
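As a point of reference for the collectives listed above, here is a minimal data-parallel gradient-averaging sketch. It assumes a PyTorch environment launched with torchrun; the gloo backend, the dummy tensor shape, and the script name are illustrative choices and not part of the original question.

```python
# Launch with: torchrun --nproc_per_node=2 allreduce_sketch.py
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR env vars; gloo keeps this CPU-only.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank holds its own local gradient (a dummy tensor for illustration).
    local_grad = torch.full((4,), float(rank + 1))

    # All-reduce sums the gradients across ranks; dividing by world size
    # gives the average that every data-parallel replica applies identically.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world_size

    print(f"rank {rank}: averaged gradient = {local_grad.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```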
You are designing a distributed training setup for very large neural networks that cannot fit on a single device.
Answer the following:
1. Compare data, tensor (model), and pipeline parallelism: what does each strategy partition across devices, and what are the main trade-offs?
2. Which communication collectives (all-reduce, all-gather, reduce-scatter, broadcast) does each strategy rely on, and at which points in a training step do they occur?
3. Show how consecutive linear layers can be partitioned with a column-parallel split followed by a row-parallel split, and identify where communication is required.
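For question 3, a single-process NumPy sketch can make the column- then row-parallel pattern concrete. It simulates two shards within one process; the shapes and the two-layer structure are illustrative assumptions, and the final summation stands in for the all-reduce a real implementation would perform across devices.

```python
import numpy as np

# Simulate tensor parallelism across two "devices" in a single process.
# Reference (unsharded) computation: Z = (X @ W1) @ W2
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))      # batch of 4, hidden size 8
W1 = rng.normal(size=(8, 16))    # first linear layer
W2 = rng.normal(size=(16, 8))    # second linear layer
Z_ref = (X @ W1) @ W2

# Column-parallel split of W1: each shard holds half of the output columns,
# so each "device" computes its own slice of the intermediate activation.
W1_shards = np.split(W1, 2, axis=1)      # two (8, 8) shards
Y_shards = [X @ w for w in W1_shards]

# Row-parallel split of W2: each shard holds the rows matching its Y slice,
# producing a partial result of the full output shape on every "device".
W2_shards = np.split(W2, 2, axis=0)      # two (8, 8) shards
partials = [y @ w for y, w in zip(Y_shards, W2_shards)]

# Summing the partials is where a real implementation would all-reduce.
Z_par = sum(partials)

assert np.allclose(Z_ref, Z_par)
print("column-parallel -> row-parallel matches the unsharded result")
```

The design point the sketch illustrates: pairing a column-parallel layer with a row-parallel layer keeps the intermediate activations sharded and defers communication to a single all-reduce on the final output.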