This question evaluates understanding of GPU memory budgeting for large matrix multiplications and the comparative trade-offs between pipeline and tensor model parallelism, assessing competencies in memory sizing, numerical-precision effects (FP16/BF16), communication patterns, and performance metrics.

You are reasoning about GPU memory and parallelism for a transformer-like workload dominated by matrix multiplications.
You need to compute an output activation Y = A W, where the input activation A has shape (m, k), the weight matrix W has shape (k, n), and the output Y has shape (m, n).
Assume dtype is FP16/BF16 unless stated otherwise.
Question: Given a GPU with V bytes of available VRAM (after runtime/fragmentation overhead), can you fit the tensors A, W, and Y required for this operation in memory at once?
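A minimal sketch of the memory-budget check, assuming FP16/BF16 storage at 2 bytes per element and ignoring matmul workspace and allocator overhead. The helper name `fits_in_vram` and the example sizes are hypothetical, not from the source:

```python
def fits_in_vram(m: int, k: int, n: int, vram_bytes: int, bytes_per_elem: int = 2) -> bool:
    """Check whether A (m, k), W (k, n), and Y (m, n) fit in VRAM at once.

    Assumes FP16/BF16 storage (2 bytes per element) and ignores matmul
    workspace, allocator fragmentation, and other runtime overhead.
    """
    a_bytes = m * k * bytes_per_elem  # input activation A
    w_bytes = k * n * bytes_per_elem  # weight W
    y_bytes = m * n * bytes_per_elem  # output activation Y
    return a_bytes + w_bytes + y_bytes <= vram_bytes


# Example: m=8192, k=n=12288 needs about 0.66 GiB, so it fits in ~20 GiB of usable VRAM.
print(fits_in_vram(8192, 12288, 12288, vram_bytes=20 * 1024**3))  # True
```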
You have 2 GPUs and want to run end-to-end inference or training. How would you split the work using pipeline parallelism versus tensor model parallelism, and how do the two approaches compare in per-GPU memory, communication patterns, and performance?
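One way to make the 2-GPU comparison concrete is a rough per-GPU weight-memory count, sketched below under the same FP16/BF16 assumption. The helper `per_gpu_weight_bytes`, the scheme names, and the example layer stack are assumptions for illustration, not part of the question:

```python
def per_gpu_weight_bytes(layer_shapes, scheme, num_gpus=2, bytes_per_elem=2):
    """Per-GPU weight memory for a stack of (k, n) matmul layers.

    scheme="pipeline": contiguous blocks of whole layers per GPU.
    scheme="tensor":   every weight split column-wise across all GPUs.
    Activations, gradients, optimizer state, and comm buffers are ignored.
    """
    totals = [0] * num_gpus
    if scheme == "pipeline":
        per_stage = (len(layer_shapes) + num_gpus - 1) // num_gpus
        for i, (k, n) in enumerate(layer_shapes):
            totals[i // per_stage] += k * n * bytes_per_elem
    elif scheme == "tensor":
        for k, n in layer_shapes:
            shard = k * (n // num_gpus) * bytes_per_elem
            for g in range(num_gpus):
                totals[g] += shard
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return totals


layers = [(12288, 12288)] * 4                    # four identical square layers
print(per_gpu_weight_bytes(layers, "pipeline"))  # [603979776, 603979776]
print(per_gpu_weight_bytes(layers, "tensor"))    # [603979776, 603979776]
```

In this symmetric example the static weight footprint per GPU comes out the same for both schemes; the practical differences lie in where activations live, in communication patterns (per-layer collectives for tensor parallelism versus point-to-point activation transfers at stage boundaries for pipeline parallelism), and in pipeline bubble overhead.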