You are reasoning about GPU memory and parallelism for a transformer-like workload dominated by matrix multiplications.
Part 1: Can one matmul’s tensors fit in VRAM?
You need to compute an output activation:
- Input activation matrix A has shape (m, k)
- Weight matrix W has shape (k, n)
- Output activation matrix Y has shape (m, n)
Assume the dtype is FP16/BF16 (2 bytes per element) unless stated otherwise.
Question: Given a GPU with V bytes of available VRAM (after runtime/fragmentation overhead), can you fit the tensors required for this operation in memory at once?
- Consider at least: A, W, and Y (a back-of-the-envelope check follows this list).
- Optionally, discuss extra memory for workspace (e.g., scratch buffers used by some GEMM algorithms), alignment, and caching.
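In FP16/BF16 the three tensors together take roughly 2(mk + kn + mn) bytes, plus any GEMM workspace. A minimal sketch of that check, assuming 2-byte elements; the function name and the workspace_bytes parameter are illustrative, not part of the question:

```python
def matmul_fits_in_vram(m: int, k: int, n: int,
                        vram_bytes: int,
                        bytes_per_elem: int = 2,   # FP16/BF16
                        workspace_bytes: int = 0) -> bool:
    """Do A (m, k), W (k, n), and Y (m, n) fit in VRAM at once?"""
    total = bytes_per_elem * (m * k + k * n + m * n) + workspace_bytes
    return total <= vram_bytes

# Example: a 4096 x 8192 x 8192 GEMM in BF16 needs ~256 MiB,
# which fits comfortably in 16 GiB of free VRAM.
print(matmul_fits_in_vram(4096, 8192, 8192, vram_bytes=16 * 2**30))  # True
```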
Part 2: Two GPUs — pipeline parallelism vs tensor parallelism
You have 2 GPUs and want to run end-to-end inference or training.
- Explain how you would split the model using each of the following (a toy simulation of both splits appears after this list):
  - Pipeline Parallelism (PP) across 2 GPUs
  - Tensor Parallelism (TP) across 2 GPUs
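To ground the two splits, here is a toy single-process NumPy sketch (no real GPUs or interconnect; every name is illustrative, and the comments mark where communication would happen in a real system):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 8, 6
A = rng.standard_normal((m, k)).astype(np.float32)   # replicated input
W = rng.standard_normal((k, n)).astype(np.float32)   # layer-1 weight

# --- Tensor parallelism (column-parallel): each "GPU" keeps half of W's columns.
W0, W1 = W[:, : n // 2], W[:, n // 2 :]
Y_tp = np.concatenate([A @ W0, A @ W1], axis=1)  # the concat is an all-gather

# --- Tensor parallelism (row-parallel): shard W's rows and A's columns;
# partial products are summed (the sum is an all-reduce).
A0, A1 = A[:, : k // 2], A[:, k // 2 :]
R0, R1 = W[: k // 2, :], W[k // 2 :, :]
Y_rp = A0 @ R0 + A1 @ R1

# --- Pipeline parallelism: each "GPU" holds whole layers (stages).
W2 = rng.standard_normal((n, n)).astype(np.float32)  # layer-2 weight on GPU 1
H = A @ W        # stage 0 on GPU 0
Y_pp = H @ W2    # H crosses the GPU 0 -> GPU 1 link; stage 1 runs there

assert np.allclose(Y_tp, A @ W) and np.allclose(Y_rp, A @ W)
```

The TP branches halve the per-GPU weight footprint of every layer, while the PP branch halves the number of layers each GPU stores; that distinction drives the memory discussion below.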
- For each approach, discuss:
  - End-to-end latency and throughput (including pipeline “bubble” effects for PP; a bubble-fraction helper appears after this list)
  - Per-GPU memory usage (what is replicated vs sharded)
  - Communication patterns and costs
  - Key tradeoffs and when you would choose PP vs TP
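For the PP “bubble” discussion, a common first-order model for a GPipe-style schedule with p equal-cost stages and b microbatches puts the idle fraction at (p - 1) / (b + p - 1). A minimal helper under that assumption (the name is illustrative):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction for a GPipe-style schedule with equal-cost stages."""
    return (stages - 1) / (microbatches + stages - 1)

# With 2 stages the bubble shrinks as the batch is split more finely:
for b in (1, 2, 4, 8):
    print(b, round(pipeline_bubble_fraction(2, b), 3))
# 1 0.5   2 0.333   4 0.2   8 0.111
```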