Multi-GPU MatMul (2 GPUs): Design and Implementation
You are given two GPUs connected via NVLink or PCIe. You must compute C = A × B where:
- A is shape m × k and B is shape k × n.
- Constraint: A and B must be resident on both devices (i.e., replicated on GPU0 and GPU1).
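To make the replication constraint concrete, here is a minimal setup sketch, assuming CUDA is available, FP32 storage, and that both matrices fit in each GPU's memory; the helper name replicate_inputs and the buffer names dA/dB are illustrative, not prescribed by the problem:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Replicate A (m x k) and B (k x n) onto both devices, as the constraint requires.
// hA and hB are host pointers; dA[i]/dB[i] receive the per-GPU copies.
void replicate_inputs(const float* hA, const float* hB,
                      float* dA[2], float* dB[2],
                      size_t m, size_t k, size_t n) {
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&dA[dev], m * k * sizeof(float));
        cudaMalloc(&dB[dev], k * n * sizeof(float));
        cudaMemcpy(dA[dev], hA, m * k * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB[dev], hB, k * n * sizeof(float), cudaMemcpyHostToDevice);
    }
}
```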
Design a solution that includes:
- Data partitioning
  - How you partition the output C across the two GPUs (row/column/block tiling); a column-split sketch follows this list.
- Communication primitives
  - Which collectives or point-to-point operations you will use (e.g., all-reduce, all-gather, send/recv), and when.
- Compute scheduling
  - The GEMM tiling strategy on each GPU.
  - How you overlap compute with any required communication.
- Memory layout and buffer reuse
  - Leading dimensions, alignment, submatrix addressing, scratch/temporary buffers, and reuse.
- Numerical precision
  - Dtypes, tensor-core utilization, accumulation precision, and determinism trade-offs; see the mixed-precision sketch after this list.
- Synchronization
  - Streams, events/barriers, and how you ensure correctness.
- Aggregation and return of C
  - How you assemble and return C (to one GPU, to both GPUs, or to host) under the replication constraint for A and B; an all-gather sketch follows the assumptions note below.
- Scalability and failure handling
  - How the approach scales beyond two GPUs and what changes you would make.
  - Failure detection, retries, and graceful degradation.
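To make several of the items above concrete: one common design splits C by columns, so that with cuBLAS's column-major convention each GPU owns a contiguous slab of C, computes it from its local replicas of A and B with no inter-GPU communication, and copies it back asynchronously. A minimal sketch, assuming FP32, CUDA and cuBLAS available, and a pinned host buffer for C; the function, stream, and buffer names are illustrative:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Column-split sketch: GPU d computes the contiguous column slab C[:, c0:c0+nb] = A * B[:, c0:c0+nb].
// Column-major storage throughout (cuBLAS convention): lda = m, ldb = k, ldc = m.
// dA[d], dB[d] hold the full replicas of A and B on device d (see the setup sketch above).
// hC is a host buffer of m*n floats; it should be pinned (cudaHostAlloc) for true async copies.
void matmul_2gpu_colsplit(float* const dA[2], float* const dB[2], float* hC,
                          int m, int k, int n) {
    cublasHandle_t handle[2];
    cudaStream_t   stream[2];
    float*         dC[2];
    const float alpha = 1.0f, beta = 0.0f;

    for (int d = 0; d < 2; ++d) {
        int c0 = d * (n / 2);
        int nb = (d == 0) ? n / 2 : n - n / 2;   // columns owned by this GPU

        cudaSetDevice(d);
        cudaStreamCreate(&stream[d]);
        cublasCreate(&handle[d]);
        cublasSetStream(handle[d], stream[d]);
        cudaMalloc(&dC[d], (size_t)m * nb * sizeof(float));

        // Local GEMM on the replicated inputs: C_slab = A * B[:, c0:c0+nb].
        // B's column block starts at dB[d] + c0*k and keeps ldb = k.
        cublasSgemm(handle[d], CUBLAS_OP_N, CUBLAS_OP_N,
                    m, nb, k, &alpha,
                    dA[d], m,
                    dB[d] + (size_t)c0 * k, k,
                    &beta, dC[d], m);

        // Async copy of the finished slab into its place in host C; this D2H transfer
        // on stream[d] can overlap the other GPU's GEMM.
        cudaMemcpyAsync(hC + (size_t)c0 * m, dC[d],
                        (size_t)m * nb * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[d]);
    }

    // Synchronize both streams before hC is read, then release per-GPU resources.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        cudaFree(dC[d]);
        cublasDestroy(handle[d]);
        cudaStreamDestroy(stream[d]);
    }
}
```

A row split is equally valid, but in column-major storage each GPU's row block of C has leading dimension m and is strided, so the copy back needs cudaMemcpy2D or a compact mb × n scratch buffer; that is exactly the submatrix-addressing trade-off the memory-layout item asks about. For finer-grained overlap, cudaEventRecord / cudaStreamWaitEvent can chain per-tile GEMMs to per-tile copies instead of a single slab-sized copy at the end.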
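For the numerical-precision item, a common choice is FP16 (or BF16) inputs with FP32 accumulation on tensor cores. One hedged way to express that in the column-split sketch is to replace the per-slab cublasSgemm call with cublasGemmEx; the helper below assumes the slabs have already been converted to __half, and its name and parameters are illustrative:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_fp16.h>

// One GPU's slab GEMM with FP16 inputs and FP32 accumulation.
// dAh (m x k) and dBslab (k x nb, starting at the slab's first column) are __half
// replicas; dCslab is the __half output slab. The FP16 choice is an assumption.
void slab_gemm_fp16(cublasHandle_t handle, const __half* dAh, const __half* dBslab,
                    __half* dCslab, int m, int nb, int k) {
    const float alpha = 1.0f, beta = 0.0f;    // alpha/beta in the compute type (FP32)
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, nb, k, &alpha,
                 dAh,    CUDA_R_16F, m,
                 dBslab, CUDA_R_16F, k,
                 &beta,
                 dCslab, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F,           // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);         // let cuBLAS choose an algorithm
                                               // (tensor cores are used when available)
}
```

The determinism trade-off the item points at: changing the split, the GEMM algorithm, or any later reduction order changes the floating-point summation order and hence the low-order bits of C, so a bitwise-reproducible design has to pin those choices.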
State any minimal assumptions you need (e.g., matrices fit in GPU memory, NCCL/CUDA available) and provide enough detail that an engineer could implement the system.
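Under the stated assumption that NCCL is available, returning C replicated on both (or all) GPUs rather than on the host maps naturally onto an all-gather of the per-GPU slabs. A sketch for a single process driving nDev GPUs with one communicator per device; it assumes n is divisible by nDev so every slab has the same element count, and the buffer and parameter names are illustrative:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

// All-gather the per-GPU column slabs of C so every GPU ends up with the full m x n result.
// If n is not divisible by nDev, pad the last slab or fall back to ncclSend/ncclRecv per rank.
void gather_C_on_all_gpus(float* dCslab[], float* dCfull[], cudaStream_t stream[],
                          int nDev, size_t m, size_t n) {
    std::vector<ncclComm_t> comms(nDev);
    std::vector<int> devs(nDev);
    for (int d = 0; d < nDev; ++d) devs[d] = d;
    ncclCommInitAll(comms.data(), nDev, devs.data());   // single-process communicators

    size_t slabElems = m * (n / nDev);

    ncclGroupStart();
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        // Rank d contributes its slab; the gathered buffer holds the slabs concatenated
        // in rank order, which for column-major column slabs is exactly the full C.
        ncclAllGather(dCslab[d], dCfull[d], slabElems, ncclFloat, comms[d], stream[d]);
    }
    ncclGroupEnd();

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        ncclCommDestroy(comms[d]);
    }
}
```

The same collective generalizes directly beyond two GPUs. For failure handling, check every cudaError_t/ncclResult_t, abort a hung communicator with ncclCommAbort, and note that because A and B are replicated on every device, a lost slab of C can be recomputed on any surviving GPU without moving input data.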