Multi-GPU MatMul (2 GPUs): Design and Implementation
You are given two GPUs connected via NVLink or PCIe. You must compute C = A × B where:
- A is shape m × k and B is shape k × n.
- Constraint: A and B must be resident on both devices (i.e., replicated on GPU0 and GPU1).
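To make the replication constraint concrete, here is a minimal setup sketch, assuming CUDA is available, FP32 storage, and that both matrices fit in each GPU's memory; the helper name replicate_inputs and the buffer names dA/dB are illustrative, not prescribed by the problem:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Replicate A (m x k) and B (k x n) onto both devices, as the constraint requires.
// hA and hB are host pointers; dA[i]/dB[i] receive the per-GPU copies.
void replicate_inputs(const float* hA, const float* hB,
                      float* dA[2], float* dB[2],
                      size_t m, size_t k, size_t n) {
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&dA[dev], m * k * sizeof(float));
        cudaMalloc(&dB[dev], k * n * sizeof(float));
        cudaMemcpy(dA[dev], hA, m * k * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB[dev], hB, k * n * sizeof(float), cudaMemcpyHostToDevice);
    }
}
```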
Design a solution that includes:
- Data partitioning
  - How you partition the output C across the two GPUs (row/column/block tiling); a column-split sketch follows this list.
- Communication primitives
  - Which collectives or point-to-point operations you will use (e.g., all-reduce, all-gather, send/recv), and when.
- Compute scheduling
  - The GEMM tiling strategy on each GPU.
  - How you overlap compute with any required communication.
- Memory layout and buffer reuse
  - Leading dimensions, alignment, submatrix addressing, scratch/temporary buffers, and reuse.
- Numerical precision
  - Dtypes, tensor-core utilization, accumulation precision, and determinism trade-offs; see the mixed-precision sketch after this list.
- Synchronization
  - Streams, events/barriers, and how you ensure correctness.
- Aggregation and return of C
  - How you assemble and return C (to one GPU, to both GPUs, or to host) under the replication constraint for A and B; an all-gather sketch follows the assumptions note below.
- Scalability and failure handling
  - How the approach scales beyond two GPUs and what changes you would make.
  - Failure detection, retries, and graceful degradation.
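To make several of the items above concrete: one common design splits C by columns, so that with cuBLAS's column-major convention each GPU owns a contiguous slab of C, computes it from its local replicas of A and B with no inter-GPU communication, and copies it back asynchronously. A minimal sketch, assuming FP32, CUDA and cuBLAS available, and a pinned host buffer for C; the function, stream, and buffer names are illustrative:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Column-split sketch: GPU d computes the contiguous column slab C[:, c0:c0+nb] = A * B[:, c0:c0+nb].
// Column-major storage throughout (cuBLAS convention): lda = m, ldb = k, ldc = m.
// dA[d], dB[d] hold the full replicas of A and B on device d (see the setup sketch above).
// hC is a host buffer of m*n floats; it should be pinned (cudaHostAlloc) for true async copies.
void matmul_2gpu_colsplit(float* const dA[2], float* const dB[2], float* hC,
                          int m, int k, int n) {
    cublasHandle_t handle[2];
    cudaStream_t   stream[2];
    float*         dC[2];
    const float alpha = 1.0f, beta = 0.0f;

    for (int d = 0; d < 2; ++d) {
        int c0 = d * (n / 2);
        int nb = (d == 0) ? n / 2 : n - n / 2;   // columns owned by this GPU

        cudaSetDevice(d);
        cudaStreamCreate(&stream[d]);
        cublasCreate(&handle[d]);
        cublasSetStream(handle[d], stream[d]);
        cudaMalloc(&dC[d], (size_t)m * nb * sizeof(float));

        // Local GEMM on the replicated inputs: C_slab = A * B[:, c0:c0+nb].
        // B's column block starts at dB[d] + c0*k and keeps ldb = k.
        cublasSgemm(handle[d], CUBLAS_OP_N, CUBLAS_OP_N,
                    m, nb, k, &alpha,
                    dA[d], m,
                    dB[d] + (size_t)c0 * k, k,
                    &beta, dC[d], m);

        // Async copy of the finished slab into its place in host C; this D2H transfer
        // on stream[d] can overlap the other GPU's GEMM.
        cudaMemcpyAsync(hC + (size_t)c0 * m, dC[d],
                        (size_t)m * nb * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[d]);
    }

    // Synchronize both streams before hC is read, then release per-GPU resources.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        cudaFree(dC[d]);
        cublasDestroy(handle[d]);
        cudaStreamDestroy(stream[d]);
    }
}
```

A row split is equally valid, but in column-major storage each GPU's row block of C has leading dimension m and is strided, so the copy back needs cudaMemcpy2D or a compact mb × n scratch buffer; that is exactly the submatrix-addressing trade-off the memory-layout item asks about. For finer-grained overlap, cudaEventRecord / cudaStreamWaitEvent can chain per-tile GEMMs to per-tile copies instead of a single slab-sized copy at the end.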
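For the numerical-precision item, a common choice is FP16 (or BF16) inputs with FP32 accumulation on tensor cores. One hedged way to express that in the column-split sketch is to replace the per-slab cublasSgemm call with cublasGemmEx; the helper below assumes the slabs have already been converted to __half, and its name and parameters are illustrative:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_fp16.h>

// One GPU's slab GEMM with FP16 inputs and FP32 accumulation.
// dAh (m x k) and dBslab (k x nb, starting at the slab's first column) are __half
// replicas; dCslab is the __half output slab. The FP16 choice is an assumption.
void slab_gemm_fp16(cublasHandle_t handle, const __half* dAh, const __half* dBslab,
                    __half* dCslab, int m, int nb, int k) {
    const float alpha = 1.0f, beta = 0.0f;    // alpha/beta in the compute type (FP32)
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, nb, k, &alpha,
                 dAh,    CUDA_R_16F, m,
                 dBslab, CUDA_R_16F, k,
                 &beta,
                 dCslab, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F,           // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);         // let cuBLAS choose an algorithm
                                               // (tensor cores are used when available)
}
```

The determinism trade-off the item points at: changing the split, the GEMM algorithm, or any later reduction order changes the floating-point summation order and hence the low-order bits of C, so a bitwise-reproducible design has to pin those choices.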
State any minimal assumptions you need (e.g., matrices fit in GPU memory, NCCL/CUDA available) and provide enough detail that an engineer could implement the system.
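Under the stated assumption that NCCL is available, returning C replicated on both (or all) GPUs rather than on the host maps naturally onto an all-gather of the per-GPU slabs. A sketch for a single process driving nDev GPUs with one communicator per device; it assumes n is divisible by nDev so every slab has the same element count, and the buffer and parameter names are illustrative:

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

// All-gather the per-GPU column slabs of C so every GPU ends up with the full m x n result.
// If n is not divisible by nDev, pad the last slab or fall back to ncclSend/ncclRecv per rank.
void gather_C_on_all_gpus(float* dCslab[], float* dCfull[], cudaStream_t stream[],
                          int nDev, size_t m, size_t n) {
    std::vector<ncclComm_t> comms(nDev);
    std::vector<int> devs(nDev);
    for (int d = 0; d < nDev; ++d) devs[d] = d;
    ncclCommInitAll(comms.data(), nDev, devs.data());   // single-process communicators

    size_t slabElems = m * (n / nDev);

    ncclGroupStart();
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        // Rank d contributes its slab; the gathered buffer holds the slabs concatenated
        // in rank order, which for column-major column slabs is exactly the full C.
        ncclAllGather(dCslab[d], dCfull[d], slabElems, ncclFloat, comms[d], stream[d]);
    }
    ncclGroupEnd();

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
        ncclCommDestroy(comms[d]);
    }
}
```

The same collective generalizes directly beyond two GPUs. For failure handling, check every cudaError_t/ncclResult_t, abort a hung communicator with ncclCommAbort, and note that because A and B are replicated on every device, a lost slab of C can be recomputed on any surviving GPU without moving input data.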