PracHub

Design multi-GPU matrix multiplication

Last updated: Apr 29, 2026

Quick Overview

This question evaluates proficiency in multi-GPU parallelism and system-level ML engineering, covering data partitioning, inter-GPU communication primitives, compute scheduling and overlap, memory layout and buffer reuse, numerical precision trade-offs, synchronization, scalability, and failure handling.

  • hard
  • Google
  • ML System Design
  • Machine Learning Engineer


Company: Google

Role: Machine Learning Engineer

Category: ML System Design

Difficulty: hard

Interview Round: Technical Screen

Design and implement the computation C = A × B across two GPUs, where A and B must reside on both devices. Specify: data partitioning (row/column/block tiling); communication primitives (e.g., all-reduce, all-gather, point-to-point); compute scheduling (tiled GEMM with overlap of compute and communication); memory layout and buffer reuse; numerical precision; synchronization; and how you aggregate and return C. Discuss scalability and failure handling.
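Under the replication constraint, the simplest viable plan is to partition the rows of C: since both GPUs hold full copies of A and B, each GPU can compute its half of C's rows with no communication until the final gather. A minimal sketch that simulates the two devices with NumPy (the two-way split at `m // 2` is an illustrative assumption, not part of the original prompt):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 6, 5, 4
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# Both "GPUs" hold full A and B (the replication constraint);
# each one owns half of C's rows and computes them independently.
split = m // 2
C_gpu0 = A[:split] @ B      # rows 0 .. split-1 of C
C_gpu1 = A[split:] @ B      # rows split .. m-1 of C

# Gather step: concatenating the row blocks reassembles C.
C = np.concatenate([C_gpu0, C_gpu1], axis=0)
assert np.allclose(C, A @ B)
```

Because the partial results are disjoint row blocks, the gather maps to an all-gather (if both GPUs need full C) or point-to-point copies (if one GPU or the host collects it); no reduction is required.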


Related Interview Questions

  • Design an app-store app recommendation system - Google (medium)
  • Design a chatbot over structured and unstructured data - Google (medium)
  • Design a fraud detection system - Google (medium)
  • Choose Fast or Cheap Models - Google
  • Design ML system for self-driving perception - Google (medium)
Asked at Google · Sep 6, 2025 · Machine Learning Engineer · Technical Screen · ML System Design

Multi-GPU MatMul (2 GPUs): Design and Implementation

You are given two GPUs connected via NVLink or PCIe. You must compute C = A × B where:

  • A is shape m × k and B is shape k × n.
  • Constraint: A and B must be resident on both devices (i.e., replicated on GPU0 and GPU1).
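Replication makes the memory budget easy to bound: each GPU must hold all of A, all of B, and at least its own share of C. A back-of-envelope helper (the 8192-cube FP16 example sizes are illustrative assumptions):

```python
def per_gpu_bytes(m, k, n, dtype_bytes=2, num_gpus=2):
    """Bytes resident on one GPU when A (m x k) and B (k x n) are
    fully replicated and C (m x n) is partitioned across GPUs."""
    a = m * k * dtype_bytes                 # full copy of A
    b = k * n * dtype_bytes                 # full copy of B
    c = (m * n * dtype_bytes) // num_gpus   # this GPU's slice of C
    return a + b + c

# 8192^3 GEMM in FP16: 128 MiB (A) + 128 MiB (B) + 64 MiB (half of C)
print(per_gpu_bytes(8192, 8192, 8192) / 2**20)  # → 320.0
```

This bound ignores scratch/workspace buffers; a real budget would also reserve space for communication staging buffers and any double-buffered tiles.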

Design a solution that includes:

  1. Data partitioning
  • How you partition the output C across the two GPUs (row/column/block tiling).
  2. Communication primitives
  • Which collectives or point-to-point operations you will use (e.g., all-reduce, all-gather, send/recv), and when.
  3. Compute scheduling
  • The GEMM tiling strategy on each GPU.
  • How you overlap compute with any required communication.
  4. Memory layout and buffer reuse
  • Leading dimensions, alignment, submatrix addressing, scratch/temporary buffers, and reuse.
  5. Numerical precision
  • Dtypes, tensor-core utilization, accumulation precision, and determinism trade-offs.
  6. Synchronization
  • Streams, events/barriers, and how you ensure correctness.
  7. Aggregation and return of C
  • How you assemble and return C (to one GPU, to both GPUs, or to host) under the replication constraint for A and B.
  8. Scalability and failure handling
  • How the approach scales beyond two GPUs and what changes you would make.
  • Failure detection, retries, and graceful degradation.
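Points 3 and 5 interact: a tiled GEMM typically walks the reduction (K) dimension in blocks, and with FP16 inputs each block's partial product should be accumulated in FP32, which mirrors how tensor cores accumulate in hardware (e.g., cuBLAS compute type CUBLAS_COMPUTE_32F). A sketch of that pattern in NumPy, with an assumed tile size of 64:

```python
import numpy as np

def tiled_gemm_fp16(A16, B16, tile_k=64):
    """Tile the reduction (K) dimension; cast each FP16 tile up and
    accumulate in an FP32 accumulator, as tensor cores do."""
    m, k = A16.shape
    _, n = B16.shape
    acc = np.zeros((m, n), dtype=np.float32)   # FP32 accumulator
    for k0 in range(0, k, tile_k):
        a = A16[:, k0:k0 + tile_k].astype(np.float32)
        b = B16[k0:k0 + tile_k, :].astype(np.float32)
        acc += a @ b   # partial product for this K tile
    return acc

rng = np.random.default_rng(1)
A16 = rng.standard_normal((128, 256)).astype(np.float16)
B16 = rng.standard_normal((256, 64)).astype(np.float16)

ref = A16.astype(np.float32) @ B16.astype(np.float32)
assert np.allclose(tiled_gemm_fp16(A16, B16), ref, atol=1e-2)
```

The per-tile loop is also where compute/communication overlap would slot in on real hardware: while tile k0 runs on the compute stream, tile k0 + tile_k (or an outbound result block) can move on a separate copy stream, with events enforcing ordering.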

State any minimal assumptions you need (e.g., matrices fit in GPU memory, NCCL/CUDA available) and provide enough detail that an engineer could implement the system.
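For the scalability discussion, it helps to contrast the row-split scheme with the main alternative: splitting the reduction (K) dimension, where each GPU computes a full-size partial C and an all-reduce (sum) produces the final result. This trades a cheap gather for a more expensive reduction, but becomes attractive when K is very large or when C must end up replicated anyway. A NumPy simulation, with element-wise addition standing in for an NCCL all-reduce (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 8, 10, 6
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# K-split: each GPU takes half the reduction dimension and
# produces a full m x n *partial* C.
half = k // 2
partial0 = A[:, :half] @ B[:half, :]   # GPU0's contribution
partial1 = A[:, half:] @ B[half:, :]   # GPU1's contribution

# The element-wise sum plays the role of an all-reduce:
# after it, both GPUs would hold the complete C.
C = partial0 + partial1
assert np.allclose(C, A @ B)
```

Note the determinism angle from point 5: summation order in an all-reduce can vary across runs and topologies, so a K-split scheme may produce bitwise-different (though numerically equivalent) results, while the row-split scheme is deterministic per GPU.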


© 2026 PracHub. All rights reserved.