CUDA FP32 GEMM Design Task
Implement a high-performance CUDA kernel for matrix multiplication C = A · B where:
- A is m×k, B is k×n, C is m×n
- Data type: FP32
- Assume row-major layout unless otherwise stated
Specify and justify the following:
- Tiling and mapping
  - Choose concrete tile sizes and describe:
    - Block tile sizes (BM×BN×BK)
    - Threads per block and warp layout
    - Shared-memory tiling strategy (double buffering, if any)
    - Register tiling per thread (thread tile) and inner-loop unrolling strategy
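One concrete instantiation of these choices might look like the sketch below: BM=BN=64, BK=16, a 16×16 thread block (8 warps), and a strided 4×4 register tile per thread. All names and tile sizes are illustrative, not prescriptive; double buffering and boundary handling are omitted here for brevity (sizes are assumed to be multiples of the tile dimensions).

```cuda
#define BM 64
#define BN 64
#define BK 16
#define TM 4   // thread-tile rows per thread
#define TN 4   // thread-tile cols per thread

// C = A * B, row-major FP32. Launch: dim3 grid(N/BN, M/BM), dim3 block(16, 16).
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    const int tx = threadIdx.x;            // 0..15
    const int ty = threadIdx.y;            // 0..15
    const int rowBlock = blockIdx.y * BM;
    const int colBlock = blockIdx.x * BN;

    float acc[TM][TN] = {};                // per-thread register tile

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative staging: consecutive tx reads consecutive
        // global addresses, so both loads are coalesced.
        #pragma unroll
        for (int i = 0; i < TM; ++i)
            As[ty + i * 16][tx] = A[(rowBlock + ty + i * 16) * K + k0 + tx];
        #pragma unroll
        for (int j = 0; j < TN; ++j)
            Bs[ty][tx + j * 16] = B[(k0 + ty) * N + colBlock + tx + j * 16];
        __syncthreads();                   // tiles fully staged before use

        #pragma unroll
        for (int kk = 0; kk < BK; ++kk) {
            float a[TM], b[TN];
            #pragma unroll
            for (int i = 0; i < TM; ++i) a[i] = As[ty + i * 16][kk];
            #pragma unroll
            for (int j = 0; j < TN; ++j) b[j] = Bs[kk][tx + j * 16];
            #pragma unroll
            for (int i = 0; i < TM; ++i)
                #pragma unroll
                for (int j = 0; j < TN; ++j)
                    acc[i][j] = fmaf(a[i], b[j], acc[i][j]);
        }
        __syncthreads();                   // don't overwrite tiles still in use
    }

    // Strided thread tile: consecutive tx writes consecutive addresses,
    // so the stores are coalesced as well.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(rowBlock + ty + i * 16) * N + colBlock + tx + j * 16] = acc[i][j];
}
```

The strided mapping (each thread owns rows `ty + i*16` and columns `tx + j*16` rather than a contiguous 4×4 patch) is what keeps both the `Bs` reads and the `C` stores contiguous across a warp.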
- Memory access efficiency
  - How you ensure coalesced global loads/stores
  - How you avoid shared-memory bank conflicts
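A standard illustration of both points is the padded shared-memory transpose: global reads and writes stay coalesced, and one padding column removes what would otherwise be a 32-way bank conflict on the transposed shared read. The kernel below is a textbook sketch (it assumes `n` is a multiple of 32 and a `dim3 block(32, 32)` launch):

```cuda
__global__ void transpose32(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    // 32 FP32 banks: words 32 floats apart map to the same bank, so a
    // column read of a 32x32 tile would serialize 32x. The extra
    // padding column rotates successive rows across banks.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                // swap block coords
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read: conflict-free
}
```

The same `[...][BK + 1]` padding trick applies to GEMM shared tiles whenever a warp reads a shared array down a column.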
- Edge handling
  - How to handle boundary tiles when m, n, or k is not a multiple of the chosen tile sizes
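One common answer is to predicate the loads and the final store: out-of-range loads are replaced by 0.0f, which is the identity for the FMA accumulation, so the inner loop stays branch-free. A guarded variant with a naive 1×1 thread tile (illustrative, 16×16 tiles, launched with a ceiling-divided grid) might look like:

```cuda
// Launch: dim3 grid((N + 15) / 16, (M + 15) / 16), dim3 block(16, 16).
__global__ void sgemm_edges(int M, int N, int K,
                            const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C) {
    __shared__ float As[16][16], Bs[16][16];
    int row = blockIdx.y * 16 + threadIdx.y;
    int col = blockIdx.x * 16 + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += 16) {
        // Guarded loads: out-of-range elements become 0.0f, the
        // identity for the accumulation below.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int kk = 0; kk < 16; ++kk)
            acc = fmaf(As[threadIdx.y][kk], Bs[kk][threadIdx.x], acc);
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = acc;   // masked store
}
```

In practice one often keeps a fast unguarded kernel for interior tiles and uses predication only on boundary tiles, since the guards cost a little throughput.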
- Occupancy analysis
  - Given an SM with 64 warps/SM, 64K registers/SM, and 100 KB shared memory/SM
  - Using your threads/block, registers/thread, and shared memory/block, compute the theoretical occupancy and identify the limiting resource
- Synchronization and numerical considerations
  - Synchronization strategy within a block
  - Accumulation order and precision considerations
- Expected performance vs. cuBLAS
  - Briefly compare, quantify the expected gap, and explain why it exists
- CUDA execution and memory model
  - Explain grids, blocks, threads, warps, and SMs; global, shared, register, constant, and texture memory; barriers and atomics
  - Explain how these concepts inform your design choices