Implement CUDA-tiled matrix multiplication and explain architecture
Company: NVIDIA
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: hard
Interview Round: HR Screen
Implement a high-performance kernel for C = A(m×k) · B(k×n) in CUDA (FP32). Specify: 1) Tile sizes, thread/block layout, shared-memory tiling, register tiling, and unrolling strategy. 2) How you ensure coalesced global loads/stores and avoid shared-memory bank conflicts. 3) Handling of edge tiles when m, n, or k are not multiples of the tile size. 4) Occupancy analysis on an SM with 64 warps/SM, 64K 32-bit registers/SM, and 100 KB shared memory/SM: given your threads/block, registers/thread, and shared memory/block, compute the occupancy and identify the limiting resource. 5) Synchronization strategy and numerical considerations (accumulation order). 6) Briefly compare expected performance vs. cuBLAS and justify any gap. 7) Explain CUDA's execution/memory model (grids, blocks, threads, warps, SMs; global/shared/register/constant/texture memory; barriers/atomics) and how it informs your design.
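A minimal sketch of the shared-memory-tiled kernel the question asks for, covering tiling, coalescing, edge guards, and synchronization. The 16×16 tile size is an illustrative choice, and this version omits the register tiling and double buffering a full answer would add:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile edge; a full answer would also register-tile

// C = A (m x k) * B (k x n), row-major FP32.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int m, int n, int k) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // col of C this thread owns
    float acc = 0.0f;

    // Sweep the k dimension one tile at a time.
    for (int t = 0; t < (k + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;  // consecutive threadIdx.x hits
        int b_row = t * TILE + threadIdx.y;  // consecutive addresses: coalesced
        // Zero-fill out-of-range elements so edge tiles need no special inner loop.
        As[threadIdx.y][threadIdx.x] =
            (row < m && a_col < k) ? A[row * k + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < k && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully staged before any thread reads it

        #pragma unroll
        for (int p = 0; p < TILE; ++p)
            acc += As[threadIdx.y][p] * Bs[p][threadIdx.x];
        __syncthreads();  // all reads done before the tile is overwritten
    }

    if (row < m && col < n)          // edge guard on the store
        C[row * n + col] = acc;      // coalesced: consecutive threadIdx.x, cols
}
```

Launch with `dim3 block(TILE, TILE)` and `dim3 grid((n + TILE - 1) / TILE, (m + TILE - 1) / TILE)`. Accumulation order is fixed per output element (sequential over k in registers), which keeps results deterministic run to run.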
Quick Answer: This question evaluates a candidate's competency in CUDA GPU programming, parallel algorithms, and performance engineering for FP32 matrix multiplication, covering tiling strategies, memory hierarchy (global/shared/register), synchronization, numerical precision, and occupancy/resource analysis.
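The occupancy analysis in part 4 reduces to simple resource arithmetic. The sketch below uses the SM limits stated in the question (64 warps/SM, 64K registers/SM, 100 KB shared memory/SM); the kernel figures of 256 threads/block, 40 registers/thread, and 8 KB shared memory/block are illustrative assumptions, not part of the question:

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=64 * 1024, smem_per_sm=100 * 1024,
              warp_size=32):
    """Blocks/SM, limiting resource, and occupancy for the stated SM limits."""
    warps_per_block = threads_per_block // warp_size
    limits = {
        "warps": max_warps // warps_per_block,
        "registers": regs_per_sm // (threads_per_block * regs_per_thread),
        "shared memory": smem_per_sm // smem_per_block,
    }
    limiter = min(limits, key=limits.get)   # scarcest resource wins
    blocks = limits[limiter]
    return blocks, limiter, blocks * warps_per_block / max_warps

# Assumed kernel: 256 threads/block, 40 regs/thread, 8 KB smem/block
# (e.g. two 32x32 FP32 staging tiles = 2 * 32 * 32 * 4 B = 8192 B).
blocks, limiter, occ = occupancy(256, 40, 8192)
print(blocks, limiter, occ)  # 6 blocks/SM, limited by registers, 75% occupancy
```

Here warps allow 8 blocks and shared memory 12, but registers allow only floor(65536 / (256 × 40)) = 6 blocks, so registers are the limiting resource: 6 × 8 = 48 resident warps out of 64, i.e. 75% occupancy.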