CUDA FP32 GEMM Design Task
Implement a high-performance CUDA kernel for matrix multiplication C = A · B where:
- A is m×k, B is k×n, C is m×n
- Data type: FP32
- Assume row-major layout unless otherwise stated
Specify and justify the following:
- Tiling and mapping
  - Choose concrete tile sizes and describe:
    - Block tile sizes (BM×BN×BK)
    - Threads per block and warp layout
    - Shared-memory tiling strategy (double buffering, if any)
    - Register tiling per thread (thread tile) and inner-loop unrolling strategy
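One concrete instantiation of these choices might look like the sketch below: BM=BN=64, BK=16, a 16×16 thread block (8 warps), and a strided 4×4 register tile per thread. All names and tile sizes are illustrative, not prescriptive; double buffering and boundary handling are omitted here for brevity (sizes are assumed to be multiples of the tile dimensions).

```cuda
#define BM 64
#define BN 64
#define BK 16
#define TM 4   // thread-tile rows per thread
#define TN 4   // thread-tile cols per thread

// C = A * B, row-major FP32. Launch: dim3 grid(N/BN, M/BM), dim3 block(16, 16).
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    const int tx = threadIdx.x;            // 0..15
    const int ty = threadIdx.y;            // 0..15
    const int rowBlock = blockIdx.y * BM;
    const int colBlock = blockIdx.x * BN;

    float acc[TM][TN] = {};                // per-thread register tile

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative staging: consecutive tx reads consecutive
        // global addresses, so both loads are coalesced.
        #pragma unroll
        for (int i = 0; i < TM; ++i)
            As[ty + i * 16][tx] = A[(rowBlock + ty + i * 16) * K + k0 + tx];
        #pragma unroll
        for (int j = 0; j < TN; ++j)
            Bs[ty][tx + j * 16] = B[(k0 + ty) * N + colBlock + tx + j * 16];
        __syncthreads();                   // tiles fully staged before use

        #pragma unroll
        for (int kk = 0; kk < BK; ++kk) {
            float a[TM], b[TN];
            #pragma unroll
            for (int i = 0; i < TM; ++i) a[i] = As[ty + i * 16][kk];
            #pragma unroll
            for (int j = 0; j < TN; ++j) b[j] = Bs[kk][tx + j * 16];
            #pragma unroll
            for (int i = 0; i < TM; ++i)
                #pragma unroll
                for (int j = 0; j < TN; ++j)
                    acc[i][j] = fmaf(a[i], b[j], acc[i][j]);
        }
        __syncthreads();                   // don't overwrite tiles still in use
    }

    // Strided thread tile: consecutive tx writes consecutive addresses,
    // so the stores are coalesced as well.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(rowBlock + ty + i * 16) * N + colBlock + tx + j * 16] = acc[i][j];
}
```

The strided mapping (each thread owns rows `ty + i*16` and columns `tx + j*16` rather than a contiguous 4×4 patch) is what keeps both the `Bs` reads and the `C` stores contiguous across a warp.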
- Memory access efficiency
  - How you ensure coalesced global loads/stores
  - How you avoid shared-memory bank conflicts
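A standard illustration of both points is the padded shared-memory transpose: global reads and writes stay coalesced, and one padding column removes what would otherwise be a 32-way bank conflict on the transposed shared read. The kernel below is a textbook sketch (it assumes `n` is a multiple of 32 and a `dim3 block(32, 32)` launch):

```cuda
__global__ void transpose32(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    // 32 FP32 banks: words 32 floats apart map to the same bank, so a
    // column read of a 32x32 tile would serialize 32x. The extra
    // padding column rotates successive rows across banks.
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                // swap block coords
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read: conflict-free
}
```

The same `[...][BK + 1]` padding trick applies to GEMM shared tiles whenever a warp reads a shared array down a column.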
- Edge handling
  - How to handle boundary tiles when m, n, or k is not a multiple of the chosen tile sizes
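One common answer is to predicate the loads and the final store: out-of-range loads are replaced by 0.0f, which is the identity for the FMA accumulation, so the inner loop stays branch-free. A guarded variant with a naive 1×1 thread tile (illustrative, 16×16 tiles, launched with a ceiling-divided grid) might look like:

```cuda
// Launch: dim3 grid((N + 15) / 16, (M + 15) / 16), dim3 block(16, 16).
__global__ void sgemm_edges(int M, int N, int K,
                            const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C) {
    __shared__ float As[16][16], Bs[16][16];
    int row = blockIdx.y * 16 + threadIdx.y;
    int col = blockIdx.x * 16 + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += 16) {
        // Guarded loads: out-of-range elements become 0.0f, the
        // identity for the accumulation below.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int kk = 0; kk < 16; ++kk)
            acc = fmaf(As[threadIdx.y][kk], Bs[kk][threadIdx.x], acc);
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = acc;   // masked store
}
```

In practice one often keeps a fast unguarded kernel for interior tiles and uses predication only on boundary tiles, since the guards cost a little throughput.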
- Occupancy analysis
  - Given an SM with 64 warps/SM, 64K registers/SM, and 100 KB shared memory/SM
  - Using your threads/block, registers/thread, and shared memory/block, compute the theoretical occupancy and identify the limiting resource
- Synchronization and numerical considerations
  - Synchronization strategy within a block
  - Accumulation order and precision considerations
- Expected performance vs. cuBLAS
  - Briefly compare, quantify the expected gap, and explain why it exists
- CUDA execution and memory model
  - Explain grids, blocks, threads, warps, and SMs; global, shared, register, constant, and texture memory; barriers and atomics
  - Explain how these concepts inform your design choices