CUDA Execution Model, Memory Hierarchy, and GEMM Kernel Design
Part 1: Concepts
Explain:
- CUDA execution model: grids, blocks, warps, and threads (SIMT); how threads map to hardware, how warps are scheduled, and how synchronization works.
- Memory hierarchy: registers, shared memory (and bank conflicts), global memory (and coalescing), L1/L2 caches, constant/texture memory, and local memory (register spill).
- Occupancy: its definition, how it is limited by registers, shared memory, and thread count, and why higher occupancy is not always better (a worked calculation follows this list).
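To make the occupancy limits concrete, here is a worked calculation; the figures (an SM with 65,536 registers and a 2,048-thread / 64-warp limit, a kernel using 64 registers per thread in 256-thread blocks) are assumptions for illustration, and real limits should be read from cudaGetDeviceProperties:

```latex
\frac{65{,}536~\text{regs/SM}}{64~\text{regs/thread} \times 256~\text{threads/block}} = 4~\text{resident blocks}
\;\Rightarrow\; 1{,}024~\text{threads} = 32~\text{warps}
\;\Rightarrow\; \text{occupancy} = \tfrac{32}{64} = 50\%
```

Here registers, not shared memory or the thread limit, are the binding constraint; lowering register use per thread (e.g., via __launch_bounds__ or -maxrregcount) would raise occupancy, possibly at the cost of spilling to local memory.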
Part 2: Single-Precision GEMM Kernel (C = A × B)
Design an efficient sgemm kernel for row-major matrices and optimize it iteratively:
- Start with a naive kernel in which each thread computes one C[i, j] (first sketch after this list).
- Optimize with tiling and shared memory.
- Ensure coalesced global loads and stores.
- Avoid shared memory bank conflicts.
- Unroll inner loops (the tiled sketch after this list folds in coalescing, bank-conflict padding, unrolling, and edge handling).
- Use vectorized loads/stores (e.g., float4), with alignment handling (see the float4 sketch below).
- Choose grid/block sizes and tile shapes, and explain the trade-offs (register and shared-memory pressure vs. occupancy).
- Handle edge cases where M, N, or K is not a multiple of the tile size.
- Compare achieved throughput to a cuBLAS sgemm baseline in GFLOP/s (see the timing sketch below).
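A minimal sketch of the naive starting point (the kernel name and the 16×16 launch shape are illustrative choices, not a required interface):

```cuda
// Naive SGEMM: one thread computes one element C[row, col].
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch: threadIdx.x maps to columns, so consecutive threads read
// consecutive elements of B and write consecutive elements of C (coalesced).
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// sgemm_naive<<<grid, block>>>(M, N, K, dA, dB, dC);
```

Each element of A is read from global memory by all N threads in its output row, and each element of B by all M threads in its output column; that redundancy is what the tiled version below removes.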
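One possible shape for the tiled version; it also bakes in coalesced loads, bank-conflict padding, loop unrolling, and zero-fill edge handling, so it covers several of the bullets at once. TILE = 32 and the transposed A-tile layout are illustrative choices:

```cuda
#define TILE 32

// Tiled SGEMM: each block computes a TILE x TILE tile of C, staging
// tiles of A and B through shared memory so each global element is
// loaded once per tile rather than once per thread.
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE + 1];  // A tile, stored transposed
    __shared__ float Bs[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int ka = t * TILE + threadIdx.x;  // column of A this thread loads
        int kb = t * TILE + threadIdx.y;  // row of B this thread loads

        // threadIdx.x varies fastest, so both global reads are coalesced;
        // out-of-range elements are zero-filled, which handles partial
        // tiles when M, N, or K is not a multiple of TILE.
        // The transposed store of As walks down a shared-memory column:
        // with a row stride of 32 floats it would be a 32-way bank
        // conflict, so the +1 padding makes the stride 33 and spreads
        // the warp across all 32 banks.
        As[threadIdx.x][threadIdx.y] = (row < M && ka < K) ? A[row * K + ka] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        __syncthreads();

        #pragma unroll  // TILE is a compile-time constant, so this fully unrolls
        for (int k = 0; k < TILE; ++k)
            acc += As[k][threadIdx.y] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Launched as dim3 block(TILE, TILE) with a grid covering C, each block uses 1,024 threads and roughly 8.5 KB of shared memory; shrinking the tile eases that pressure and raises occupancy but reduces data reuse, which is exactly the trade-off the grid/block-size bullet asks you to weigh.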
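The float4 idiom in isolation, shown on a plain copy so the access pattern is easy to see (`copy_vec4` is a hypothetical helper; 16-byte alignment of the base pointers is assumed, which cudaMalloc guarantees):

```cuda
// Each thread moves four consecutive floats with one 128-bit access.
// reinterpret_cast to float4* is only legal when the address is
// 16-byte aligned; the scalar tail covers counts not divisible by 4.
__global__ void copy_vec4(int n, const float* __restrict__ src,
                          float* __restrict__ dst) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        float4 v = *reinterpret_cast<const float4*>(src + i);
        *reinterpret_cast<float4*>(dst + i) = v;
    } else {
        for (int j = i; j < n; ++j)  // scalar tail for the remainder
            dst[j] = src[j];
    }
}
```

In the GEMM itself the same trick applies to the global-to-shared tile loads, provided K and the leading dimensions keep each row 16-byte aligned; otherwise fall back to scalar loads for the misaligned prefix and tail.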
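For the baseline, a minimal timing harness (the helper name and event-based timing are illustrative; error checking is omitted for brevity):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Time cuBLAS SGEMM and report GFLOP/s. cuBLAS is column-major; a
// row-major C = A * B can be computed as column-major C^T = B^T * A^T
// simply by swapping the operand order, with no explicit transposes.
double cublas_gflops(cublasHandle_t h, int M, int N, int K,
                     const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, dB, N, dA, K, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    // GEMM performs 2*M*N*K floating-point ops (one multiply and one
    // add per inner-product term).
    return (2.0 * M * N * K) / (ms * 1e-3) / 1e9;
}
```

Time your own kernel with the same events and compare the two rates; do a warmup call before timing so one-time initialization is not counted.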
Part 3: Overlap and Measurement
- Overlapping host–device transfers with compute: CUDA streams, double buffering, pinned memory, and cublasSetStream (a pipeline sketch follows this list).
- How to measure achieved occupancy, arithmetic intensity (FLOPs/byte), and memory bandwidth, with formulas and practical tools (the standard SGEMM counts follow this list).
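A minimal double-buffering sketch for the overlap bullet; the chunked layout, two-buffer scheme, and `pipelined_gemm` signature are assumptions made for illustration, and the host chunks must come from cudaMallocHost for the copies to be truly asynchronous:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// While chunk i's GEMM runs on one stream, chunk i+1's host-to-device
// copy proceeds on the other. Work within a stream executes in order,
// so each GEMM automatically waits for its own chunk's copy, and the
// copy for chunk i+2 waits for chunk i's GEMM before reusing dA[buf].
void pipelined_gemm(cublasHandle_t handle, int chunks, size_t bytesA,
                    float** hA,      // pinned host chunks of A (M x K each)
                    float* dA[2],    // two device buffers for A (ping-pong)
                    const float* dB, // B resident on the device
                    float** dC,      // one device output buffer per chunk
                    int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int i = 0; i < chunks; ++i) {
        int buf = i & 1;  // alternate between the two buffers/streams
        cudaMemcpyAsync(dA[buf], hA[i], bytesA,
                        cudaMemcpyHostToDevice, stream[buf]);
        // Point cuBLAS at the same stream so the GEMM queues behind
        // the copy it depends on.
        cublasSetStream(handle, stream[buf]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                    &alpha, dB, N, dA[buf], K, &beta, dC[i], N);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```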
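For the formulas, the standard SGEMM counts, assuming fp32 and each matrix crossing the DRAM boundary exactly once (so the byte count is a lower bound):

```latex
\text{FLOPs} = 2MNK, \qquad
\text{Bytes}_{\min} = 4\,(MN + NK + KM), \qquad
I = \frac{2MNK}{4\,(MN + NK + KM)}
  \;\xrightarrow{\;M = N = K\;}\; \frac{N}{6}\ \text{FLOPs/byte}.
```

Achieved GFLOP/s is 2MNK divided by measured kernel time, and achieved bandwidth is actual bytes moved divided by that time. Actual DRAM traffic and achieved occupancy are best read from a profiler rather than estimated by hand: Nsight Compute (ncu) for per-kernel metrics, Nsight Systems for timeline-level overlap of copies and compute.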