Optimize CUDA GEMM with tiling and coalescing

Q: Optimize CUDA GEMM with tiling and coalescing

This question evaluates proficiency in GPU programming and performance optimization. It covers the CUDA execution model, the memory hierarchy and bank conflicts, occupancy limits, and practical kernel design for single-precision GEMM: tiling, coalescing, vectorized memory accesses, and benchmarking.

Question

CUDA Execution Model, Memory Hierarchy, and GEMM Kernel Design

Part 1: Concepts

Explain:

CUDA execution model: grid, blocks, warps, threads (SIMT). How threads map to hardware, warp scheduling, synchronization.
Memory hierarchy: registers, shared memory (and bank conflicts), global memory (and coalescing), L1/L2 caches, constant/texture memory, and local memory (register spill).
Occupancy: definition, how it is limited by registers/shared memory/threads, why higher occupancy is not always better.
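The occupancy definition above (resident warps per SM divided by the maximum the SM supports) can be queried from the runtime rather than computed by hand. The following is a minimal sketch, not a definitive tool: `scale_kernel` and `theoretical_occupancy` are hypothetical names introduced for illustration, and the result is the theoretical occupancy for one block size, not the achieved occupancy a profiler would report.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; only its register and shared-memory
// footprint matters for the occupancy query below.
__global__ void scale_kernel(float* data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

// Returns theoretical occupancy in (0, 1] for `block_size` threads per block,
// or a negative value if any CUDA call fails (for example, no device present).
float theoretical_occupancy(int block_size, size_t dynamic_smem_bytes) {
    assert(block_size > 0);
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    // How many blocks of this kernel can be resident on one SM at once,
    // given its register, shared-memory, and thread-count limits.
    int max_blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks_per_sm, scale_kernel, block_size,
            dynamic_smem_bytes) != cudaSuccess) {
        return -1.0f;
    }

    int warps_per_block = (block_size + prop.warpSize - 1) / prop.warpSize;
    int max_warps_per_sm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return float(max_blocks_per_sm * warps_per_block) / float(max_warps_per_sm);
}
```

Calling `theoretical_occupancy(256, 0)` and then repeating the call with other block sizes shows how the limit shifts. A kernel that keeps more values in registers may score lower here and still run faster, which is the sense in which higher occupancy is not always better.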

Part 2: Single-Precision GEMM Kernel (C = A × B)

Design and iterate an efficient sgemm kernel for row‑major matrices:

Start with a naive kernel (each thread computes one C[i, j]).
Optimize with tiling and shared memory.
Ensure coalesced global loads/stores.
Avoid shared memory bank conflicts.
Unroll inner loops.
Use vectorized loads/stores (e.g., float4), with alignment handling.
Choose grid/block sizes and tile shapes. Explain trade‑offs (register/shared‑mem pressure vs occupancy).
Handle edge cases where M, N, or K are not multiples of tile sizes.
Compare achieved throughput to a cuBLAS sgemm baseline (GFLOP/s).
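Several of the steps above (tiling through shared memory, coalesced global accesses, an unrolled inner loop, and edge handling) can be sketched in one kernel. This is a starting point under stated assumptions, not a tuned implementation: the names `sgemm_tiled`, `run_sgemm_tiled`, and the tile size `TILE = 32` are choices made for illustration, each thread still computes a single C element, and float4 loads and register blocking are left out.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile edge; one 32x32 thread block per C tile

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void sgemm_tiled(const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        const int a_col = k0 + tx;
        const int b_row = k0 + ty;
        // Threads with consecutive tx read consecutive addresses of a row of
        // A and of B, so each warp's global load is coalesced. Out-of-range
        // elements are replaced by zero, which handles M, N, K that are not
        // multiples of TILE.
        As[ty][tx] = (row < M && a_col < K) ? A[(size_t)row * K + a_col] : 0.0f;
        Bs[ty][tx] = (b_row < K && col < N) ? B[(size_t)b_row * N + col] : 0.0f;
        __syncthreads();

        // Within a warp (fixed ty, tx = 0..31), As[ty][k] is one address
        // (a broadcast) and Bs[k][tx] is stride-1, so neither read causes
        // bank conflicts. Padding a tile to [TILE][TILE + 1] would only be
        // needed if a warp walked down a column of it.
        #pragma unroll
        for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    if (row < M && col < N) C[(size_t)row * N + col] = acc;  // coalesced store
}

// Host helper: copies inputs to the device, launches the kernel, copies C
// back. Returns the first CUDA error encountered.
cudaError_t run_sgemm_tiled(const float* hA, const float* hB, float* hC,
                            int M, int N, int K) {
    assert(M > 0 && N > 0 && K > 0);
    const size_t bytesA = sizeof(float) * M * K;
    const size_t bytesB = sizeof(float) * K * N;
    const size_t bytesC = sizeof(float) * M * N;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytesA);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytesB);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytesC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytesA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytesB, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);  // round up
        sgemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytesC, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

The naive kernel is the same code with the shared-memory tiles removed and the inner loop reading A and B directly from global memory. The usual next step from this sketch is to give each thread a small block of C elements held in registers, which raises arithmetic intensity at the cost of register pressure and therefore occupancy.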

Part 3: Overlap and Measurement

Overlapping host–device transfers with compute: CUDA streams, double buffering, pinned memory, cublasSetStream.
How to measure: achieved occupancy, arithmetic intensity (FLOPs/byte), and memory bandwidth. Include formulas and practical tools.
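The measurement formulas can be written down as small host-side helpers. This is a sketch under one stated assumption: the byte count is the compulsory minimum (A and B each read once, C written once), which a real kernel exceeds unless its tiles are large. All function names here are hypothetical. Achieved occupancy is not computed by this code; it is reported by a profiler such as Nsight Compute.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// FLOPs for C (M x N) = A (M x K) * B (K x N): one multiply and one add
// per inner-product term, so 2 * M * N * K.
double sgemm_flops(int M, int N, int K) {
    return 2.0 * double(M) * double(N) * double(K);
}

// Assumed lower bound on global-memory traffic in bytes: read A and B once,
// write C once. Actual traffic depends on tiling and caching.
double sgemm_min_bytes(int M, int N, int K) {
    return sizeof(float) *
           (double(M) * K + double(K) * N + double(M) * N);
}

// Arithmetic intensity in FLOPs per byte, using the lower-bound traffic.
double arithmetic_intensity(int M, int N, int K) {
    return sgemm_flops(M, N, K) / sgemm_min_bytes(M, N, K);
}

// Achieved throughput in GFLOP/s for one GEMM that took `seconds`.
double gflops(int M, int N, int K, double seconds) {
    assert(seconds > 0.0);
    return sgemm_flops(M, N, K) / seconds / 1e9;
}

// Achieved bandwidth in GB/s for `bytes` moved in `seconds`.
double bandwidth_gbs(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

// Times `iters` calls of a user-supplied launch function with CUDA events.
// Returns average seconds per call, or a negative value on CUDA failure.
double time_launch_seconds(void (*launch)(cudaStream_t), cudaStream_t stream,
                           int iters) {
    assert(iters > 0);
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0;
    }
    launch(stream);  // warm-up call, not timed
    cudaEventRecord(start, stream);
    for (int i = 0; i < iters; ++i) launch(stream);
    cudaEventRecord(stop, stream);

    float ms = 0.0f;
    cudaError_t err = cudaEventSynchronize(stop);
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? (ms / 1e3) / iters : -1.0;
}
```

For square matrices of size n the lower-bound intensity works out to 2n^3 / (12n^2) = n/6 FLOPs per byte, so large GEMMs are compute-bound in principle. Timing the custom kernel and a cuBLAS sgemm call with the same `time_launch_seconds` wrapper, then passing both times to `gflops`, gives a like-for-like comparison.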
