Explain CUDA’s execution model (grid, blocks, warps, threads), memory hierarchy (global, shared, registers, constant/texture), and occupancy. Then design an efficient single‑precision GEMM kernel C=A×B: start with a naive kernel; optimize with tiling and shared memory, coalesced loads/stores, avoiding bank conflicts, loop unrolling, and vectorized accesses. Choose grid/block sizes, handle non‑multiple tile edges, and compare throughput to a cuBLAS baseline. How would you overlap host‑device transfers with compute (streams, pinned memory) and measure achieved occupancy, arithmetic intensity, and memory bandwidth?
Quick Answer: This question evaluates proficiency in GPU programming and performance optimization, assessing understanding of the CUDA execution model, memory hierarchy and bank conflicts, occupancy limits, and practical kernel design concerns for single-precision GEMM including tiling, coalescing, vectorized memory accesses and benchmarking.
Solution
Below is a teaching-oriented walkthrough. We assume single precision, row‑major matrices with leading dimensions lda = K, ldb = N, ldc = N unless noted. If using cuBLAS for comparison, remember cuBLAS defaults to column‑major; see notes under benchmarking.
----------------------------------------
Section 1 — CUDA execution model (grid, blocks, warps, threads)
----------------------------------------
- Threads: Smallest execution unit. Each thread has its own registers; on Volta and later architectures each thread also has its own program counter (earlier architectures maintain one per warp).
- Warps: Hardware schedules threads in groups of 32 (a warp). SIMT: a warp issues one instruction for all its active threads. Divergence (if/else) within a warp serializes paths.
- Blocks: A thread block is a set of threads that run on the same SM (Streaming Multiprocessor). Threads in a block can cooperatively load/store shared memory and synchronize via __syncthreads(). Block size and resources affect occupancy.
- Grid: All blocks of a kernel launch form a grid. Blocks are scheduled onto SMs as resources allow; there is no guaranteed ordering across blocks.
- Scheduling: The SM holds multiple resident warps; the warp scheduler selects ready warps each cycle. Latency hiding relies on having enough independent warps and/or instruction-level parallelism (ILP).
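To make the mapping concrete, here is a minimal, illustrative kernel (names are ours, not part of the question) that derives a thread's global index, its warp, and its lane within the warp:

    // Illustrative only: one thread per output element.
    __global__ void whoami(int* out, int n) {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        int warp = threadIdx.x / 32;                       // warp index within the block
        int lane = threadIdx.x % 32;                       // lane within the warp
        if (tid < n) out[tid] = warp * 32 + lane;          // guard: the grid may overshoot n
    }
    // Launch, e.g.: whoami<<<(n + 255) / 256, 256>>>(d_out, n);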
----------------------------------------
Section 2 — Memory hierarchy
----------------------------------------
- Registers: Fastest, per-thread. Limited quantity. Excess use spills to local memory (in global DRAM).
- Shared memory: On-chip, per-block scratchpad. Low latency, high bandwidth. Organized in 32 banks (commonly 4 bytes per bank). Bank conflicts occur when multiple threads in a warp access different addresses that map to the same bank; fix via padding or access pattern changes.
- Global memory (device DRAM): High capacity, high latency. Best performance with coalesced access: threads in a warp reading contiguous, properly aligned 32B/64B/128B segments. L2/L1 caches may help but do not replace coalescing.
- L1/L2 caches: Hardware-managed. L2 is global/shared across SMs; L1 can be combined with shared memory depending on architecture.
- Constant memory: Read-only, cached, warp-broadcast friendly when all threads read the same address.
- Texture memory: Read-only path with specialized cache optimized for 2D locality and relaxed coalescing constraints.
- Local memory: Per-thread logical space in DRAM used for register spills and large thread-local arrays.
----------------------------------------
Section 3 — Occupancy
----------------------------------------
- Definition: Occupancy = active warps per SM divided by the architectural maximum warps per SM. It is constrained by threads/block, registers/thread, and shared memory/block.
- Why it matters: More active warps help hide latency of memory and long-latency ops. But beyond a point, extra occupancy may not increase performance, and can even hurt if it forces lower ILP or register spilling. Aim for "sufficient" occupancy (often 25–75%) while maximizing data reuse and ILP.
- Estimation/measurement:
- At compile/launch time: cudaOccupancyMaxActiveBlocksPerMultiprocessor, occupancy calculator spreadsheets.
- At runtime/profiling: Nsight Compute metric sm__warps_active.avg.pct_of_peak_sustained_active (or similar) and limiting factors (registers/shared memory).
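A sketch of the analytical route (assuming a kernel symbol myKernel launched with 256 threads per block and no dynamic shared memory):

    int numBlocks = 0, device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  256 /* threads per block */,
                                                  0   /* dynamic shared mem bytes */);
    int warpsPerBlock = 256 / prop.warpSize;
    float occupancy = (float)(numBlocks * warpsPerBlock) /
                      (prop.maxThreadsPerMultiProcessor / prop.warpSize);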
----------------------------------------
Section 4 — Baseline GEMM and its issues
----------------------------------------
Goal: C[M×N] = A[M×K] × B[K×N]. FLOPs = 2 × M × N × K.
Naive kernel (each thread computes one C element):
- Mapping: gridDim = (ceil(N/TX), ceil(M/TY)), blockDim = (TX, TY). Typically TX=TY=16 or 32.
- Kernel sketch (row‑major):
1) Compute (i, j) from blockIdx/threadIdx.
2) Accumulate float acc = 0; for (k=0..K-1) acc += A[i*K + k] * B[k*N + j];
3) C[i*N + j] = acc.
- Problems:
- Each thread loads K elements of A row and K elements of B column from global memory with minimal reuse; arithmetic intensity (FLOPs/byte) is low, so performance is bandwidth-limited.
- Many loads are not coalesced (especially B’s column access).
Small numeric example (M=N=K=256): a thread computing C[0,0] reads 512 floats and writes 1 float to perform 512 multiply-adds (1024 FLOPs). That is 4×(512+1) ≈ 2052 bytes for 1024 FLOPs, i.e. about 0.5 FLOPs/byte, far too low to keep modern GPUs compute-bound.
----------------------------------------
Section 5 — Tiled/shared-memory GEMM design
----------------------------------------
High-level strategy:
- Partition C into block tiles of size BM×BN computed by a thread block.
- Iterate K in chunks of BK.
- For each k-tile, cooperatively load an A tile (BM×BK) and a B tile (BK×BN) into shared memory with coalesced, vectorized accesses.
- Compute partial products in registers using a per-thread micro-tile (RM×RN) to increase ILP and data reuse.
- Unroll the inner BK loop.
- Avoid shared memory bank conflicts via padding/transposition.
- Write results back coalesced and vectorized.
Recommended pedagogical configuration (works well across many GPUs; tune per architecture):
- Block tile: BM = 128, BN = 128.
- K tile: BK = 8 or 16.
- Threads per block: 256 arranged as (TX, TY) = (16, 16).
- Per-thread micro-tile: RM = BM/TY and RN = BN/TX. A good choice is RM = RN = 8 so that 16×8 = 128 along M and 16×8 = 128 along N; each thread computes 8×8 outputs in registers (64 accumulators).
- Shared memory:
- As[BM][BK + PAD], Bs[BK][BN + PAD], where PAD = 1 to break modulo-32 bank conflicts along the leading dimension that warps stride over.
- Option: store the B tile transposed in shared memory (BsT[BN][BK+PAD]) to make later per-thread accesses conflict-free and cache-friendly.
Indexing and loads (row‑major, coalesced):
- A load: Each warp reads contiguous segments of a row of A’s tile. Vectorize as float4 if K and alignment allow.
- B load: Reading columns directly is not coalesced; fix by loading B tile from global row-major and writing it transposed into shared memory. Then threads read B from shared memory with row-like coalescing.
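For illustration only, one possible per-thread mapping for these loads, assuming BK = 8, 256 threads per block, 16-byte-aligned rows, and no edge tiles (blockRow, blockCol, k0, As, BsT refer to the kernel sketch in Section 6):

    int tid = threadIdx.y * blockDim.x + threadIdx.x;           // 0..255
    // A tile (128 x 8): 256 float4 loads, 32 contiguous bytes per tile row.
    int aRow = tid / 2, aCol = (tid % 2) * 4;
    float4 a4 = *reinterpret_cast<const float4*>(&A[(blockRow + aRow) * K + k0 + aCol]);
    As[aRow][aCol + 0] = a4.x;  As[aRow][aCol + 1] = a4.y;
    As[aRow][aCol + 2] = a4.z;  As[aRow][aCol + 3] = a4.w;
    // B tile (8 x 128): each warp reads one full 512-byte row of the tile (coalesced),
    // then scatters it transposed into shared memory.
    int bRow = tid / 32, bCol = (tid % 32) * 4;
    float4 b4 = *reinterpret_cast<const float4*>(&B[(k0 + bRow) * N + blockCol + bCol]);
    BsT[bCol + 0][bRow] = b4.x;  BsT[bCol + 1][bRow] = b4.y;
    BsT[bCol + 2][bRow] = b4.z;  BsT[bCol + 3][bRow] = b4.w;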
Kernel structure (annotated pseudocode):
- Shared buffers:
- __shared__ float As[BM][BK+1];
- __shared__ float BsT[BN][BK+1]; // B tile transposed on load
- Per-thread registers: float c[RM][RN] initialized to 0; float a_frag[RM]; float b_frag[RN].
- For k0 in 0..K step BK:
1) Cooperative load A[blockRow : blockRow+BM, k0 : k0+BK) → As, and B[k0 : k0+BK, blockCol : blockCol+BN) → BsT (transposed). Use vectorized float4 loads when aligned; otherwise scalar with boundary checks.
2) __syncthreads().
3) For kk in 0..BK-1 (pragma unroll):
- Load a_frag[rm] from As rows corresponding to this thread’s micro-tile rows.
- Load b_frag[rn] from BsT rows corresponding to this thread’s micro-tile cols.
- FMA: for rm in 0..RM-1, for rn in 0..RN-1: c[rm][rn] += a_frag[rm] * b_frag[rn].
4) __syncthreads().
- Store c[RM][RN] to C with coalesced, vectorized stores; predicate edges.
Avoiding bank conflicts:
- Padding BK+1 (and/or BN+1 when using transposed B) ensures that when a warp accesses consecutive elements along kk, they map to different banks.
- Alternatively, lay out As as As[BK][BM] (k-major) and store B transposed so the layout matches the warp access stride.
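A quick illustration of why the padding helps (using the simplified model of bank = word address mod 32):

    __shared__ float bad [128][8];   // row stride 8 words: 8*r mod 32 hits only 8 distinct
                                     // banks, so strided accesses across rows can conflict
    __shared__ float good[128][9];   // row stride 9 words (BK+1): 9 is coprime with 32,
                                     // so 32 consecutive rows map to 32 different banks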
Vectorization:
- Use float4 for loads/stores along the fastest-moving dimension (columns of row‑major arrays) when addresses are 16‑byte aligned and width is a multiple of 4.
- For edge tiles or misaligned pointers, fall back to scalar or masked vector loads.
Double buffering (optional, advanced):
- On architectures with cp.async, overlap global→shared loads of the next K tile with computation on the current tile, with a 2‑stage pipeline and __syncthreads() or asynchronous barriers.
----------------------------------------
Section 6 — Concrete kernel sketches
----------------------------------------
Notation: M, N, K; gridDim = (ceil_div(N, BN), ceil_div(M, BM)); blockDim = (16, 16). Thread (tx, ty) in [0,15]. Micro-tile origin in C for this thread: row = blockRow + ty*RM, col = blockCol + tx*RN.
Naive kernel (sketch):
    __global__ void sgemm_naive(int M, int N, int K, const float* A, const float* B, float* C) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
        if (i < M && j < N) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
Tiled kernel (sketch):
    // BK = 8 or 16 (compile-time constant); blockDim = (16, 16)
    __shared__ float As[128][BK + 1];     // A tile, padded against bank conflicts
    __shared__ float BsT[128][BK + 1];    // B tile, stored transposed
    float c[8][8] = {{0.0f}};
    float a_frag[8], b_frag[8];
    int blockRow = blockIdx.y * 128, blockCol = blockIdx.x * 128;
    for (int k0 = 0; k0 < K; k0 += BK) {
        // Cooperative, predicated loads (float4 where aligned, zero-fill at edges):
        //   As[r][kk]   <- A[blockRow + r][k0 + kk]
        //   BsT[cc][kk] <- B[k0 + kk][blockCol + cc]   (transposed on the way in)
        __syncthreads();
        #pragma unroll
        for (int kk = 0; kk < BK; ++kk) {
            for (int rm = 0; rm < 8; ++rm) a_frag[rm] = As[threadIdx.y * 8 + rm][kk];
            for (int rn = 0; rn < 8; ++rn) b_frag[rn] = BsT[threadIdx.x * 8 + rn][kk];
            for (int rm = 0; rm < 8; ++rm)
                for (int rn = 0; rn < 8; ++rn)
                    c[rm][rn] += a_frag[rm] * b_frag[rn];
        }
        __syncthreads();
    }
    // Write back c[rm][rn] to C[blockRow + threadIdx.y*8 + rm][blockCol + threadIdx.x*8 + rn],
    // predicated by row < M && col < N; vectorize as float4 where possible.
Edge handling:
- On loads: if indices exceed M/K/N, either set shared value to 0 or skip store into shared and later guard reads. Zero‑fill is simple and correct.
- On stores: guard each store with if (i<M && j<N). For vectorized stores, use scalar fallback for tail elements.
- Alternative: run a separate “residue” kernel for borders to keep the fast path branch-free.
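For example, the zero-fill option for one element of the A tile could look like this (a sketch; r and kk index within the tile):

    int gRow = blockRow + r;      // global row in A
    int gCol = k0 + kk;           // global column in A
    // Out-of-range elements contribute 0 to the dot product, so zero-fill is safe.
    As[r][kk] = (gRow < M && gCol < K) ? A[gRow * K + gCol] : 0.0f;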
Resource/occupancy considerations:
- Registers: 64 accumulators + temporaries per thread; ensure no spilling. Adjust RM, RN, BK to fit register budget (e.g., target ≤ 96 regs/thread on many GPUs).
- Shared memory per block: sizeof(float)*(BM*(BK+1) + BN*(BK+1)) for the two tiles; with BM=BN=128, BK=8 → ~ (128*9 + 128*9)*4B ≈ 9 KB. Fits easily.
- Blocks/SM: With 256 threads/block and modest shared memory, multiple blocks can reside per SM to keep sufficient occupancy.
----------------------------------------
Section 7 — Choosing tile sizes and launch parameters
----------------------------------------
- Start with BM=BN=128, BK=8 or 16, blockDim=(16,16), RM=RN=8.
- Tune BK: Larger BK improves arithmetic intensity but increases shared/register usage. BK=16 is a common sweet spot.
- Tune RM/RN: Larger micro-tiles increase ILP and reuse but raise register pressure.
- Vector width: Prefer 128‑bit (float4). If N or leading dimensions are not multiples of 4, use masked tails.
- Launch: grid.x = ceil_div(N, BN), grid.y = ceil_div(M, BM).
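Putting the launch parameters together, a minimal sketch (sgemm_tiled stands for the tiled kernel from Section 6):

    constexpr int BM = 128, BN = 128;
    dim3 block(16, 16);                                   // 256 threads per block
    dim3 grid((N + BN - 1) / BN, (M + BM - 1) / BM);      // ceil_div over N and M
    sgemm_tiled<<<grid, block>>>(M, N, K, dA, dB, dC);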
----------------------------------------
Section 8 — Throughput vs cuBLAS baseline
----------------------------------------
Compute GFLOP/s:
- FLOPs = 2 × M × N × K.
- Time via cudaEventElapsedTime over multiple warm runs.
- GFLOP/s = FLOPs / (1e9 × seconds).
Benchmarking steps:
- Warm-up: Run cuBLAS sgemm and your kernel several times to stabilize clocks.
- Measure multiple iterations (e.g., 50–200) and take min or median.
- For cuBLAS (column-major API) with row‑major data:
- Either exploit the identity C = (B^T × A^T)^T by calling cublasSgemm with CUBLAS_OP_N and swapped A/B arguments (see the sketch after this list), or store your matrices in column-major for both kernels to avoid ambiguity.
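A sketch of the timing loop and the column-major trick (assumes device pointers dA, dB, dC, a created cublasHandle_t handle, and cublas_v2.h):

    float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int it = 0; it < 100; ++it)
        // Row-major C = A*B computed as column-major C^T = B^T * A^T:
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, M, K, &alpha, dB, N, dA, K, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gflops = (2.0 * M * N * K * 100) / (ms * 1e-3) / 1e9;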
Expected results (very rough, architecture-dependent):
- Naive: 1–5% of peak cuBLAS.
- Tiled shared-memory with register blocking and vectorization: 50–90% of cuBLAS.
- cuBLAS: Often near architectural peak for GEMM.
----------------------------------------
Section 9 — Overlapping transfers with compute
----------------------------------------
Goal: Hide H2D and D2H latencies by pipelining copies with kernel execution.
Requirements:
- Pinned (page-locked) host memory: cudaMallocHost or cudaHostAlloc. Enables true async DMA.
- Multiple CUDA streams: At least 2 for double buffering.
- Asynchronous copies: cudaMemcpyAsync.
- For cuBLAS: set per-stream handle via cublasSetStream.
Double-buffering pattern (batches of GEMMs):
- Create two streams s0 and s1.
- For batch t:
- On stream s(t mod 2):
1) cudaMemcpyAsync dA_t ← hA_t (H2D)
2) cudaMemcpyAsync dB_t ← hB_t (H2D)
3) Launch GEMM kernel or cublasSgemm on the same stream
4) cudaMemcpyAsync hC_t ← dC_t (D2H)
- Optionally use cudaEventRecord and cudaStreamWaitEvent to handle dependencies.
- Ensure host buffers are pinned; otherwise, copies will synchronize.
Triple buffering can further smooth pipelines when compute time ≈ copy time.
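A minimal sketch of the double-buffering loop (assumes pinned host buffers hA[t], hB[t], hC[t] from cudaMallocHost and two sets of preallocated device buffers dA[0..1], dB[0..1], dC[0..1]; names are illustrative):

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);
    for (int t = 0; t < numBatches; ++t) {
        int b = t % 2;                  // buffer / stream index
        cudaStream_t s = streams[b];
        cudaMemcpyAsync(dA[b], hA[t], bytesA, cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(dB[b], hB[t], bytesB, cudaMemcpyHostToDevice, s);
        sgemm_tiled<<<grid, block, 0, s>>>(M, N, K, dA[b], dB[b], dC[b]);
        cudaMemcpyAsync(hC[t], dC[b], bytesC, cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();            // drain both streams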
----------------------------------------
Section 10 — Measuring occupancy, arithmetic intensity, and bandwidth
----------------------------------------
Occupancy:
- Analytical: cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel, blockDim.x*blockDim.y, sharedMemBytes). Compute occupancy = (numBlocks * warpsPerBlock) / maxWarpsPerSM.
- Profiling: Nsight Compute metrics like sm__warps_active.avg.pct_of_peak_sustained_active, and limiters (registers/shared memory/thread slots).
Arithmetic intensity (AI):
- Theoretical (no inter-block reuse):
- Bytes ≈ 4 × (M×K + K×N + M×N) for reading A and B once and writing C once.
- AI_theoretical ≈ (2×M×N×K) / [4×(M×K + K×N + M×N)] FLOPs/byte.
- Tiling increases reuse within a block: each A element staged in shared memory is reused across BN output columns and each B element across BM output rows, so the per-block AI rises to roughly BM×BN / [2×(BM+BN)] FLOPs/byte (≈ 32 for BM = BN = 128) once C traffic is amortized over K.
- Achieved AI (measured):
- Use Nsight Compute: dram__bytes_read.sum and dram__bytes_write.sum (or device-wide equivalent).
- FLOPs_measured = 2×M×N×K per kernel.
- AI_achieved = FLOPs_measured / (bytes_read + bytes_written).
Memory bandwidth:
- Achieved bandwidth = (bytes_read + bytes_written) / time.
- Compare to device peak (from specs). Nsight Compute also reports dram throughput metrics (e.g., dram__throughput.avg.pct_of_peak_sustained_active).
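A small host-side helper tying these formulas together (bytes_read/bytes_written come from the profiler counters above or from the 4×(M×K + K×N + M×N) estimate; names are illustrative):

    // kernel_seconds from cudaEvent timing.
    double flops     = 2.0 * M * N * K;
    double ai        = flops / (bytes_read + bytes_written);                   // FLOPs/byte
    double gflops    = flops / kernel_seconds / 1e9;                           // GFLOP/s
    double bandwidth = (bytes_read + bytes_written) / kernel_seconds / 1e9;    // GB/s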
Validation and guardrails:
- Verify numerical correctness against a CPU reference or cuBLAS with random inputs (relative error tolerance ~1e-4 to 1e-3 for FP32).
- Check for register spilling (ptxas -v reports registers; Nsight for local memory traffic).
- Ensure global loads/stores are coalesced (profile for sector requests, L2 hit rate).
- Ensure pinned memory for async copies; otherwise, cudaMemcpyAsync will block.
- Use cudaGetLastError/cudaPeekAtLastError and cudaDeviceSynchronize after development runs.
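A simple host-side check against a reference result (e.g., computed by cuBLAS and copied back into hC_ref; a sketch using <cmath>):

    double max_rel_err = 0.0;
    for (size_t i = 0; i < (size_t)M * N; ++i) {
        double ref = (double)hC_ref[i];
        double err = fabs((double)hC[i] - ref) / fmax(1.0, fabs(ref));
        if (err > max_rel_err) max_rel_err = err;
    }
    // Expect max_rel_err below roughly 1e-3 for FP32 accumulation at these sizes.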
----------------------------------------
Section 11 — Putting it together: a minimal experimentation plan
----------------------------------------
1) Implement naive kernel; benchmark GFLOP/s on M=N=K=4096.
2) Implement tiled kernel with BM=BN=128, BK=8, blockDim=(16,16), RM=RN=8; add padding to shared memory.
3) Add vectorized float4 loads/stores where aligned; fall back on edges.
4) Add #pragma unroll for the kk loop; confirm no register spilling.
5) Tune BK (8→16), then RM/RN, balancing occupancy (via occupancy API) and performance.
6) Compare to cuBLAS sgemm under identical data layout and measurement protocol.
7) Add double buffering with pinned host memory and two streams if running multiple GEMMs or streaming data.
8) Profile with Nsight Compute to report occupancy, AI, and bandwidth; use roofline view to identify bottlenecks (compute vs memory bound).
This process will take you from a functional baseline to a high-performance GEMM, while building intuition for CUDA’s execution model, memory hierarchy, and the key trade-offs in occupancy and data movement.