Explain CUDA’s execution model (grid, blocks, warps, threads), memory hierarchy (global, shared, registers, constant/texture), and occupancy. Then design an efficient single‑precision GEMM kernel C=A×B: start with a naive kernel; optimize with tiling and shared memory, coalesced loads/stores, avoiding bank conflicts, loop unrolling, and vectorized accesses. Choose grid/block sizes, handle non‑multiple tile edges, and compare throughput to a cuBLAS baseline. How would you overlap host‑device transfers with compute (streams, pinned memory) and measure achieved occupancy, arithmetic intensity, and memory bandwidth?
Quick Answer: This question evaluates proficiency in GPU programming and performance optimization, assessing understanding of the CUDA execution model, memory hierarchy and bank conflicts, occupancy limits, and practical kernel design concerns for single-precision GEMM including tiling, coalescing, vectorized memory accesses and benchmarking.
Solution
# CUDA execution model, memory hierarchy, and a high-performance SGEMM
This walkthrough assumes single-precision (FP32), **row-major** matrices with $A \in \mathbb{R}^{M\times K}$, $B \in \mathbb{R}^{K\times N}$, $C=A\times B \in \mathbb{R}^{M\times N}$, leading dimensions $\mathrm{lda}=K$, $\mathrm{ldb}=N$, $\mathrm{ldc}=N$, and a recent NVIDIA data-center GPU (Ampere/Hopper class). Total work is $\text{FLOPs} = 2MNK$.
---
## Part 1 — Execution model, memory hierarchy, occupancy
### Execution model (SIMT)
- **Thread.** Smallest execution unit; has private registers and its own program counter.
- **Warp.** The SM schedules threads in groups of **32** (a warp). SIMT means one instruction is issued for all active lanes of the warp. If lanes take different branches (`if/else`), the paths **serialize** (warp divergence) and idle lanes are masked off — divergence is always *within* a warp, never across warps.
- **Block.** A thread block runs entirely on one SM. Its threads can cooperate through **shared memory** and synchronize with `__syncthreads()` (a block-wide barrier). Block size and per-block resource use (registers, shared memory) bound how many blocks fit on an SM.
- **Grid.** All blocks of one launch form the grid. Blocks are dispatched to SMs as resources free up, with **no guaranteed ordering** across blocks — so a kernel must not depend on inter-block execution order without explicit cooperative/grid-sync mechanisms.
- **Scheduling & latency hiding.** Each SM holds many *resident* warps; every cycle the warp scheduler picks a *ready* warp to issue. Long-latency operations (global loads, etc.) are hidden by switching to other ready warps. Latency can therefore be hidden two ways: **more warps** (occupancy) *or* **more independent work per thread** (instruction-level parallelism, ILP).
### Memory hierarchy (fastest/smallest → slowest/largest)
| Space | Scope / lifetime | Notes |
|---|---|---|
| **Registers** | per-thread | Fastest. Limited count per SM; exceeding the budget **spills** to local memory (in DRAM) — a real performance cliff. |
| **Shared memory** | per-block | On-chip scratchpad, ~register-class latency, very high bandwidth. Organized in **32 banks** of 4 bytes. **Bank conflict:** when threads of a warp hit *different* addresses that map to the *same* bank, accesses serialize; same-address reads broadcast for free. Fixed by padding or changing the access stride. |
| **L1 / L2 cache** | L1 per-SM, L2 device-wide | Hardware-managed. On many archs L1 and shared memory share the same physical SRAM and can be partitioned. L2 backstops global traffic and inter-block reuse. |
| **Constant memory** | read-only, device | Small, cached, **broadcasts** efficiently when all threads of a warp read the *same* address. |
| **Texture / surface** | read-only, device | Specialized cache tuned for **2D spatial locality** and relaxed alignment. |
| **Global (device DRAM)** | device | Large, high latency. Best throughput requires **coalescing**: the 32 threads of a warp access a contiguous, aligned 32/64/128-byte segment so the hardware issues the minimum number of memory transactions. |
| **Local memory** | per-thread (in DRAM) | Not really "local" — backing store for register spills and large per-thread arrays. As slow as global. |
The two canonical hazards to call out: **uncoalesced global access** (wastes DRAM bandwidth) and **shared-memory bank conflicts** (serializes on-chip access).
### Occupancy
**Definition.** Occupancy = (active warps per SM) / (architectural max warps per SM).
**Limiters** — occupancy is the *minimum* across independent caps:
$$
\text{occ} \propto \min\!\Big(\big\lfloor\tfrac{\text{regs/SM}}{\text{regs/thread}}\big\rfloor,\ \big\lfloor\tfrac{\text{smem/SM}}{\text{smem/block}}\big\rfloor,\ \text{thread/block-slot caps}\Big).
$$
**Why higher is not always better.** Occupancy is *a* means to hide latency, not the goal. Pushing it forces fewer registers per thread, which kills the register micro-tile (less ILP, less reuse) or triggers spilling. A kernel with a fat micro-tile at ~40–50% occupancy routinely beats a thin one at 100%, because each thread already has enough independent FMAs in flight to hide latency. The right target is *sufficient* occupancy (often 25–75%) while maximizing data reuse and ILP.
**How to know:** analytically via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; empirically via Nsight Compute (`sm__warps_active.avg.pct_of_peak_sustained_active`) plus the reported limiter (registers / shared memory / block slots).
---
## Part 2 — SGEMM, from naive to high-performance
Goal: $C[M\times N] = A[M\times K]\times B[K\times N]$, FLOPs $=2MNK$.
### 2.1 Naive kernel and why it is slow
One thread per output element:
```cuda
__global__ void sgemm_naive(int M,int N,int K,
const float* A,const float* B,float* C){
int j = blockIdx.x*blockDim.x + threadIdx.x; // column
int i = blockIdx.y*blockDim.y + threadIdx.y; // row
if (i<M && j<N){
float acc = 0.f;
for (int k=0;k<K;++k) acc += A[i*K+k]*B[k*N+j]; // B access strides by N
C[i*N+j] = acc;
}
}
```
Launch: `blockDim=(16,16)`, `gridDim=(ceil(N/16),ceil(M/16))`.
**Why it is bad.** Per thread: $K$ reads from $A$, $K$ from $B$, for $2K$ FLOPs — about $0.5$ FLOP/byte of arithmetic intensity, far below the ridge point of any modern GPU, so it is hard **memory-bound**. Worse, `B[k*N+j]` walks a *column* of a row-major array: across a warp (varying `j`) it is coalesced, but across the `k`-loop each thread re-reads with stride $N$, and there is essentially no on-chip reuse. Expect **1–5% of cuBLAS**.
### 2.2 The idea: reuse via tiling and register blocking
Two levels of reuse fix the arithmetic intensity:
1. **Shared-memory tiling.** A block owns an output tile $BM\times BN$ of $C$ and marches the $K$ dimension in chunks $BK$. For each $k$-chunk it cooperatively loads an $A$ sub-tile ($BM\times BK$) and a $B$ sub-tile ($BK\times BN$) into shared memory **once**, then every thread consumes them many times.
2. **Register micro-tiling.** Each thread computes an $RM\times RN$ sub-block of $C$ (e.g. $8\times8 = 64$ accumulators held in registers). Each loaded $A$ fragment is reused $RN$ times and each $B$ fragment $RM$ times — this is what raises arithmetic intensity into the compute-bound regime.
**A solid, portable configuration** (tune per GPU): $BM=BN=128$, $BK=8$ (try 16), block = $(16,16)=256$ threads, $RM=RN=8$ so $16\times8=128$ along each of $M$ and $N$ — the whole $128\times128$ tile is covered with each thread owning an $8\times8$ micro-tile.
### 2.3 Coalescing and bank conflicts (the two access hazards)
- **Coalesced global loads.** Load $A$ and $B$ tiles so that consecutive lanes read consecutive addresses. For row-major $A$, reading along $K$ within a row is contiguous. For $B$, a *column* load is **not** coalesced — the standard remedy is to load $B$ with a coalesced row-wise pattern and **write it transposed** into shared memory (`BsT`), so each thread's later per-element reads stride cleanly.
- **Bank conflicts.** Pad the leading shared dimension so a warp's consecutive accesses land in distinct banks, e.g. `As[BM][BK+1]`, `BsT[BN][BK+1]`. The `+1` shifts the modulo-32 mapping and breaks the conflict on the dimension warps stride over. (Transposing `B` on load already aligns its access pattern; the pad protects the remaining stride.)
### 2.4 Vectorization, unrolling, edges
- **Vectorized access.** Use `float4` (128-bit) loads/stores along the fastest-moving (column) dimension when pointers and row strides are 16-byte aligned and widths are multiples of 4. This quarters the number of memory instructions and improves coalescing.
- **Unrolling.** `#pragma unroll` the inner `kk` loop (length $BK$) so the compiler schedules the $RM\times RN$ FMAs as a dense, dependency-light block — more ILP, fewer loop-overhead instructions.
- **Edges (non-multiple $M,N,K$).** On *loads*, zero-fill shared memory for out-of-range indices (simple and correct, since multiplying by zero is harmless). On *stores*, predicate with `if (i<M && j<N)` and fall back to scalar stores for the vectorized tail. An alternative for the hottest path is a branch-free fast kernel for the aligned interior plus a separate residue kernel for the borders.
### 2.5 Tiled kernel sketch
```cuda
// BM=BN=128, BK=8, RM=RN=8, block=(16,16)
__shared__ float As[BM][BK+1]; // padded
__shared__ float BsT[BN][BK+1]; // B tile stored TRANSPOSED + padded
float c[RM][RN] = {0.f}; // 64 register accumulators
float a_frag[RM], b_frag[RN];
for (int k0=0; k0<K; k0+=BK){
// cooperative, coalesced, vectorized (float4) loads with edge predication:
// As <- A[blockRow .. blockRow+BM, k0 .. k0+BK) (zero-fill OOB)
// BsT <- transpose( B[k0 .. k0+BK, blockCol .. blockCol+BN) )
__syncthreads();
#pragma unroll
for (int kk=0; kk<BK; ++kk){
#pragma unroll
for (int r=0;r<RM;++r) a_frag[r] = As [threadIdx.y*RM + r][kk];
#pragma unroll
for (int r=0;r<RN;++r) b_frag[r] = BsT[threadIdx.x*RN + r][kk];
#pragma unroll
for (int rm=0; rm<RM; ++rm)
#pragma unroll
for (int rn=0; rn<RN; ++rn)
c[rm][rn] += a_frag[rm]*b_frag[rn]; // FMA
}
__syncthreads();
}
// store c[rm][rn] to C[blockRow+ty*RM+rm][blockCol+tx*RN+rn], predicated + float4 where aligned
```
### 2.6 Launch parameters and the occupancy trade-off (item 7)
- **grid** $=(\lceil N/BN\rceil,\ \lceil M/BM\rceil)$, **block** $=(16,16)$.
- **$BK$:** larger $BK$ raises arithmetic intensity but costs more shared memory and registers; $BK=8$ or $16$ is the usual sweet spot.
- **$RM/RN$:** larger micro-tiles add ILP and reuse but raise register pressure, lowering occupancy. The $8\times8$ choice (64 accumulators) is the classic balance.
- **Resource budget check.** Per block, shared memory $= 4\,\text{B}\times\big(BM(BK{+}1) + BN(BK{+}1)\big)$. With $BM=BN=128$, $BK=8$: $4\times(128\cdot9 + 128\cdot9) = 4\times2304 \approx 9.2\ \text{KB}$ — comfortably allows several resident blocks/SM. Watch registers: ~64 accumulators + fragments + indices; aim for no spilling (verify with `ptxas -v`), often $\le 96$–$128$ regs/thread.
- **The trade-off itself:** this kernel deliberately *spends* registers (lower occupancy) to *buy* a fat micro-tile (high reuse + ILP). That is correct here because the micro-tile supplies enough independent FMAs to hide latency without needing maximal occupancy — exactly the "higher occupancy isn't always better" point from Part 1.
### 2.7 cuBLAS baseline (item 9)
Compute throughput as $\text{GFLOP/s} = \dfrac{2MNK}{10^9\cdot t}$, with $t$ from CUDA events over warm, repeated runs (min or median of, say, 50–200 iterations, after warm-up to stabilize clocks).
**The layout pitfall.** cuBLAS is **column-major**. To multiply row-major matrices with `cublasSgemm` correctly, exploit $C_{\text{row}} = (B^{\top}\!\cdot A^{\top})^{\top}$: call cuBLAS as if computing $C^{\top}=B^{\top}A^{\top}$ by swapping the $A$/$B$ arguments and passing the row-major dimensions/leading dims accordingly (no actual transpose op needed — it falls out of how column-major reinterprets your row-major buffers). The cleanest way to avoid ambiguity is to fix one layout for *both* your kernel and cuBLAS and validate numerically against it.
**Rough expectations (architecture-dependent):**
| Kernel | % of cuBLAS |
|---|---|
| Naive | 1–5% |
| Shared-mem tiled, *no* register blocking | ~20–40% |
| Tiled + $8\times8$ register blocking + vectorization + unroll | ~50–90% |
| cuBLAS | near GEMM peak |
---
## Part 3 — Transfer/compute overlap and measurement
### 3.1 Overlapping host↔device transfers with compute
The aim is to hide H2D/D2H copies behind kernel execution. Requirements:
- **Pinned (page-locked) host memory** (`cudaMallocHost` / `cudaHostAlloc`). Without it, `cudaMemcpyAsync` **silently synchronizes** — a pageable copy cannot DMA-overlap. This is the single most common reason "async" code shows no overlap.
- **Multiple CUDA streams** — at least 2 for double buffering.
- **Async copies** — `cudaMemcpyAsync` on a non-default stream.
- **cuBLAS path** — bind the handle to the stream with `cublasSetStream` so library GEMMs participate in the pipeline.
**Double-buffering pattern** (e.g. a sequence/batch of GEMMs):
```text
create streams s[0], s[1]
for tile t:
s = streams[t & 1]
cudaMemcpyAsync(dA_t, hA_t, H2D, s) // hA_t, hB_t pinned
cudaMemcpyAsync(dB_t, hB_t, H2D, s)
launch GEMM(dA_t, dB_t, dC_t) on s // or cublasSgemm with cublasSetStream(s)
cudaMemcpyAsync(hC_t, dC_t, D2H, s)
```
While stream `s[0]` computes tile $t$, stream `s[1]` copies tile $t{+}1$. Use `cudaEventRecord` + `cudaStreamWaitEvent` for cross-stream dependencies. **Triple buffering** smooths the pipeline further when copy time ≈ compute time.
A separate, complementary technique (Part-3 follow-up) is `cp.async` to overlap global→shared loads with FMAs *inside* the kernel via a multi-stage shared-memory pipeline — that hides DRAM latency, distinct from hiding the host-transfer latency above.
### 3.2 Measuring occupancy, arithmetic intensity, bandwidth
**Occupancy.**
- Analytical: `cudaOccupancyMaxActiveBlocksPerMultiprocessor(&nBlocks, kernel, threadsPerBlock, smemBytes)`, then $\text{occ} = (n_{\text{Blocks}}\cdot \text{warps/block}) / \text{maxWarps/SM}$.
- Measured: Nsight Compute `sm__warps_active.avg.pct_of_peak_sustained_active`, plus the reported limiter.
**Arithmetic intensity (AI = FLOPs/byte).**
- Theoretical lower bound (each operand read once, $C$ written once):
$$
\text{AI}_{\text{theory}} = \frac{2MNK}{4\,(MK + KN + MN)}\ \ \text{FLOP/byte}.
$$
For large square matrices this grows with $K$ — tiling realizes the reuse that approaches this bound, whereas the naive kernel sits far below it.
- Achieved (the honest number): $\text{AI}_{\text{achieved}} = \dfrac{2MNK}{\text{bytes\_read}+\text{bytes\_written}}$, with bytes from Nsight Compute `dram__bytes_read.sum` + `dram__bytes_write.sum`. Caches and reuse make achieved ≠ theoretical, which is exactly why you measure.
**Memory bandwidth.**
$$
\text{BW}_{\text{achieved}} = \frac{\text{bytes\_read}+\text{bytes\_written}}{t},
$$
compared to device spec peak; Nsight also reports `dram__throughput.avg.pct_of_peak_sustained_active`.
**Roofline framing.** Plot AI on the roofline: if your point sits under the *memory* slope you are bandwidth-bound (improve reuse/coalescing); under the *flat* compute ceiling you are compute-bound (improve ILP/FMA scheduling, consider tensor cores).
**Correctness guardrails.** Validate vs cuBLAS or a CPU reference on random inputs (FP32 relative tolerance $10^{-4}$–$10^{-3}$); check `ptxas -v` and Nsight local-memory traffic for spills; confirm coalescing via sector/transaction counters and L2 hit rate; always `cudaGetLastError()` + `cudaDeviceSynchronize()` during bring-up.
---
## Answers to the follow-up questions
**1. High occupancy, low DRAM bandwidth, modest compute — latency-bound?**
This signature (neither memory- nor compute-bound at peak, yet warps are resident) points to **latency/issue-bound** execution — warps are stalling on dependencies or short-scoreboard waits faster than the scheduler can hide. In Nsight Compute, read the **warp stall reasons** (e.g. *long scoreboard* = waiting on global loads, *short scoreboard* = shared-memory/MIO, *wait* = fixed-latency math). First moves: increase ILP per thread (a bigger micro-tile / more independent FMAs in flight), software-pipeline the shared-memory loads (`cp.async`, double-buffered tiles) so compute and the next load overlap, and check for **bank conflicts** if short-scoreboard stalls dominate. Only raise occupancy if the limiter analysis says register/smem pressure is the bottleneck — here it is not.
**2. Tensor cores (TF32/FP16, FP32 accumulate).**
Tensor cores compute small fixed-shape matrix-multiply-accumulates per warp (via WMMA / `mma` PTX or, better, CUTLASS). The tiling reorganizes around the MMA tile and **warp-level** tiles rather than per-thread $8\times8$ register blocks, and operands must be laid out in shared memory in the specific fragment layout the MMA expects (often requiring `ldmatrix` and swizzled, conflict-free shared layouts). The accuracy story changes: **TF32** truncates FP32 inputs to ~10-bit mantissa but **accumulates in FP32**, so it is far more accurate than FP16 inputs while delivering a large throughput jump; **FP16/BF16** inputs with FP32 accumulate trade more input precision for even more throughput. Expected payoff: multiple-× over CUDA-core FMAs — for FP32-accumulate GEMM this is usually the real way to approach peak.
**3. Many small batched GEMMs ($128^3 \times 1000$).**
A big-tile kernel sized for one large GEMM under-utilizes the GPU here: a single $128\times128$ problem is just *one* block tile, so per-GEMM you launch very few blocks and cannot fill all SMs, and kernel-launch / setup overhead is amortized over tiny work. Fixes: **batched GEMM** (`cublasSgemmBatched` / `cublasSgemmStridedBatched`) which packs many problems into one launch and keeps all SMs busy; a **persistent/grid-stride** kernel that maps the batch across the whole grid; or **grouped GEMM** (CUTLASS) for ragged batches. Smaller tile shapes ($BM,BN$ down to ~$32$–$64$) also fit the per-problem size better.
**4. `cp.async` multi-stage in-kernel pipeline.**
`cp.async` issues a global→shared copy that bypasses the register file and proceeds asynchronously, so the FMA loop on the *current* tile overlaps the load of the *next* (and next-next) tile. Implement an $N$-stage ring of shared-memory buffers with `cp.async.commit_group` / `cp.async.wait_group` (or `cuda::pipeline`) gating stage availability. New hazards: (a) **shared-memory pressure** — $N$ stages multiply the smem footprint, lowering occupancy; (b) **synchronization correctness** — you must wait on the right group before reading a stage, or you read stale/partially-copied data; (c) extra **registers/barriers** for the pipeline state. The win is hiding global-load latency *within* the kernel, which is what closes much of the remaining gap to cuBLAS on the FP32 CUDA-core path.