PracHub
QuestionsCoachesLearningGuidesInterview Prep
|Home/Coding & Algorithms/NVIDIA

Optimize CUDA GEMM with tiling and coalescing

Last updated: Jun 24, 2026

Quick Overview

This question evaluates proficiency in GPU programming and performance optimization, assessing understanding of the CUDA execution model, memory hierarchy and bank conflicts, occupancy limits, and practical kernel design concerns for single-precision GEMM including tiling, coalescing, vectorized memory accesses and benchmarking.

  • hard
  • NVIDIA
  • Coding & Algorithms
  • Data Scientist

Optimize CUDA GEMM with tiling and coalescing

Company: NVIDIA

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: hard

Interview Round: HR Screen

Explain CUDA’s execution model (grid, blocks, warps, threads), memory hierarchy (global, shared, registers, constant/texture), and occupancy. Then design an efficient single‑precision GEMM kernel C=A×B: start with a naive kernel; optimize with tiling and shared memory, coalesced loads/stores, avoiding bank conflicts, loop unrolling, and vectorized accesses. Choose grid/block sizes, handle non‑multiple tile edges, and compare throughput to a cuBLAS baseline. How would you overlap host‑device transfers with compute (streams, pinned memory) and measure achieved occupancy, arithmetic intensity, and memory bandwidth?

Quick Answer: This question evaluates proficiency in GPU programming and performance optimization, assessing understanding of the CUDA execution model, memory hierarchy and bank conflicts, occupancy limits, and practical kernel design concerns for single-precision GEMM including tiling, coalescing, vectorized memory accesses and benchmarking.

Solution

# CUDA execution model, memory hierarchy, and a high-performance SGEMM This walkthrough assumes single-precision (FP32), **row-major** matrices with $A \in \mathbb{R}^{M\times K}$, $B \in \mathbb{R}^{K\times N}$, $C=A\times B \in \mathbb{R}^{M\times N}$, leading dimensions $\mathrm{lda}=K$, $\mathrm{ldb}=N$, $\mathrm{ldc}=N$, and a recent NVIDIA data-center GPU (Ampere/Hopper class). Total work is $\text{FLOPs} = 2MNK$. --- ## Part 1 — Execution model, memory hierarchy, occupancy ### Execution model (SIMT) - **Thread.** Smallest execution unit; has private registers and its own program counter. - **Warp.** The SM schedules threads in groups of **32** (a warp). SIMT means one instruction is issued for all active lanes of the warp. If lanes take different branches (`if/else`), the paths **serialize** (warp divergence) and idle lanes are masked off — divergence is always *within* a warp, never across warps. - **Block.** A thread block runs entirely on one SM. Its threads can cooperate through **shared memory** and synchronize with `__syncthreads()` (a block-wide barrier). Block size and per-block resource use (registers, shared memory) bound how many blocks fit on an SM. - **Grid.** All blocks of one launch form the grid. Blocks are dispatched to SMs as resources free up, with **no guaranteed ordering** across blocks — so a kernel must not depend on inter-block execution order without explicit cooperative/grid-sync mechanisms. - **Scheduling & latency hiding.** Each SM holds many *resident* warps; every cycle the warp scheduler picks a *ready* warp to issue. Long-latency operations (global loads, etc.) are hidden by switching to other ready warps. Latency can therefore be hidden two ways: **more warps** (occupancy) *or* **more independent work per thread** (instruction-level parallelism, ILP). ### Memory hierarchy (fastest/smallest → slowest/largest) | Space | Scope / lifetime | Notes | |---|---|---| | **Registers** | per-thread | Fastest. Limited count per SM; exceeding the budget **spills** to local memory (in DRAM) — a real performance cliff. | | **Shared memory** | per-block | On-chip scratchpad, ~register-class latency, very high bandwidth. Organized in **32 banks** of 4 bytes. **Bank conflict:** when threads of a warp hit *different* addresses that map to the *same* bank, accesses serialize; same-address reads broadcast for free. Fixed by padding or changing the access stride. | | **L1 / L2 cache** | L1 per-SM, L2 device-wide | Hardware-managed. On many archs L1 and shared memory share the same physical SRAM and can be partitioned. L2 backstops global traffic and inter-block reuse. | | **Constant memory** | read-only, device | Small, cached, **broadcasts** efficiently when all threads of a warp read the *same* address. | | **Texture / surface** | read-only, device | Specialized cache tuned for **2D spatial locality** and relaxed alignment. | | **Global (device DRAM)** | device | Large, high latency. Best throughput requires **coalescing**: the 32 threads of a warp access a contiguous, aligned 32/64/128-byte segment so the hardware issues the minimum number of memory transactions. | | **Local memory** | per-thread (in DRAM) | Not really "local" — backing store for register spills and large per-thread arrays. As slow as global. | The two canonical hazards to call out: **uncoalesced global access** (wastes DRAM bandwidth) and **shared-memory bank conflicts** (serializes on-chip access). ### Occupancy **Definition.** Occupancy = (active warps per SM) / (architectural max warps per SM). **Limiters** — occupancy is the *minimum* across independent caps: $$ \text{occ} \propto \min\!\Big(\big\lfloor\tfrac{\text{regs/SM}}{\text{regs/thread}}\big\rfloor,\ \big\lfloor\tfrac{\text{smem/SM}}{\text{smem/block}}\big\rfloor,\ \text{thread/block-slot caps}\Big). $$ **Why higher is not always better.** Occupancy is *a* means to hide latency, not the goal. Pushing it forces fewer registers per thread, which kills the register micro-tile (less ILP, less reuse) or triggers spilling. A kernel with a fat micro-tile at ~40–50% occupancy routinely beats a thin one at 100%, because each thread already has enough independent FMAs in flight to hide latency. The right target is *sufficient* occupancy (often 25–75%) while maximizing data reuse and ILP. **How to know:** analytically via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; empirically via Nsight Compute (`sm__warps_active.avg.pct_of_peak_sustained_active`) plus the reported limiter (registers / shared memory / block slots). --- ## Part 2 — SGEMM, from naive to high-performance Goal: $C[M\times N] = A[M\times K]\times B[K\times N]$, FLOPs $=2MNK$. ### 2.1 Naive kernel and why it is slow One thread per output element: ```cuda __global__ void sgemm_naive(int M,int N,int K, const float* A,const float* B,float* C){ int j = blockIdx.x*blockDim.x + threadIdx.x; // column int i = blockIdx.y*blockDim.y + threadIdx.y; // row if (i<M && j<N){ float acc = 0.f; for (int k=0;k<K;++k) acc += A[i*K+k]*B[k*N+j]; // B access strides by N C[i*N+j] = acc; } } ``` Launch: `blockDim=(16,16)`, `gridDim=(ceil(N/16),ceil(M/16))`. **Why it is bad.** Per thread: $K$ reads from $A$, $K$ from $B$, for $2K$ FLOPs — about $0.5$ FLOP/byte of arithmetic intensity, far below the ridge point of any modern GPU, so it is hard **memory-bound**. Worse, `B[k*N+j]` walks a *column* of a row-major array: across a warp (varying `j`) it is coalesced, but across the `k`-loop each thread re-reads with stride $N$, and there is essentially no on-chip reuse. Expect **1–5% of cuBLAS**. ### 2.2 The idea: reuse via tiling and register blocking Two levels of reuse fix the arithmetic intensity: 1. **Shared-memory tiling.** A block owns an output tile $BM\times BN$ of $C$ and marches the $K$ dimension in chunks $BK$. For each $k$-chunk it cooperatively loads an $A$ sub-tile ($BM\times BK$) and a $B$ sub-tile ($BK\times BN$) into shared memory **once**, then every thread consumes them many times. 2. **Register micro-tiling.** Each thread computes an $RM\times RN$ sub-block of $C$ (e.g. $8\times8 = 64$ accumulators held in registers). Each loaded $A$ fragment is reused $RN$ times and each $B$ fragment $RM$ times — this is what raises arithmetic intensity into the compute-bound regime. **A solid, portable configuration** (tune per GPU): $BM=BN=128$, $BK=8$ (try 16), block = $(16,16)=256$ threads, $RM=RN=8$ so $16\times8=128$ along each of $M$ and $N$ — the whole $128\times128$ tile is covered with each thread owning an $8\times8$ micro-tile. ### 2.3 Coalescing and bank conflicts (the two access hazards) - **Coalesced global loads.** Load $A$ and $B$ tiles so that consecutive lanes read consecutive addresses. For row-major $A$, reading along $K$ within a row is contiguous. For $B$, a *column* load is **not** coalesced — the standard remedy is to load $B$ with a coalesced row-wise pattern and **write it transposed** into shared memory (`BsT`), so each thread's later per-element reads stride cleanly. - **Bank conflicts.** Pad the leading shared dimension so a warp's consecutive accesses land in distinct banks, e.g. `As[BM][BK+1]`, `BsT[BN][BK+1]`. The `+1` shifts the modulo-32 mapping and breaks the conflict on the dimension warps stride over. (Transposing `B` on load already aligns its access pattern; the pad protects the remaining stride.) ### 2.4 Vectorization, unrolling, edges - **Vectorized access.** Use `float4` (128-bit) loads/stores along the fastest-moving (column) dimension when pointers and row strides are 16-byte aligned and widths are multiples of 4. This quarters the number of memory instructions and improves coalescing. - **Unrolling.** `#pragma unroll` the inner `kk` loop (length $BK$) so the compiler schedules the $RM\times RN$ FMAs as a dense, dependency-light block — more ILP, fewer loop-overhead instructions. - **Edges (non-multiple $M,N,K$).** On *loads*, zero-fill shared memory for out-of-range indices (simple and correct, since multiplying by zero is harmless). On *stores*, predicate with `if (i<M && j<N)` and fall back to scalar stores for the vectorized tail. An alternative for the hottest path is a branch-free fast kernel for the aligned interior plus a separate residue kernel for the borders. ### 2.5 Tiled kernel sketch ```cuda // BM=BN=128, BK=8, RM=RN=8, block=(16,16) __shared__ float As[BM][BK+1]; // padded __shared__ float BsT[BN][BK+1]; // B tile stored TRANSPOSED + padded float c[RM][RN] = {0.f}; // 64 register accumulators float a_frag[RM], b_frag[RN]; for (int k0=0; k0<K; k0+=BK){ // cooperative, coalesced, vectorized (float4) loads with edge predication: // As <- A[blockRow .. blockRow+BM, k0 .. k0+BK) (zero-fill OOB) // BsT <- transpose( B[k0 .. k0+BK, blockCol .. blockCol+BN) ) __syncthreads(); #pragma unroll for (int kk=0; kk<BK; ++kk){ #pragma unroll for (int r=0;r<RM;++r) a_frag[r] = As [threadIdx.y*RM + r][kk]; #pragma unroll for (int r=0;r<RN;++r) b_frag[r] = BsT[threadIdx.x*RN + r][kk]; #pragma unroll for (int rm=0; rm<RM; ++rm) #pragma unroll for (int rn=0; rn<RN; ++rn) c[rm][rn] += a_frag[rm]*b_frag[rn]; // FMA } __syncthreads(); } // store c[rm][rn] to C[blockRow+ty*RM+rm][blockCol+tx*RN+rn], predicated + float4 where aligned ``` ### 2.6 Launch parameters and the occupancy trade-off (item 7) - **grid** $=(\lceil N/BN\rceil,\ \lceil M/BM\rceil)$, **block** $=(16,16)$. - **$BK$:** larger $BK$ raises arithmetic intensity but costs more shared memory and registers; $BK=8$ or $16$ is the usual sweet spot. - **$RM/RN$:** larger micro-tiles add ILP and reuse but raise register pressure, lowering occupancy. The $8\times8$ choice (64 accumulators) is the classic balance. - **Resource budget check.** Per block, shared memory $= 4\,\text{B}\times\big(BM(BK{+}1) + BN(BK{+}1)\big)$. With $BM=BN=128$, $BK=8$: $4\times(128\cdot9 + 128\cdot9) = 4\times2304 \approx 9.2\ \text{KB}$ — comfortably allows several resident blocks/SM. Watch registers: ~64 accumulators + fragments + indices; aim for no spilling (verify with `ptxas -v`), often $\le 96$–$128$ regs/thread. - **The trade-off itself:** this kernel deliberately *spends* registers (lower occupancy) to *buy* a fat micro-tile (high reuse + ILP). That is correct here because the micro-tile supplies enough independent FMAs to hide latency without needing maximal occupancy — exactly the "higher occupancy isn't always better" point from Part 1. ### 2.7 cuBLAS baseline (item 9) Compute throughput as $\text{GFLOP/s} = \dfrac{2MNK}{10^9\cdot t}$, with $t$ from CUDA events over warm, repeated runs (min or median of, say, 50–200 iterations, after warm-up to stabilize clocks). **The layout pitfall.** cuBLAS is **column-major**. To multiply row-major matrices with `cublasSgemm` correctly, exploit $C_{\text{row}} = (B^{\top}\!\cdot A^{\top})^{\top}$: call cuBLAS as if computing $C^{\top}=B^{\top}A^{\top}$ by swapping the $A$/$B$ arguments and passing the row-major dimensions/leading dims accordingly (no actual transpose op needed — it falls out of how column-major reinterprets your row-major buffers). The cleanest way to avoid ambiguity is to fix one layout for *both* your kernel and cuBLAS and validate numerically against it. **Rough expectations (architecture-dependent):** | Kernel | % of cuBLAS | |---|---| | Naive | 1–5% | | Shared-mem tiled, *no* register blocking | ~20–40% | | Tiled + $8\times8$ register blocking + vectorization + unroll | ~50–90% | | cuBLAS | near GEMM peak | --- ## Part 3 — Transfer/compute overlap and measurement ### 3.1 Overlapping host↔device transfers with compute The aim is to hide H2D/D2H copies behind kernel execution. Requirements: - **Pinned (page-locked) host memory** (`cudaMallocHost` / `cudaHostAlloc`). Without it, `cudaMemcpyAsync` **silently synchronizes** — a pageable copy cannot DMA-overlap. This is the single most common reason "async" code shows no overlap. - **Multiple CUDA streams** — at least 2 for double buffering. - **Async copies** — `cudaMemcpyAsync` on a non-default stream. - **cuBLAS path** — bind the handle to the stream with `cublasSetStream` so library GEMMs participate in the pipeline. **Double-buffering pattern** (e.g. a sequence/batch of GEMMs): ```text create streams s[0], s[1] for tile t: s = streams[t & 1] cudaMemcpyAsync(dA_t, hA_t, H2D, s) // hA_t, hB_t pinned cudaMemcpyAsync(dB_t, hB_t, H2D, s) launch GEMM(dA_t, dB_t, dC_t) on s // or cublasSgemm with cublasSetStream(s) cudaMemcpyAsync(hC_t, dC_t, D2H, s) ``` While stream `s[0]` computes tile $t$, stream `s[1]` copies tile $t{+}1$. Use `cudaEventRecord` + `cudaStreamWaitEvent` for cross-stream dependencies. **Triple buffering** smooths the pipeline further when copy time ≈ compute time. A separate, complementary technique (Part-3 follow-up) is `cp.async` to overlap global→shared loads with FMAs *inside* the kernel via a multi-stage shared-memory pipeline — that hides DRAM latency, distinct from hiding the host-transfer latency above. ### 3.2 Measuring occupancy, arithmetic intensity, bandwidth **Occupancy.** - Analytical: `cudaOccupancyMaxActiveBlocksPerMultiprocessor(&nBlocks, kernel, threadsPerBlock, smemBytes)`, then $\text{occ} = (n_{\text{Blocks}}\cdot \text{warps/block}) / \text{maxWarps/SM}$. - Measured: Nsight Compute `sm__warps_active.avg.pct_of_peak_sustained_active`, plus the reported limiter. **Arithmetic intensity (AI = FLOPs/byte).** - Theoretical lower bound (each operand read once, $C$ written once): $$ \text{AI}_{\text{theory}} = \frac{2MNK}{4\,(MK + KN + MN)}\ \ \text{FLOP/byte}. $$ For large square matrices this grows with $K$ — tiling realizes the reuse that approaches this bound, whereas the naive kernel sits far below it. - Achieved (the honest number): $\text{AI}_{\text{achieved}} = \dfrac{2MNK}{\text{bytes\_read}+\text{bytes\_written}}$, with bytes from Nsight Compute `dram__bytes_read.sum` + `dram__bytes_write.sum`. Caches and reuse make achieved ≠ theoretical, which is exactly why you measure. **Memory bandwidth.** $$ \text{BW}_{\text{achieved}} = \frac{\text{bytes\_read}+\text{bytes\_written}}{t}, $$ compared to device spec peak; Nsight also reports `dram__throughput.avg.pct_of_peak_sustained_active`. **Roofline framing.** Plot AI on the roofline: if your point sits under the *memory* slope you are bandwidth-bound (improve reuse/coalescing); under the *flat* compute ceiling you are compute-bound (improve ILP/FMA scheduling, consider tensor cores). **Correctness guardrails.** Validate vs cuBLAS or a CPU reference on random inputs (FP32 relative tolerance $10^{-4}$–$10^{-3}$); check `ptxas -v` and Nsight local-memory traffic for spills; confirm coalescing via sector/transaction counters and L2 hit rate; always `cudaGetLastError()` + `cudaDeviceSynchronize()` during bring-up. --- ## Answers to the follow-up questions **1. High occupancy, low DRAM bandwidth, modest compute — latency-bound?** This signature (neither memory- nor compute-bound at peak, yet warps are resident) points to **latency/issue-bound** execution — warps are stalling on dependencies or short-scoreboard waits faster than the scheduler can hide. In Nsight Compute, read the **warp stall reasons** (e.g. *long scoreboard* = waiting on global loads, *short scoreboard* = shared-memory/MIO, *wait* = fixed-latency math). First moves: increase ILP per thread (a bigger micro-tile / more independent FMAs in flight), software-pipeline the shared-memory loads (`cp.async`, double-buffered tiles) so compute and the next load overlap, and check for **bank conflicts** if short-scoreboard stalls dominate. Only raise occupancy if the limiter analysis says register/smem pressure is the bottleneck — here it is not. **2. Tensor cores (TF32/FP16, FP32 accumulate).** Tensor cores compute small fixed-shape matrix-multiply-accumulates per warp (via WMMA / `mma` PTX or, better, CUTLASS). The tiling reorganizes around the MMA tile and **warp-level** tiles rather than per-thread $8\times8$ register blocks, and operands must be laid out in shared memory in the specific fragment layout the MMA expects (often requiring `ldmatrix` and swizzled, conflict-free shared layouts). The accuracy story changes: **TF32** truncates FP32 inputs to ~10-bit mantissa but **accumulates in FP32**, so it is far more accurate than FP16 inputs while delivering a large throughput jump; **FP16/BF16** inputs with FP32 accumulate trade more input precision for even more throughput. Expected payoff: multiple-× over CUDA-core FMAs — for FP32-accumulate GEMM this is usually the real way to approach peak. **3. Many small batched GEMMs ($128^3 \times 1000$).** A big-tile kernel sized for one large GEMM under-utilizes the GPU here: a single $128\times128$ problem is just *one* block tile, so per-GEMM you launch very few blocks and cannot fill all SMs, and kernel-launch / setup overhead is amortized over tiny work. Fixes: **batched GEMM** (`cublasSgemmBatched` / `cublasSgemmStridedBatched`) which packs many problems into one launch and keeps all SMs busy; a **persistent/grid-stride** kernel that maps the batch across the whole grid; or **grouped GEMM** (CUTLASS) for ragged batches. Smaller tile shapes ($BM,BN$ down to ~$32$–$64$) also fit the per-problem size better. **4. `cp.async` multi-stage in-kernel pipeline.** `cp.async` issues a global→shared copy that bypasses the register file and proceeds asynchronously, so the FMA loop on the *current* tile overlaps the load of the *next* (and next-next) tile. Implement an $N$-stage ring of shared-memory buffers with `cp.async.commit_group` / `cp.async.wait_group` (or `cuda::pipeline`) gating stage availability. New hazards: (a) **shared-memory pressure** — $N$ stages multiply the smem footprint, lowering occupancy; (b) **synchronization correctness** — you must wait on the right group before reading a stage, or you read stale/partially-copied data; (c) extra **registers/barriers** for the pipeline state. The win is hiding global-load latency *within* the kernel, which is what closes much of the remaining gap to cuBLAS on the FP32 CUDA-core path.

Related Interview Questions

  • Compute the Final Robot Score - NVIDIA (easy)
  • Return all file paths via DFS - NVIDIA (easy)
  • Implement a disk space manager with eviction - NVIDIA (medium)
  • Implement short algorithms on logs, grids, and strings - NVIDIA (hard)
  • Implement encode/decode for list of strings - NVIDIA (easy)
|Home/Coding & Algorithms/NVIDIA

Optimize CUDA GEMM with tiling and coalescing

NVIDIA logo
NVIDIA
Oct 13, 2025, 9:49 PM
hardData ScientistHR ScreenCoding & Algorithms
8
0

CUDA Execution Model, Memory Hierarchy, and GEMM Kernel Design

You are interviewing for a Data Scientist / GPU-software role at NVIDIA. The interviewer wants to see that you understand how the GPU actually runs your code and that you can turn that understanding into a high-performance kernel. The exercise builds from CUDA fundamentals to a hand-optimized single-precision matrix-multiply (SGEMM), then to streaming and measurement.

The matrices are single-precision (FP32), row-major, with A∈RM×KA \in \mathbb{R}^{M \times K}A∈RM×K, B∈RK×NB \in \mathbb{R}^{K \times N}B∈RK×N, and C=A×B∈RM×NC = A \times B \in \mathbb{R}^{M \times N}C=A×B∈RM×N. Unless you state otherwise, assume leading dimensions lda=K\mathrm{lda}=Klda=K, ldb=N\mathrm{ldb}=Nldb=N, ldc=N\mathrm{ldc}=Nldc=N.

Constraints & Assumptions

  • Target a recent NVIDIA data-center GPU (Ampere/Hopper class). You may name the specific feature set you assume (e.g. 32-thread warps, 32 shared-memory banks of 4 bytes, cp.async , tensor cores you are choosing not to use here since this is an FP32 CUDA-core exercise).
  • Total floating-point work is fixed: FLOPs=2⋅M⋅N⋅K\text{FLOPs} = 2 \cdot M \cdot N \cdot KFLOPs=2⋅M⋅N⋅K (one multiply + one add per inner-product term).
  • You are writing the kernel by hand in CUDA C++ (CUDA-core FMAs), not calling a library — except cuBLAS, which is used only as the baseline to beat/approach .
  • Correctness target: relative error vs a reference of roughly 10−410^{-4}10−4 – 10−310^{-3}10−3 is acceptable for FP32 accumulation.

Clarifying Questions to Ask

  • What are the representative matrix shapes — large square ( M=N=KM=N=KM=N=K , e.g. 4096), tall/skinny, or many small batched GEMMs? The optimal tiling and whether transfers dominate depend heavily on this.
  • Which GPU architecture and compute capability is the target? It determines warp size assumptions, shared-memory capacity, register file size, and whether cp.async / async barriers are available.
  • Are the matrices already resident in device memory, or must they be streamed from host each call? This decides whether Part 3 (transfer overlap) is on the critical path at all.
  • What is the success metric — peak GFLOP/s on one shape, a fraction of cuBLAS, or end-to-end wall-clock including transfers?
  • Is FP32 accumulation required, or is TF32 / mixed precision acceptable? (It changes both accuracy expectations and the achievable peak.)

Part 1: CUDA execution model, memory hierarchy, and occupancy

Explain, at a level that shows you understand the hardware mapping:

  • Execution model: grid, blocks, warps, threads, and the SIMT model. How threads map to an SM, how warps are scheduled, what causes warp divergence, and how block-level synchronization works.
  • Memory hierarchy: registers, shared memory (including bank conflicts), global/device memory (including coalescing), L1/L2 caches, constant and texture memory, and local memory (register spill). State the latency/bandwidth/scope trade-off of each.
  • Occupancy: define it precisely, list what limits it (registers/thread, shared memory/block, threads/block, block-slots/SM), and explain why higher occupancy is not always faster.

What This Part Should Cover

  • Correct hardware mapping: threads→lanes, warps→scheduler, blocks→SM, and why there is no guaranteed inter-block ordering.
  • A precise, ranked picture of the memory hierarchy (latency, bandwidth, scope/lifetime) and the two canonical hazards: shared-memory bank conflicts and uncoalesced global access.
  • A correct definition of occupancy as a ratio of active warps to the architectural max, its three limiters, and the nuanced argument that latency hiding can come from ILP as well as occupancy.

Part 2: Single-precision GEMM kernel (C=A×BC = A \times BC=A×B)

Design and iterate an efficient SGEMM kernel for row-major matrices. Walk through the optimization ladder:

  1. A naive kernel where each thread computes one C[i,j]C[i,j]C[i,j] , and why it is slow (quantify the arithmetic intensity).
  2. Tiling with shared memory (block tiles in M,NM,NM,N ; a KKK -loop in chunks of BKBKBK ).
  3. Coalesced global loads/stores.
  4. Avoiding shared-memory bank conflicts.
  5. Inner-loop unrolling.
  6. Vectorized loads/stores (e.g. float4 ) with alignment handling.
  7. Choosing grid/block sizes, tile shapes, and a per-thread register micro-tile; explain the register/shared-memory-pressure vs occupancy trade-off.
  8. Handling edges where MMM , NNN , or KKK are not multiples of the tile sizes.
  9. Comparing achieved throughput (GFLOP/s) to a cuBLAS SGEMM baseline, including the layout pitfall.

Clarifying Questions for this Part

  • Should the kernel be general-purpose (arbitrary M,N,KM,N,KM,N,K including odd tails) or may I assume sizes are multiples of the tile for the fast path and handle the residue separately?
  • Is float4 alignment guaranteed (pointers and row strides multiples of 4), or must I support misaligned/odd leading dimensions with scalar tails?

What This Part Should Cover

  • The optimization ladder with a reason at each rung (naive → shared-memory tiling → register/micro-tiling → vectorization → unrolling), and a quantitative arithmetic-intensity argument for why the naive version is bandwidth-bound.
  • Concrete, correct mechanics: coalesced loads, transposed- B /padded shared memory to kill bank conflicts, float4 vectorization with a scalar fallback, and predicated edge handling.
  • A defensible launch configuration (tile sizes, block dims, micro-tile) with the register/shared-memory vs occupancy trade-off articulated, plus a correct cuBLAS comparison protocol (FLOP count, timing, and the column-major layout pitfall).

Part 3: Overlapping transfers with compute, and measurement

  • Overlap: how to hide host↔device transfer latency behind compute — pinned (page-locked) host memory, multiple CUDA streams, cudaMemcpyAsync , double/triple buffering, and cublasSetStream for the library path.
  • Measurement: how to measure achieved occupancy, arithmetic intensity (FLOPs/byte), and memory bandwidth. Give the formulas and name the practical tools/metrics.

What This Part Should Cover

  • A correct, dependency-safe overlap pipeline: pinned memory, ≥2 streams, async copies, and the pitfall that pageable memory defeats the whole thing.
  • The three measurement formulas (occupancy as active/max warps; AI =FLOPs/bytes= \text{FLOPs}/\text{bytes}=FLOPs/bytes ; bandwidth =bytes/time= \text{bytes}/\text{time}=bytes/time ) and the right tools to obtain each (occupancy API / Nsight Compute metrics, CUDA events).
  • Awareness of the roofline framing: whether the kernel is compute- or memory-bound and what to do about each.

What a Strong Answer Covers

Across all three parts, a strong candidate connects hardware → kernel design → measurement as one chain rather than three disconnected topics. The cross-cutting dimensions:

  • Causal reasoning, not recitation: every optimization in Part 2 is justified by a Part 1 mechanism (coalescing↔warp access, bank padding↔32 banks, micro-tiling↔occupancy/ILP), and every claim in Parts 2–3 is something they would verify with the Part 3 tools rather than assert.
  • Quantification: arithmetic intensity, FLOP/byte, occupancy ratios, and a realistic "% of cuBLAS" expectation — numbers, not adjectives.
  • Correctness and edge-safety: FP32 accuracy expectations, predicated/residue handling for non-multiple sizes, and validation against cuBLAS or a CPU reference.
  • Honest trade-off judgment: knowing when not to push occupancy, when register spilling has set in ( ptxas -v ), and when the kernel is memory- vs compute-bound (roofline).

Follow-up Questions

  • Your tiled kernel hits ~60% of cuBLAS and Nsight reports high occupancy but low achieved DRAM bandwidth and modest compute throughput. How do you decide whether you are latency-bound, and what would you change first?
  • How would the design change if you were allowed to use tensor cores (TF32/FP16 with FP32 accumulate) instead of CUDA-core FMAs? What does that do to the tiling, the operand layout, and the accuracy story?
  • For many small batched GEMMs (e.g. 128×128×128×1000128\times128\times128 \times 1000128×128×128×1000 ), why might your big-tile kernel underperform, and how would batched/persistent-kernel strategies help?
  • Walk through using cp.async and a multi-stage shared-memory pipeline to hide global-memory latency within the kernel (as opposed to hiding host-transfer latency across kernels). What new hazards does that introduce?

Submit Your Answer to Earn 20XP

Sign in to leave a comment

Loading comments...

Browse More Questions

More Coding & Algorithms•More NVIDIA•More Data Scientist•NVIDIA Data Scientist•NVIDIA Coding & Algorithms•Data Scientist Coding & Algorithms
PracHub

Master your tech interviews with 8,000+ real questions from top companies.

Product

  • Questions
  • Learning Tracks
  • Interview Guides
  • Resources
  • Premium
  • For Universities
  • Student Access

Browse

  • By Company
  • By Role
  • By Category
  • Topic Hubs
  • SQL Questions
  • AI Coding Questions
  • Compare Platforms
  • Discord Community

Support

  • support@prachub.com
  • (916) 541-4762

Legal

  • Privacy Policy
  • Terms of Service
  • About Us

© 2026 PracHub. All rights reserved.