Implement CUDA-tiled matrix multiplication and explain architecture
Company: NVIDIA
Role: Data Scientist
Category: Coding & Algorithms
Difficulty: hard
Interview Round: HR Screen
Implement a high-performance kernel for C = A(m×k) · B(k×n) in CUDA (FP32). Specify: 1) Tile sizes, thread/block layout, shared-memory tiling, register tiling, and unrolling strategy. 2) How you ensure coalesced global loads/stores and avoid shared-memory bank conflicts. 3) Handling of edge tiles when m, n, or k are not multiples of the tile size. 4) Occupancy analysis on an SM with 64 warps/SM, 64K 32-bit registers/SM, and 100 KB shared memory/SM: given your threads/block, registers/thread, and shared memory/block, compute the occupancy and identify the limiting resource. 5) Synchronization strategy and numerical considerations (accumulation order). 6) Briefly compare expected performance vs. cuBLAS and justify any gap. 7) Explain CUDA's execution/memory model (grids, blocks, threads, warps, SMs; global/shared/register/constant/texture memory; barriers/atomics) and how it informs your design.
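A minimal sketch of the shared-memory-tiled kernel the question asks for, covering tiling, coalescing, edge guards, and synchronization. The 16×16 tile size is an illustrative choice, and this version omits the register tiling and double buffering a full answer would add:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile edge; a full answer would also register-tile

// C = A (m x k) * B (k x n), row-major FP32.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int m, int n, int k) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // col of C this thread owns
    float acc = 0.0f;

    // Sweep the k dimension one tile at a time.
    for (int t = 0; t < (k + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;  // consecutive threadIdx.x hits
        int b_row = t * TILE + threadIdx.y;  // consecutive addresses: coalesced
        // Zero-fill out-of-range elements so edge tiles need no special inner loop.
        As[threadIdx.y][threadIdx.x] =
            (row < m && a_col < k) ? A[row * k + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < k && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully staged before any thread reads it

        #pragma unroll
        for (int p = 0; p < TILE; ++p)
            acc += As[threadIdx.y][p] * Bs[p][threadIdx.x];
        __syncthreads();  // all reads done before the tile is overwritten
    }

    if (row < m && col < n)          // edge guard on the store
        C[row * n + col] = acc;      // coalesced: consecutive threadIdx.x, cols
}
```

Launch with `dim3 block(TILE, TILE)` and `dim3 grid((n + TILE - 1) / TILE, (m + TILE - 1) / TILE)`. Accumulation order is fixed per output element (sequential over k in registers), which keeps results deterministic run to run.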
Quick Answer: This question evaluates a candidate's competency in CUDA GPU programming, parallel algorithms, and performance engineering for FP32 matrix multiplication, covering tiling strategies, memory hierarchy (global/shared/register), synchronization, numerical precision, and occupancy/resource analysis.
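The occupancy analysis in part 4 reduces to simple resource arithmetic. The sketch below uses the SM limits stated in the question (64 warps/SM, 64K registers/SM, 100 KB shared memory/SM); the kernel figures of 256 threads/block, 40 registers/thread, and 8 KB shared memory/block are illustrative assumptions, not part of the question:

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, regs_per_sm=64 * 1024, smem_per_sm=100 * 1024,
              warp_size=32):
    """Blocks/SM, limiting resource, and occupancy for the stated SM limits."""
    warps_per_block = threads_per_block // warp_size
    limits = {
        "warps": max_warps // warps_per_block,
        "registers": regs_per_sm // (threads_per_block * regs_per_thread),
        "shared memory": smem_per_sm // smem_per_block,
    }
    limiter = min(limits, key=limits.get)   # scarcest resource wins
    blocks = limits[limiter]
    return blocks, limiter, blocks * warps_per_block / max_warps

# Assumed kernel: 256 threads/block, 40 regs/thread, 8 KB smem/block
# (e.g. two 32x32 FP32 staging tiles = 2 * 32 * 32 * 4 B = 8192 B).
blocks, limiter, occ = occupancy(256, 40, 8192)
print(blocks, limiter, occ)  # 6 blocks/SM, limited by registers, 75% occupancy
```

Here warps allow 8 blocks and shared memory 12, but registers allow only floor(65536 / (256 × 40)) = 6 blocks, so registers are the limiting resource: 6 × 8 = 48 resident warps out of 64, i.e. 75% occupancy.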