Implement CUDA-tiled matrix multiplication and explain architecture

Last updated: Mar 29, 2026

Quick Overview

This question evaluates a candidate's competency in CUDA GPU programming, parallel algorithms, and performance engineering for FP32 matrix multiplication, covering tiling strategies, memory hierarchy (global/shared/register), synchronization, numerical precision, and occupancy/resource analysis.

Company: NVIDIA

Role: Data Scientist

Category: Coding & Algorithms

Difficulty: hard

Interview Round: HR Screen

Oct 13, 2025, 9:49 PM

CUDA FP32 GEMM Design Task

Implement a high-performance CUDA kernel for matrix multiplication C = A · B where:

  • A is m×k, B is k×n, C is m×n
  • Data type: FP32
  • Assume row-major layout unless otherwise stated.
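
As a baseline for checking any kernel against, the definition above can be written as a plain host-side triple loop. This is a minimal sketch; the function name `gemm_reference` and the use of `std::vector` are assumptions made for illustration, not part of the task statement.

```cuda
#include <cstddef>
#include <vector>

// Host-side reference: C(i, j) = sum over p of A(i, p) * B(p, j).
// Row-major storage: element (r, c) of a matrix with `cols` columns
// lives at offset r * cols + c.
void gemm_reference(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;  // accumulate in increasing p order
            for (int p = 0; p < k; ++p) {
                acc += A[static_cast<std::size_t>(i) * k + p] *
                       B[static_cast<std::size_t>(p) * n + j];
            }
            C[static_cast<std::size_t>(i) * n + j] = acc;
        }
    }
}
```

Because FP32 addition is not associative, a GPU kernel that sums in a different order may differ from this reference in the last few bits, so comparisons usually need a tolerance rather than exact equality.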

Specify and justify the following:

  1. Tiling and mapping
      • Choose concrete tile sizes and describe:
        • Block tile sizes (BM×BN×BK)
        • Threads per block and warp layout
        • Shared-memory tiling strategy (double-buffering if any)
        • Register tiling per thread (thread tile) and inner-loop unrolling strategy
  2. Memory access efficiency
      • How you ensure coalesced global loads/stores
      • How you avoid shared-memory bank conflicts
  3. Edge handling
      • How to handle tiles when m, n, or k are not multiples of the chosen tile sizes
  4. Occupancy analysis
      • Given an SM with: 64 warps/SM, 64k registers/SM, and 100 KB shared memory/SM
      • Using your threads/block, registers/thread, and shared memory/block, compute theoretical occupancy and identify the limiting resource
  5. Synchronization and numerical considerations
      • Synchronization strategy within a block
      • Accumulation order and precision considerations
  6. Expected performance vs. cuBLAS
      • Briefly compare, quantify the expected gap, and justify why
  7. CUDA execution and memory model
      • Explain grids, blocks, threads, warps, SMs; global/shared/register/constant/texture memory; barriers/atomics
      • Explain how these inform your design choices
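
One possible shape for the tiling, memory-access, edge-handling, and synchronization parts of an answer is sketched below. It is a minimal illustration, not a tuned kernel: the tile sizes (64×64×8 block tile, 4×4 thread tile, 256 threads per block), the padding of 4 floats, and the names `sgemm_tiled` and `launch_sgemm_tiled` are all assumptions chosen for the example. It uses no double-buffering or vectorized loads, and its loads from A are coalesced only in runs of BK floats.

```cuda
#include <cuda_runtime.h>

// Illustrative tile sizes; these are assumptions, not values from the task.
constexpr int BM = 64, BN = 64, BK = 8;  // block tile of C and k-slice depth
constexpr int TM = 4, TN = 4;            // each thread accumulates a TM x TN patch
constexpr int TX = BN / TN;              // 16 threads along columns
constexpr int TY = BM / TM;              // 16 threads along rows
constexpr int NT = TX * TY;              // 256 threads per block
constexpr int PAD = 4;                   // keeps transposed A stores on distinct banks

__global__ void sgemm_tiled(const float* __restrict__ A, const float* __restrict__ B,
                            float* __restrict__ C, int m, int n, int k) {
    __shared__ float As[BK][BM + PAD];  // A tile, stored transposed
    __shared__ float Bs[BK][BN];        // B tile

    const int tid = threadIdx.y * TX + threadIdx.x;
    const int rowBase = blockIdx.y * BM;
    const int colBase = blockIdx.x * BN;
    float acc[TM][TN] = {};  // register tile, zero-initialized

    for (int k0 = 0; k0 < k; k0 += BK) {
        // Cooperative loads. Out-of-range elements are written as 0 so that
        // edge tiles contribute nothing to the dot products.
        for (int idx = tid; idx < BM * BK; idx += NT) {
            const int r = idx / BK, c = idx % BK;
            const int gr = rowBase + r, gc = k0 + c;
            As[c][r] = (gr < m && gc < k) ? A[(size_t)gr * k + gc] : 0.0f;
        }
        for (int idx = tid; idx < BK * BN; idx += NT) {
            const int r = idx / BN, c = idx % BN;
            const int gr = k0 + r, gc = colBase + c;
            Bs[r][c] = (gr < k && gc < n) ? B[(size_t)gr * n + gc] : 0.0f;
        }
        __syncthreads();  // tiles fully written before any thread reads them

        #pragma unroll
        for (int p = 0; p < BK; ++p) {
            float a[TM], b[TN];
            #pragma unroll
            for (int i = 0; i < TM; ++i) { a[i] = As[p][threadIdx.y * TM + i]; }
            #pragma unroll
            for (int j = 0; j < TN; ++j) { b[j] = Bs[p][threadIdx.x + j * TX]; }
            #pragma unroll
            for (int i = 0; i < TM; ++i) {
                #pragma unroll
                for (int j = 0; j < TN; ++j) { acc[i][j] += a[i] * b[j]; }
            }
        }
        __syncthreads();  // all reads finished before the next overwrite
    }

    // Guarded stores: neighbouring threads write neighbouring columns of C.
    #pragma unroll
    for (int i = 0; i < TM; ++i) {
        const int r = rowBase + threadIdx.y * TM + i;
        #pragma unroll
        for (int j = 0; j < TN; ++j) {
            const int c = colBase + threadIdx.x + j * TX;
            if (r < m && c < n) { C[(size_t)r * n + c] = acc[i][j]; }
        }
    }
}

// Hypothetical host helper: one block per BM x BN tile of C, rounding up.
inline void launch_sgemm_tiled(const float* dA, const float* dB, float* dC,
                               int m, int n, int k, cudaStream_t stream = 0) {
    const dim3 block(TX, TY);
    const dim3 grid((n + BN - 1) / BN, (m + BM - 1) / BM);
    sgemm_tiled<<<grid, block, 0, stream>>>(dA, dB, dC, m, n, k);
}
```

Each thread in this sketch sums its products in increasing k order, so results can differ from cuBLAS or a CPU reference in the low-order bits. The strided column mapping (`threadIdx.x + j * TX`) is a design choice that lets the threads of a warp read distinct shared-memory banks from `Bs` and write contiguous runs of C. A kernel this simple would likely trail cuBLAS by a clear margin, since it omits double-buffering, vectorized loads, and architecture-specific tuning.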
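
The occupancy analysis reduces to integer arithmetic over the three per-SM limits given in the task. The host-side sketch below shows that arithmetic under stated assumptions: a hypothetical block of 256 threads using 64 registers per thread and 4,224 bytes of shared memory (in practice the register count would come from compiler output such as `nvcc -Xptxas -v`). The struct and function names are invented for illustration, and the estimate ignores register and shared-memory allocation granularity and any per-SM cap on resident blocks.

```cuda
#include <algorithm>

struct SmLimits   { int maxWarps; int registers; int sharedBytes; };
struct BlockUsage { int threads; int regsPerThread; int sharedBytes; };
struct Occupancy  { int blocksPerSm; double fraction; const char* limiter; };

// Theoretical occupancy: resident warps divided by the SM's warp capacity,
// where the resident block count is the minimum allowed by each resource.
Occupancy estimate_occupancy(const SmLimits& sm, const BlockUsage& b) {
    const int warpsPerBlock = (b.threads + 31) / 32;
    const int byWarps = sm.maxWarps / warpsPerBlock;
    const int byRegs  = sm.registers / (b.regsPerThread * warpsPerBlock * 32);
    const int bySmem  = b.sharedBytes > 0 ? sm.sharedBytes / b.sharedBytes : byWarps;
    const int blocks  = std::min({byWarps, byRegs, bySmem});

    const char* limiter = "shared memory";
    if (blocks == byWarps)      { limiter = "warp slots"; }
    else if (blocks == byRegs)  { limiter = "registers"; }

    const double fraction =
        static_cast<double>(blocks * warpsPerBlock) / sm.maxWarps;
    return {blocks, fraction, limiter};
}
```

With the task's SM (64 warps, 65,536 registers, 100 × 1,024 bytes of shared memory) and the assumed block, the limits are 8 blocks by warp slots, 4 by registers, and 24 by shared memory. That gives 4 resident blocks, 32 of 64 warps, and 50% theoretical occupancy with registers as the limiting resource.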
