Answer the following systems/performance fundamentals questions (as in a GPU/ML infra interview). Assume a modern NVIDIA-like GPU architecture unless otherwise stated.
- **Amdahl’s law**: What is it, what does it imply about parallel speedup, and how do you use it to reason about optimizations?
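  For reference, one standard statement of the law (p is the parallelizable fraction, N the number of processors; the p = 0.95 figure below is only an illustrative number):

  ```latex
  % Speedup of a workload whose parallelizable fraction is p, run on N processors:
  S(N) = \frac{1}{(1 - p) + p/N},
  \qquad
  \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
  % e.g. p = 0.95 caps the achievable speedup at 1/0.05 = 20x no matter how many
  % processors are added; shrinking the serial fraction is often the
  % higher-leverage optimization.
  ```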
- **GPU memory hierarchy**: Compare **registers**, **shared memory / SRAM**, **L1/L2 cache**, and **HBM/global memory**. What are typical latency/bandwidth trade-offs, and what code patterns map well to each level?
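  To make the levels concrete, a minimal sketch (the kernel name and the 256-thread block size are assumptions for illustration; `*out` is assumed zero-initialized) annotating which level each access touches:

  ```cuda
  __global__ void blockSum(const float* __restrict__ in, float* out, int n) {
      __shared__ float tile[256];          // shared memory: on-chip SRAM, low latency,
                                           // very high bandwidth, visible block-wide
      int gid = blockIdx.x * blockDim.x + threadIdx.x;

      float acc = 0.0f;                    // register: fastest storage, private per thread
      for (int i = gid; i < n; i += gridDim.x * blockDim.x)
          acc += in[i];                    // global memory (HBM): highest latency; these
                                           // coalesced loads are cached in L2 (often L1)
      tile[threadIdx.x] = acc;
      __syncthreads();

      // Tree reduction in shared memory: cheap block-wide data exchange.
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0) atomicAdd(out, tile[0]);  // one global write per block
  }
  // launch with 256 threads per block to match the tile size, e.g.:
  // blockSum<<<numBlocks, 256>>>(d_in, d_out, n);
  ```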
- **Threading limits**:
  - What is a **warp/wavefront**?
  - What limits the maximum number of concurrent threads (per block and per SM), and how does register and shared-memory usage affect **occupancy**?
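  One way to make these limits concrete is to ask the hardware and the runtime directly. A minimal host-side sketch, assuming the CUDA runtime API on device 0; `someKernel` is a placeholder name:

  ```cuda
  #include <cstdio>

  __global__ void someKernel(float* x) {   // stand-in kernel whose register and
      x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;  // shared-memory footprint
  }                                                      // the occupancy query uses

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);
      std::printf("warp size %d | max threads/block %d | max threads/SM %d\n",
                  prop.warpSize, prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor);
      std::printf("regs/SM %d | shared mem/SM %zu bytes\n",
                  prop.regsPerMultiprocessor, prop.sharedMemPerMultiprocessor);

      // How many 256-thread blocks fit on one SM, given this kernel's
      // actual register and shared-memory usage?
      int blocksPerSM = 0;
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, someKernel, 256, 0);
      std::printf("occupancy at 256 threads/block: %.0f%%\n",
                  100.0f * blocksPerSM * 256 / prop.maxThreadsPerMultiProcessor);
  }
  ```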
- **Matrix multiplication (matmul)**:
  - What is the time complexity of multiplying an **m×k** matrix by a **k×n** matrix?
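    For reference, the standard counting argument:

    ```latex
    % The naive algorithm computes m*n output entries, each a length-k dot product:
    \text{time} = \Theta(m \cdot k \cdot n) \approx 2\,mkn \ \text{flops}
    % (one multiply and one add per term; asymptotically faster algorithms such
    % as Strassen's exist but are rarely used in GPU practice)
    ```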
  - How does a tiled GPU implementation work conceptually (what is “tiling/blocking” and why does it help)?
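    Conceptually, each block computes one TILE×TILE patch of the output, staging tiles of the inputs through shared memory so that every value loaded from HBM is reused TILE times. A hedged sketch (row-major float matrices; M, N, K assumed to be multiples of TILE to keep it short):

    ```cuda
    #define TILE 16

    __global__ void matmulTiled(const float* A, const float* B, float* C,
                                int M, int N, int K) {  // C (MxN) = A (MxK) * B (KxN)
        __shared__ float As[TILE][TILE];   // input tiles staged in on-chip memory
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;                  // accumulate in a register

        for (int t = 0; t < K / TILE; ++t) {
            // Each thread loads one element of each tile; the loads are coalesced.
            As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();               // whole tile resident before use

            for (int i = 0; i < TILE; ++i)
                acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
            __syncthreads();               // done with the tile before overwriting it
        }
        C[row * N + col] = acc;
    }
    // launch: dim3 block(TILE, TILE), grid(N / TILE, M / TILE)
    ```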
- **CPU vs GPU matmul**: Why are high-performance implementations different on CPU vs GPU? Discuss SIMD, cache behavior, memory bandwidth, and parallelism.
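  For contrast, a hedged CPU-side sketch: cache blocking keeps sub-matrices resident in L1/L2, and the unit-stride inner loop is the shape compilers auto-vectorize with SIMD. The block size 64 is a placeholder (real libraries tune it per cache level and register file), and C is assumed zero-initialized:

  ```cpp
  #include <algorithm>
  #include <vector>

  constexpr int BS = 64;   // placeholder block size

  void matmulBlocked(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int N, int K) {
      for (int i0 = 0; i0 < M; i0 += BS)
          for (int k0 = 0; k0 < K; k0 += BS)
              for (int j0 = 0; j0 < N; j0 += BS)          // loop over cache-sized blocks
                  for (int i = i0; i < std::min(i0 + BS, M); ++i)
                      for (int k = k0; k < std::min(k0 + BS, K); ++k) {
                          float a = A[i * K + k];          // scalar kept in a register
                          for (int j = j0; j < std::min(j0 + BS, N); ++j)
                              C[i * N + j] += a * B[k * N + j];  // unit stride: SIMD-friendly
                      }
  }
  ```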
- **C++ fundamentals**:
  - What is a **virtual function**, and what runtime cost does it introduce?
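    A minimal sketch (the type names are made up for illustration) of what virtual dispatch costs: a vtable-pointer load, a function-pointer load, and an indirect call, which usually also blocks inlining unless the compiler can devirtualize:

    ```cpp
    #include <cstdio>

    struct Shape {
        virtual ~Shape() = default;
        virtual float area() const = 0;   // dispatched through the object's vtable
    };

    struct Circle : Shape {
        float r;
        explicit Circle(float r) : r(r) {}
        float area() const override { return 3.14159f * r * r; }
    };

    void printArea(const Shape& s) {
        // Indirect call: load the vptr, load the slot, branch. The optimizer can
        // inline this only if it can prove the dynamic type (devirtualization).
        std::printf("%f\n", s.area());
    }
    ```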
  - What does **inline** mean in C++? When is inlining likely, unsafe, or unhelpful?
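    A short sketch of the distinction worth drawing in an answer: the `inline` keyword is primarily a linkage/ODR promise, while actual inlining is an optimizer decision:

    ```cpp
    // `inline` lets this definition appear in every translation unit that
    // includes the header, without a multiple-definition link error.
    inline int square(int x) { return x * x; }   // small and visible: likely inlined

    int sumOfSquares(int a, int b) {
        // The optimizer inlines on its own heuristics. Large bodies, recursion,
        // calls through function pointers, and virtual calls with an unknown
        // dynamic type are the usual cases where inlining is blocked or
        // counterproductive (code bloat, instruction-cache pressure).
        return square(a) + square(b);
    }
    ```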