Explain Amdahl’s law and GPU matmul optimization
Company: NVIDIA
Role: Software Engineer
Category: Software Engineering Fundamentals
Difficulty: medium
Interview Round: Technical Screen
Answer the following systems and performance fundamentals questions, as might be asked in a GPU/ML infrastructure interview. Assume a modern NVIDIA-like GPU architecture unless otherwise stated.
1. **Amdahl’s law**: What is it, what does it imply about parallel speedup, and how do you use it to reason about optimizations?
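A minimal sketch of the formula behind question 1 (function name is my own, for illustration):

```cpp
#include <cmath>

// Amdahl's law: if a fraction p of the work parallelizes perfectly across n
// workers and the remaining (1 - p) stays serial, the overall speedup is
//   S(n) = 1 / ((1 - p) + p / n),
// which approaches 1 / (1 - p) as n grows: the serial fraction caps the gain.
double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.95, eight workers give roughly 5.9x, and no number of workers can push past 20x. This is the practical use in optimization work: measure the serial fraction first, because it bounds what any amount of parallel hardware can buy you.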
2. **GPU memory hierarchy**: Compare **registers**, **shared memory / SRAM**, **L1/L2 cache**, and **HBM/global memory**. What are typical latency/bandwidth trade-offs, and what code patterns map well to each level?
3. **Threading limits**:
- What is a **warp/wavefront**?
- What limits the maximum number of concurrent threads (per block and per SM), and how do registers/shared-memory usage affect **occupancy**?
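A back-of-the-envelope occupancy model for question 3. The constants and function below are illustrative, not tied to a specific GPU; they are in the ballpark of recent NVIDIA SMs, but real limits vary by architecture:

```cpp
#include <algorithm>

// Illustrative per-SM resource limits (real values vary per architecture).
constexpr int kRegistersPerSM  = 65536;  // 32-bit registers in the register file
constexpr int kMaxThreadsPerSM = 2048;   // hardware cap on resident threads
constexpr int kSharedMemPerSM  = 49152;  // bytes of shared memory

// Resident threads are capped by whichever resource is exhausted first:
// the hardware thread limit, the register file, or shared memory per block.
int max_resident_threads(int regs_per_thread, int smem_per_block,
                         int threads_per_block) {
    int by_regs = kRegistersPerSM / regs_per_thread;
    int by_smem = smem_per_block > 0
                      ? (kSharedMemPerSM / smem_per_block) * threads_per_block
                      : kMaxThreadsPerSM;
    return std::min({kMaxThreadsPerSM, by_regs, by_smem});
}
```

Under these numbers, a kernel using 64 registers per thread fits only 65536 / 64 = 1024 resident threads, i.e. 50% occupancy, even though no single resource looks scarce in isolation. That is the mechanism the question is probing: register and shared-memory pressure silently cap concurrency.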
4. **Matrix multiplication (matmul)**:
- What is the time complexity of multiplying an \(m\times k\) matrix by a \(k\times n\) matrix?
- How does a tiled GPU implementation work conceptually (what is “tiling/blocking” and why does it help)?
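The tiling idea in question 4 can be shown on the CPU side (a GPU kernel applies the same blocking with shared memory). This is a sketch, not a tuned implementation; the naive version makes the \(O(mkn)\) cost explicit, and the tiled version reorders the same multiply-adds so each loaded element of A and B is reused roughly TILE times from fast memory:

```cpp
#include <algorithm>
#include <vector>

// Row-major matmul: C[m x n] = A[m x k] * B[k x n].
constexpr int TILE = 4;  // illustrative; real kernels size this to the fast memory

std::vector<double> matmul_naive(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 int m, int k, int n) {
    std::vector<double> C(m * n, 0.0);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)          // O(m*k*n) multiply-adds total
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}

std::vector<double> matmul_tiled(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 int m, int k, int n) {
    std::vector<double> C(m * n, 0.0);
    // Outer loops walk TILE x TILE blocks; inner loops stay within one
    // block pair, so the working set fits in cache / GPU shared memory.
    for (int i0 = 0; i0 < m; i0 += TILE)
        for (int p0 = 0; p0 < k; p0 += TILE)
            for (int j0 = 0; j0 < n; j0 += TILE)
                for (int i = i0; i < std::min(i0 + TILE, m); ++i)
                    for (int p = p0; p < std::min(p0 + TILE, k); ++p)
                        for (int j = j0; j < std::min(j0 + TILE, n); ++j)
                            C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}
```

Both versions perform the same arithmetic in a different order, which is why tiling helps: the operation count is unchanged, but memory traffic to slow (global/DRAM) memory drops by roughly the tile dimension.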
5. **CPU vs GPU matmul**: Why are high-performance implementations different on CPU vs GPU? Discuss SIMD, cache behavior, memory bandwidth, and parallelism.
6. **C++ fundamentals**:
- What is a **virtual function**, and what runtime cost does it introduce?
- What does `inline` mean in C++? When is inlining likely/unsafe/unhelpful?
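A compact sketch covering both parts of question 6 (the `Shape`/`Square` names are illustrative):

```cpp
#include <memory>

// A virtual call dispatches through the object's vtable pointer: an extra
// indirect load and an indirect branch per call, and the callee usually
// cannot be inlined unless the compiler can devirtualize the call.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;  // resolved at runtime via the vtable
};

struct Square final : Shape {
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
    double side;
};

// `inline` primarily relaxes the one-definition rule so the definition may
// appear in multiple translation units (e.g. in a header). Whether a call is
// actually inlined is the optimizer's decision; large, recursive, or
// virtually-dispatched functions are poor candidates.
inline double twice(double x) { return 2.0 * x; }
```

A typical follow-up: marking `Square` as `final` lets the compiler devirtualize calls when the static type is known, recovering direct-call performance.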
Quick Answer: This question evaluates systems and performance fundamentals: parallel speedup (Amdahl’s law), the GPU memory hierarchy, threading limits and occupancy, tiled matrix multiplication and why CPU and GPU implementations differ, and C++ runtime concepts such as virtual functions and inlining.