Implement CUDA-tiled matrix multiplication and explain architecture | NVIDIA