Optimize CUDA GEMM with tiling and coalescing | NVIDIA Interview Question