TOPI dense (MatMul) op performance problem on GPU

Hi, I want to generate a high-performance matmul kernel for GPU with TVM, but I find that TOPI’s dense (matmul) op is much slower than cublasSgemm.

For example, for a [1024,1024] x [1024,1024] dense op, TVM’s kernel (run with “verify_dense(1024, 1024, 1024, use_bias=False)” in topi/tests/python/test_topi_dense.py) takes 4.6453 ms, while cuBLAS takes 0.33366 ms.
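To put these two timings on a common scale, here is a small sketch that converts them to throughput. It assumes the usual convention of counting 2*M*N*K floating-point operations for a matmul (one multiply and one add per inner-product term); the helper name gemm_gflops is made up for illustration.

```python
def gemm_gflops(m, n, k, millis):
    """Throughput in GFLOP/s, assuming 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    seconds = millis * 1e-3
    return flops / seconds / 1e9

# Timings quoted in the post, in milliseconds.
tvm_ms = 4.6453
cublas_ms = 0.33366

tvm_gflops = gemm_gflops(1024, 1024, 1024, tvm_ms)
cublas_gflops = gemm_gflops(1024, 1024, 1024, cublas_ms)
slowdown = tvm_ms / cublas_ms

print(f"TVM dense : {tvm_gflops:.0f} GFLOP/s")
print(f"cuBLAS    : {cublas_gflops:.0f} GFLOP/s")
print(f"slowdown  : {slowdown:.1f}x")
```

Under that convention the gap is roughly 14x.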

After investigating the implementation of the dense op in TOPI, I found that the schedule computes the dense op by “block reduction” (i.e., it uses one thread block to compute each output element C[i][j]=sum_k(A[i][k]*B[k][j])) and applies none of the common optimizations for GPU matmul (e.g., tiling, double buffering).
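A rough way to see why the missing tiling matters at this size is to count global-memory reads. The sketch below is a simplified model, not a measurement: it ignores hardware caches, assumes every thread block reads its inputs from global memory exactly once, and uses a hypothetical 32x32 output tile. The function names are made up for illustration.

```python
def loads_block_reduction(m, n, k):
    # One thread block per output element C[i][j]:
    # each block reads k values of A and k values of B.
    return m * n * 2 * k

def loads_tiled(m, n, k, tile):
    # One thread block per tile x tile patch of C:
    # each block reads `tile` rows of A and `tile` columns of B
    # (k values each) and shares them among its threads.
    assert m % tile == 0 and n % tile == 0, "sketch assumes exact tiling"
    num_blocks = (m // tile) * (n // tile)
    return num_blocks * 2 * tile * k

naive = loads_block_reduction(1024, 1024, 1024)
tiled = loads_tiled(1024, 1024, 1024, tile=32)

print("block reduction loads:", naive)
print("32x32 tiled loads    :", tiled)
print("reduction factor     :", naive // tiled)
```

In this model the reduction factor equals the tile size, because every value loaded into a tile is reused by that many outputs.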

I also found that TOPI/Recipe contains several high-performance schedules for specific matrix sizes (e.g., cuda_gemm_square.py).

So I am wondering whether I have missed anything in using the TOPI dense (MatMul) op to generate a high-performance CUDA kernel.

The kernel in TOPI is mainly optimized for single-batch inference, which is why it uses “block reduction”.
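For readers unfamiliar with the term, here is a minimal pure-Python simulation of what a block reduction does for a single output element. It is only a sketch of the general technique, not the TOPI schedule itself; the function name and the thread count of 64 are assumptions for illustration. With a batch size of 1 the output has only one row, so splitting the reduction axis across the threads of a block is a natural way to expose more parallelism.

```python
def block_reduce_dot(a_row, b_col, num_threads=64):
    """Simulate one thread block computing sum_k(a_row[k] * b_col[k])."""
    assert num_threads & (num_threads - 1) == 0, "power of two expected"

    # Step 1: each simulated thread accumulates a strided slice of the k axis.
    partial = [0] * num_threads
    for tid in range(num_threads):
        for k in range(tid, len(a_row), num_threads):
            partial[tid] += a_row[k] * b_col[k]

    # Step 2: tree reduction over the partial sums, halving the number of
    # active threads each round (as would be done in shared memory on a GPU).
    stride = num_threads // 2
    while stride > 0:
        for tid in range(stride):
            partial[tid] += partial[tid + stride]
        stride //= 2
    return partial[0]

# Integer inputs keep the comparison with the sequential sum exact.
a = list(range(1024))
b = [2] * 1024
result = block_reduce_dot(a, b)
print(result == sum(x * y for x, y in zip(a, b)))
```

Every thread in the block cooperates on one output, which pays off when there are few outputs and a long reduction axis, but wastes data reuse for large square matrices.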

For other shapes, you can follow cuda_gemm_square.py.
You can also try auto-tuning by following gemm_int8.py.

Thank you for your reply. It is reasonable to use “block reduction” for single batch inference. I will try other schedule policies for my cases. Thanks again.