TOPI dense (MatMul) op performance problem on GPU

Hi, I want to generate a high-performance matmul kernel on GPU with TVM, but I find that TOPI’s dense (matmul) op is much slower than cublasSgemm.

For example, for a [1024,1024] x [1024,1024] dense op, TVM’s kernel (run with “verify_dense(1024, 1024, 1024, use_bias=False)” in topi/tests/python/) takes 4.6453 ms, while cuBLAS takes 0.33366 ms.
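
To put those two timings in perspective, here is a rough back-of-the-envelope calculation. It assumes the usual 2*M*N*K flop count for a matmul; the helper name `gflops` is only for illustration and is not part of TVM.

```python
# Rough throughput comparison from the two timings quoted in the post.
# Assumption: a [M,K] x [K,N] matmul costs 2*M*N*K floating-point operations
# (one multiply and one add per inner-product term).

def gflops(m, n, k, millis):
    """Achieved GFLOP/s for an m x k by k x n matmul that takes `millis` ms."""
    flops = 2 * m * n * k
    return flops / (millis * 1e-3) / 1e9

tvm_ms = 4.6453      # TOPI dense timing from the post
cublas_ms = 0.33366  # cuBLAS timing from the post

tvm_rate = gflops(1024, 1024, 1024, tvm_ms)
cublas_rate = gflops(1024, 1024, 1024, cublas_ms)
slowdown = tvm_ms / cublas_ms

print(f"TOPI dense: {tvm_rate:.0f} GFLOP/s")
print(f"cuBLAS:     {cublas_rate:.0f} GFLOP/s")
print(f"slowdown:   {slowdown:.1f}x")
```

Under that assumption, the TOPI kernel is roughly 14x slower than cuBLAS for this shape.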

After investigating the implementation of the dense op in TOPI, I find that the schedule computes the dense op in a “block reduction” way (i.e., it uses one thread block to compute each output element C[i][j]=sum_k(A[i][k]*B[k][j])) without any of the common optimizations for GPU-based matmul (e.g., tiling, double buffering, etc.).
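
To illustrate why tiling matters here, below is a small pure-Python model (no GPU or TVM needed). It is only a sketch: the function names are made up for this example, and "loads" is an assumed cost model that counts reads of A and B from slow (global) memory, treating a staged tile as free to reuse from fast (shared) memory.

```python
# Pure-Python model contrasting per-element reduction with tiling.
# Assumption: `loads` counts reads of A and B from slow (global) memory;
# once a tile is staged, reusing it from fast (shared) memory costs nothing.

def matmul_per_element(A, B):
    """Each C[i][j] reads its whole row of A and column of B on its own."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    loads = 0
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
                loads += 2  # one element of A, one element of B
            C[i][j] = acc
    return C, loads

def matmul_tiled(A, B, t):
    """Each t x t output tile stages t x t tiles of A and B and reuses them."""
    m, k, n = len(A), len(B), len(B[0])
    assert m % t == 0 and n % t == 0 and k % t == 0, "sizes must divide by t"
    C = [[0.0] * n for _ in range(m)]
    loads = 0
    for i0 in range(0, m, t):
        for j0 in range(0, n, t):
            for p0 in range(0, k, t):
                # Stage one tile of A and one tile of B (2 * t * t loads).
                a_tile = [row[p0:p0 + t] for row in A[i0:i0 + t]]
                b_tile = [row[j0:j0 + t] for row in B[p0:p0 + t]]
                loads += 2 * t * t
                # Every staged value is reused t times inside the tile.
                for i in range(t):
                    for j in range(t):
                        acc = C[i0 + i][j0 + j]
                        for p in range(t):
                            acc += a_tile[i][p] * b_tile[p][j]
                        C[i0 + i][j0 + j] = acc
    return C, loads

n = 8
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float((i * j) % 5) for j in range(n)] for i in range(n)]

C_ref, loads_ref = matmul_per_element(A, B)
C_tiled, loads_tiled = matmul_tiled(A, B, 4)

print("results match:", C_ref == C_tiled)
print("loads per-element:", loads_ref, "loads tiled:", loads_tiled)
```

In this model the tiled version needs a factor of t fewer slow-memory loads than the per-element version, which is the main reason tiled kernels get closer to cuBLAS on large square shapes.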

I also find that there are several high-performance scheduling policies for some specific matrix sizes in TOPI/Recipe (e.g.,

So I am wondering whether I have missed anything in using the TOPI dense (MatMul) op to generate a high-performance CUDA kernel.

The kernel in TOPI is mainly optimized for single-batch inference, so it uses “block reduction”.
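
As a rough illustration of what "block reduction" means, here is a pure-Python model of one thread block computing a single output element. It is a sketch under assumptions, not the actual TOPI schedule: the function name and the thread count are hypothetical, and the presumed motivation is that with a batch of 1 there are few output elements, so the parallelism has to come from splitting the reduction axis across threads.

```python
# Pure-Python model of one thread block reducing a single dot product.
# Assumption: `num_threads` is a power of two, as is typical for a tree reduction.

def block_reduce_dot(a, b, num_threads):
    """Compute sum_k a[k] * b[k] the way a cooperating thread block would."""
    assert len(a) == len(b)
    assert num_threads > 0 and num_threads & (num_threads - 1) == 0

    # Step 1: each "thread" accumulates a strided slice of the reduction axis.
    partial = [0.0] * num_threads
    for tid in range(num_threads):
        for k in range(tid, len(a), num_threads):
            partial[tid] += a[k] * b[k]

    # Step 2: tree reduction over the per-thread partial sums
    # (on a GPU this would happen in shared memory with barriers in between).
    stride = num_threads // 2
    while stride > 0:
        for tid in range(stride):
            partial[tid] += partial[tid + stride]
        stride //= 2
    return partial[0]

row = [float(k) for k in range(1024)]  # one row of A (batch size 1)
col = [1.0] * 1024                     # one column of B
result = block_reduce_dot(row, col, 64)
print(result == sum(r * c for r, c in zip(row, col)))
```

All threads in the block stay busy even when there is only one row of A, but no input value is shared between output elements, which is why this approach falls behind on large batches.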

For other shapes, you can follow the
You can also try auto-tuning by following the

Thank you for your reply. It is reasonable to use “block reduction” for single batch inference. I will try other schedule policies for my cases. Thanks again.