Faster GEMM GPU kernel than cuBLAS?

Dear all,

I’m trying to build a faster GEMM kernel for a fixed size, A(M, K) * B(K, N), say M=2048, K=2048, N=8.

The goal is to build a GEMM kernel that is faster than cuBLAS at this fixed size.

What kind of schedule should I apply? Is it possible?
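
For concreteness, here is how the computation can be declared in TVM's tensor expression language (a minimal sketch using the `tvm.te` API; the schedule is the part I'm asking about):

```python
import tvm
from tvm import te

# Fixed shape from the question: C(M, N) = A(M, K) * B(K, N)
M, K, N = 2048, 2048, 8

A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
    name="C",
)
```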

Thanks a lot!

This is likely a very difficult task, as this shape may be a very well-supported configuration in cuBLAS. Typically for this task we would define a schedule template and use AutoTVM. See this tutorial, which provides an example of this process for conv2d.
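
As a starting point, a template for this shape might look roughly like the following (an untested sketch; the template name `gemm_2048x2048x8` and the split choices are illustrative, and a competitive schedule would also cache tiles of A and B in shared memory, as the recipe linked below does):

```python
import tvm
from tvm import te, autotvm

@autotvm.template("gemm_2048x2048x8")  # hypothetical template name
def gemm(M, K, N):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    cfg = autotvm.get_config()

    # Let the tuner search over tile sizes instead of hand-picking them.
    y, x = s[C].op.axis
    (kr,) = s[C].op.reduce_axis
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    cfg.define_split("tile_k", kr, num_outputs=2)

    by, ty = cfg["tile_y"].apply(s, C, y)
    bx, tx = cfg["tile_x"].apply(s, C, x)
    ko, ki = cfg["tile_k"].apply(s, C, kr)

    # Bind the outer spatial tiles to blocks and the inner ones to threads.
    s[C].reorder(by, bx, ty, tx, ko, ki)
    s[C].bind(by, te.thread_axis("blockIdx.y"))
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(ty, te.thread_axis("threadIdx.y"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    return s, [A, B, C]
```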

Yes, thanks a lot! Maybe I should focus on operator fusion instead. Thanks anyway!

One example can be found here: https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py
But it is not faster than cuBLAS for this shape.
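
For reference, you can time the cuBLAS baseline you are trying to beat directly from TVM (a rough sketch, assuming TVM was built with USE_CUBLAS=ON):

```python
import numpy as np
import tvm
from tvm import te
from tvm.contrib import cublas

M, K, N = 2048, 2048, 8
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = cublas.matmul(A, B)  # extern op that dispatches to cuBLAS GEMM

s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target="cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(np.random.rand(M, K).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(K, N).astype("float32"), dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float32"), dev)

# Average over many runs to get a stable number to beat.
evaluator = func.time_evaluator(func.entry_name, dev, number=100)
print("cuBLAS: %.4f ms" % (evaluator(a, b, c).mean * 1e3))
```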

Has anybody tried using AutoTVM on this template?
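
I imagine the tuning driver would look something like this (an untested sketch, reusing the hypothetical `gemm` template from above; the trial count and log file name are arbitrary):

```python
import tvm
from tvm import autotvm

# Create a task from the template above and search with the XGBoost tuner.
task = autotvm.task.create("gemm_2048x2048x8", args=(2048, 2048, 8), target="cuda")

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100),
)

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(
    n_trial=500,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("gemm_tuning.log")],
)

# Build the final kernel with the best config found during tuning.
with autotvm.apply_history_best("gemm_tuning.log"):
    with tvm.target.Target("cuda"):
        s, args = gemm(2048, 2048, 8)
        func = tvm.build(s, args)
```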