Faster GEMM GPU kernel than cuBLAS?


Dear all,

I’m trying to build a faster GEMM kernel for a fixed size, A(M,K) * B(K,N), say M=2048, K=2048, N=8.

The goal is to build a GEMM that is faster than cuBLAS at that fixed size.

What kind of schedule should I apply? Is it even possible?
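Before picking a schedule, it may help to note how skinny this shape is: with N=8 the kernel behaves more like a batch of GEMVs than a square GEMM, so it is likely memory-bound. A rough back-of-the-envelope check (fp32, idealized single pass over each matrix through DRAM; the numbers are illustrative, not measured on any GPU):

```python
# Arithmetic intensity of C(M,N) = A(M,K) * B(K,N) in fp32,
# assuming each matrix is moved through DRAM exactly once.
M, K, N = 2048, 2048, 8

flops = 2 * M * K * N                      # one multiply + one add per MAC
bytes_moved = 4 * (M * K + K * N + M * N)  # read A and B, write C (4 bytes/float)
intensity = flops / bytes_moved            # FLOPs per byte of DRAM traffic

# For comparison: a square GEMM with the same M and K.
flops_sq = 2 * M * K * M
bytes_sq = 4 * (M * K + K * M + M * M)
intensity_sq = flops_sq / bytes_sq

print(f"skinny GEMM (N=8): {intensity:.1f} FLOP/byte")
print(f"square GEMM:       {intensity_sq:.1f} FLOP/byte")
```

At roughly 4 FLOP/byte versus ~340 for the square case, the ceiling here is DRAM bandwidth rather than compute, so a schedule that streams A with coalesced loads and keeps B on-chip is probably the right target.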

Thanks a lot!


This is likely to be very difficult, as this may be a well-optimized configuration in cuBLAS. Typically for this task we would define a schedule template and use AutoTVM. See this tutorial, which provides an example of this process for conv2d.
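For intuition about the schedule space such a template would search over, here is a hand-written CUDA kernel sketch specialized to this shape. It is only an illustration of the tiling idea (one thread per row of C, B staged through shared memory), with untuned tile sizes, not a tuned or benchmarked implementation:

```cuda
#include <cuda_runtime.h>

// Sketch of C = A * B for M=2048, K=2048, N=8, fp32, row-major.
// With N this small, each row of A is loaded once and reused for all
// N outputs, and the small B matrix is staged through shared memory.
// Tile size is illustrative; assumes M is a multiple of blockDim.x.

#define M 2048
#define K 2048
#define N 8
#define TILE_K 256  // K-tile of B kept in shared memory (256*8*4 = 8 KB)

__global__ void gemm_skinny(const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row of C

    __shared__ float Bs[TILE_K][N];
    float acc[N] = {0.0f};

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        // Cooperatively load the current K-tile of B into shared memory.
        for (int i = threadIdx.x; i < TILE_K * N; i += blockDim.x)
            Bs[i / N][i % N] = B[(k0 + i / N) * N + i % N];
        __syncthreads();

        for (int k = 0; k < TILE_K; ++k) {
            float a = A[row * K + k0 + k];  // one global load feeds N FMAs
            #pragma unroll
            for (int j = 0; j < N; ++j)
                acc[j] += a * Bs[k][j];
        }
        __syncthreads();
    }

    #pragma unroll
    for (int j = 0; j < N; ++j)
        C[row * N + j] = acc[j];
}
// Launch example: gemm_skinny<<<M / 256, 256>>>(dA, dB, dC);
```

The template for AutoTVM would expose choices like TILE_K, the threads-per-block split, and vectorized loads of A as tunable knobs rather than fixing them by hand as above.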


Yes, thanks a lot! Maybe I should focus on operator fusion instead. Thanks anyway!


One example can be found here, but it cannot beat cuBLAS on this shape.


Has anybody tried using AutoTVM on this template?