Faster GEMM GPU kernel than cuBLAS?

Dear all,

I’m trying to build a faster GEMM kernel for a fixed size, A(M, K) * B(K, N), say M=2048, K=2048, N=8.

The goal is to build a GEMM kernel that is faster than cuBLAS at this fixed size.

What kind of schedule should I apply? Is it possible?
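
For concreteness, here is how the computation can be declared in TVM's tensor expression language (a minimal sketch using the `tvm.te` API; the schedule is the part I'm asking about):

```python
import tvm
from tvm import te

# Fixed shape from the question: C(M, N) = A(M, K) * B(K, N)
M, K, N = 2048, 2048, 8

A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
    name="C",
)
```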

Thanks a lot!

This is likely a very difficult task, as this shape may be a very well-supported configuration in cuBLAS. Typically for this task we would define a schedule template and use AutoTVM. See this tutorial, which provides an example of this process for conv2d.
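
As a starting point, a template for this shape might look roughly like the following (an untested sketch; the template name `gemm_2048x2048x8` and the split choices are illustrative, and a competitive schedule would also cache tiles of A and B in shared memory, as the recipe linked below does):

```python
import tvm
from tvm import te, autotvm

@autotvm.template("gemm_2048x2048x8")  # hypothetical template name
def gemm(M, K, N):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    cfg = autotvm.get_config()

    # Let the tuner search over tile sizes instead of hand-picking them.
    y, x = s[C].op.axis
    (kr,) = s[C].op.reduce_axis
    cfg.define_split("tile_y", y, num_outputs=2)
    cfg.define_split("tile_x", x, num_outputs=2)
    cfg.define_split("tile_k", kr, num_outputs=2)

    by, ty = cfg["tile_y"].apply(s, C, y)
    bx, tx = cfg["tile_x"].apply(s, C, x)
    ko, ki = cfg["tile_k"].apply(s, C, kr)

    # Bind the outer spatial tiles to blocks and the inner ones to threads.
    s[C].reorder(by, bx, ty, tx, ko, ki)
    s[C].bind(by, te.thread_axis("blockIdx.y"))
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(ty, te.thread_axis("threadIdx.y"))
    s[C].bind(tx, te.thread_axis("threadIdx.x"))

    return s, [A, B, C]
```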

Yes, thanks a lot! Maybe I should focus on operator fusion instead. Thanks anyway!

One example can be found here: https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py
But it is not faster than cuBLAS for this shape.
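
For reference, you can time the cuBLAS baseline you are trying to beat directly from TVM (a rough sketch, assuming TVM was built with USE_CUBLAS=ON):

```python
import numpy as np
import tvm
from tvm import te
from tvm.contrib import cublas

M, K, N = 2048, 2048, 8
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = cublas.matmul(A, B)  # extern op that dispatches to cuBLAS GEMM

s = te.create_schedule(C.op)
func = tvm.build(s, [A, B, C], target="cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(np.random.rand(M, K).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(K, N).astype("float32"), dev)
c = tvm.nd.array(np.zeros((M, N), dtype="float32"), dev)

# Average over many runs to get a stable number to beat.
evaluator = func.time_evaluator(func.entry_name, dev, number=100)
print("cuBLAS: %.4f ms" % (evaluator(a, b, c).mean * 1e3))
```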

Has anybody tried using AutoTVM on this template?
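
I imagine the tuning driver would look something like this (an untested sketch, reusing the hypothetical `gemm` template from above; the trial count and log file name are arbitrary):

```python
import tvm
from tvm import autotvm

# Create a task from the template above and search with the XGBoost tuner.
task = autotvm.task.create("gemm_2048x2048x8", args=(2048, 2048, 8), target="cuda")

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(repeat=3, min_repeat_ms=100),
)

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(
    n_trial=500,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("gemm_tuning.log")],
)

# Build the final kernel with the best config found during tuning.
with autotvm.apply_history_best("gemm_tuning.log"):
    with tvm.target.Target("cuda"):
        s, args = gemm(2048, 2048, 8)
        func = tvm.build(s, args)
```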