Dear all,
I’m trying to build faster GEMM kernels for a fixed size, A(M, K) * B(K, N), say M=2048, K=2048, N=8.
The goal is to build a GEMM that is faster than cuBLAS at certain fixed sizes.
What kind of schedule should I apply? Is it possible?
Thanks a lot!
This is likely a very difficult task, as this may be a very well-supported configuration in cuBLAS. Typically for this task we would define a template and use AutoTVM. See this tutorial, which provides an example of this process for conv2d.
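To see why this shape is hard to win on: with N=8 the kernel is essentially memory-bound. A back-of-envelope arithmetic-intensity estimate (my own sketch, not from the tutorial; it assumes fp32 and that each operand is touched once) shows there are only about 4 FLOPs per byte moved, so both TVM and cuBLAS end up limited by how fast they can stream A from memory, leaving little headroom for a schedule to exploit:

```python
import numpy as np

M, K, N = 2048, 2048, 8

# Total multiply-adds in C = A @ B (2 FLOPs per multiply-add pair).
flops = 2 * M * K * N

# Minimum traffic in bytes, assuming fp32 and each matrix read/written once.
bytes_moved = 4 * (M * K + K * N + M * N)

intensity = flops / bytes_moved
print(f"arithmetic intensity: {intensity:.2f} FLOPs/byte")  # roughly 4
```

At ~4 FLOPs/byte this sits well below the compute/bandwidth balance point of modern GPUs, so the achievable speedup over cuBLAS is bounded by memory bandwidth, not by how cleverly the compute is scheduled.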
Yes, thanks a lot! Maybe I should focus on operator fusion instead… thanks anyway.
One example can be found here: https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py
But it cannot beat cuBLAS on this shape.
Has anybody tried using AutoTVM on this template?