GEMM GPU template?

Hi all, is there an official GPU template for single-precision and half-precision GEMM? I found the template at https://github.com/dmlc/tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py but I am not sure what the best way to use AutoTVM with it is.

I assume you would just create a config object and then replace all the splits and bindings with the appropriate calls on the config. Is that right?

We really need an AutoTVM template for GPU GEMM. Could you help add it?

And yes, we usually just replace the hand-written splits with splits defined in the AutoTVM config, and sometimes define the loop order and the maximum unroll step in the config as well. I recommend hardcoding the GPU binding in the template instead of adding it to the config: the binding is quite important, and we usually just bind the outer loops to the block and thread indices anyway.
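To make the "splits in the config, binding hardcoded" idea concrete, here is a minimal pure-Python sketch of the kind of search space this produces. It does not use TVM at all. In a real template the enumeration would come from AutoTVM's `cfg.define_split(...)`, and the chosen factors would be applied with `cfg[...].apply(...)`. The helper names (`split_candidates`, `gemm_config_space`), the knob names `tile_y` and `tile_x`, and the three-way block/thread/inner split are assumptions made for illustration, not part of the recipe linked above.

```python
from itertools import product

# Hardcoded binding roles: these stay fixed in the template and are NOT tuned.
# (Hypothetical three-level split; a real GPU template may use more levels.)
BINDING = ("blockIdx", "threadIdx", "inner")


def factors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def split_candidates(extent, num_outputs):
    """Enumerate every ordered way to split `extent` into `num_outputs` factors.

    This mimics, in spirit, what a factor-based split knob enumerates.
    """
    if num_outputs == 1:
        return [(extent,)]
    out = []
    for f in factors(extent):
        for rest in split_candidates(extent // f, num_outputs - 1):
            out.append((f,) + rest)
    return out


def gemm_config_space(m, n, max_threads=1024):
    """Build the tunable space for an m x n output tile decomposition.

    Only the split factors vary. The meaning of each position is fixed by
    BINDING, so index 1 of every split is always the threadIdx extent.
    Configurations that exceed `max_threads` threads per block are dropped.
    """
    space = []
    for ty, tx in product(split_candidates(m, len(BINDING)),
                          split_candidates(n, len(BINDING))):
        if ty[1] * tx[1] <= max_threads:
            space.append({"tile_y": ty, "tile_x": tx})
    return space


if __name__ == "__main__":
    space = gemm_config_space(8, 8, max_threads=16)
    print(len(space), "candidate configs, first one:", space[0])
```

The tuner then only has to pick one entry from this space, while the mapping of each factor to a block or thread index never changes.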

Can people share their experiences with tuning GEMM on GPUs? Realistically, what kind of performance relative to cuBLAS should I expect?
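To keep any shared numbers comparable, one option is to report throughput in GFLOPS using the standard 2*M*N*K operation count for GEMM, together with the ratio to a cuBLAS run of the same shape. The sketch below is only the arithmetic. The function names are hypothetical, and the timings have to come from your own measurements.

```python
def gemm_gflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) GEMM, counting 2*m*n*k floating-point ops."""
    return 2.0 * m * n * k / seconds / 1e9


def relative_to_baseline(tuned_seconds, baseline_seconds):
    """Fraction of baseline (e.g. cuBLAS) performance reached by the tuned kernel.

    1.0 means parity; values below 1.0 mean the tuned kernel is slower.
    """
    return baseline_seconds / tuned_seconds
```

For example, a kernel that takes twice as long as the baseline for the same shape gives `relative_to_baseline(2.0, 1.0) == 0.5`.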