I am following the example here: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html to write some form of matrix multiplication and run it on GPU. Here are two observations that I need help to explain:
The sizes of the input matrices are NxL and LxM. If the size L is a constant known during compilation, the generated code runs about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn’t change.
The tutorial suggested using `s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner)`. I understand why that might be useful, but in practice, not having `reorder` at all works better than this. Also, I don’t understand why `k` comes before `i_inner, j_inner` and not after.
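For reference, the loop nest that this `reorder` call describes can be sketched in plain Python (a hand-written illustration of the schedule's loop order, not TVM-generated code; the tile size `bn` and the use of square tiles are my own assumptions):

```python
def blocked_matmul(A, B, N, L, M, bn):
    # Z[i, j] = sum_k A[i, k] * B[k, j], with i split into (i_outer, i_inner)
    # and j split into (j_outer, j_inner), each tile of size bn.
    # The loops run in the order i_outer, j_outer, k, i_inner, j_inner,
    # mirroring the reorder() call from the tutorial (k before the inner loops).
    Z = [[0.0] * M for _ in range(N)]
    for i_outer in range(N // bn):
        for j_outer in range(M // bn):
            for k in range(L):
                for i_inner in range(bn):
                    for j_inner in range(bn):
                        i = i_outer * bn + i_inner
                        j = j_outer * bn + j_inner
                        Z[i][j] += A[i][k] * B[k][j]
    return Z
```

Writing it out this way is how I have been reasoning about the schedule; with `k` after the inner loops, each `Z[i][j]` accumulation would instead complete before moving to the next output element.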
This other tutorial (https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py) discusses more optimization approaches. I am curious which of them are applicable to GPUs, since that tutorial focuses on CPUs.