Optimizing matrix multiplication for GPU


I am following the example here: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html to write some form of matrix multiplication and run it on a GPU. Here are a few observations that I need help to explain:

  • The sizes of the input matrices are NxL and LxM. If L is a constant known during compilation, the code is about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn’t change.

  • The tutorial suggested using s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner). I understand why that might be useful, but in practice, not using reorder at all performs better than this. Also, I don’t understand why k comes before i_inner and j_inner rather than after them.

  • This other tutorial (https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py) discusses more optimization approaches. I am curious which of them are applicable to GPUs (the tutorial focuses on CPUs).
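For context on the reorder question in the second bullet, here is a plain-Python sketch (not TVM; block size and matrix sizes are just illustrative) of the loop nest that s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner) describes. With k placed outside the two inner tile loops, each k iteration updates an entire tile of Z, so the loads of A[i, k] and B[k, j] get reused across the inner loops instead of being refetched per output element:

```python
import numpy as np

def blocked_matmul(A, B, bs=2):
    """Matmul with loop order (i_outer, j_outer, k, i_inner, j_inner).

    Assumes the output dimensions N and M are multiples of bs,
    purely to keep the sketch short.
    """
    N, L = A.shape
    L2, M = B.shape
    assert L == L2 and N % bs == 0 and M % bs == 0
    Z = np.zeros((N, M))
    for i_outer in range(0, N, bs):
        for j_outer in range(0, M, bs):
            for k in range(L):
                # Each k iteration updates the whole bs x bs tile of Z,
                # reusing A[i_inner, k] across the j_inner loop.
                for i_inner in range(i_outer, i_outer + bs):
                    for j_inner in range(j_outer, j_outer + bs):
                        Z[i_inner, j_inner] += A[i_inner, k] * B[k, j_inner]
    return Z

A = np.random.rand(4, 6)
B = np.random.rand(6, 4)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Putting k after i_inner and j_inner would instead compute one output element fully at a time, which gives less reuse of the loaded operands within a tile.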



I think that tutorial doesn’t really push the performance, since it mainly focuses on how to create a tuning space. For example, the throughput shown in the log is just 10+ GFlop/s, which is far from what GEMM should achieve. Maybe that’s also why the constant shape doesn’t bring much speedup.
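For reference, GEMM throughput is conventionally counted as 2*N*M*L floating-point operations (one multiply and one add per accumulated term) divided by the elapsed time. A quick sketch (the sizes and timing below are made-up example numbers, not from the tutorial's log):

```python
def gemm_gflops(N, M, L, seconds):
    """Throughput of an NxL @ LxM matmul: 2*N*M*L flops over elapsed time."""
    return 2.0 * N * M * L / seconds / 1e9

# A hypothetical 1024x1024x1024 GEMM finishing in 0.2 s:
print(gemm_gflops(1024, 1024, 1024, 0.2))  # ~10.7 GFlop/s
```

Modern GPUs reach multiple TFlop/s on FP32 GEMM, so a result in the low tens of GFlop/s indicates the schedule is far from the hardware roofline.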

The best way to understand how to use TVM to optimize GEMM on GPU, in my opinion, is to read the TOPI scheduling implementation here. You might be interested in the functions schedule_dense_small_batch and schedule_dense_large_batch.


Very cool. Thanks @comaniac