Optimizing matrix multiplication for GPU


I am following the example here: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html to write some form of matrix multiplication and run it on a GPU. Here are a few observations that I need help to explain:

  • The sizes of the input matrices are NxL and LxM. If L is a constant known during compilation, the code is about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn’t change.

  • The tutorial suggested using s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner). I understand why that might be useful, but in practice, not using reorder at all performs better than this. Also, I don’t understand why k comes before i_inner and j_inner rather than after them.

  • This other tutorial (https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py) discusses more optimization approaches. I am curious which of them are applicable to GPUs (the tutorial focuses on CPUs).
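For context on the reorder question in the second bullet, here is a plain-Python sketch (not TVM; block size and matrix sizes are just illustrative) of the loop nest that s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner) describes. With k placed outside the two inner tile loops, each k iteration updates an entire tile of Z, so the loads of A[i, k] and B[k, j] get reused across the inner loops instead of being refetched per output element:

```python
import numpy as np

def blocked_matmul(A, B, bs=2):
    """Matmul with loop order (i_outer, j_outer, k, i_inner, j_inner).

    Assumes the output dimensions N and M are multiples of bs,
    purely to keep the sketch short.
    """
    N, L = A.shape
    L2, M = B.shape
    assert L == L2 and N % bs == 0 and M % bs == 0
    Z = np.zeros((N, M))
    for i_outer in range(0, N, bs):
        for j_outer in range(0, M, bs):
            for k in range(L):
                # Each k iteration updates the whole bs x bs tile of Z,
                # reusing A[i_inner, k] across the j_inner loop.
                for i_inner in range(i_outer, i_outer + bs):
                    for j_inner in range(j_outer, j_outer + bs):
                        Z[i_inner, j_inner] += A[i_inner, k] * B[k, j_inner]
    return Z

A = np.random.rand(4, 6)
B = np.random.rand(6, 4)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Putting k after i_inner and j_inner would instead compute one output element fully at a time, which gives less reuse of the loaded operands within a tile.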



I think that tutorial doesn’t really push the performance, since it mainly focuses on how to create a tuning space. For example, the throughput shown in the log is just 10+ GFlop/s, which is far from what GEMM should achieve. Maybe that’s also why the constant shape doesn’t bring much speedup.
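For reference, GEMM throughput is conventionally counted as 2*N*M*L floating-point operations (one multiply and one add per accumulated term) divided by the elapsed time. A quick sketch (the sizes and timing below are made-up example numbers, not from the tutorial's log):

```python
def gemm_gflops(N, M, L, seconds):
    """Throughput of an NxL @ LxM matmul: 2*N*M*L flops over elapsed time."""
    return 2.0 * N * M * L / seconds / 1e9

# A hypothetical 1024x1024x1024 GEMM finishing in 0.2 s:
print(gemm_gflops(1024, 1024, 1024, 0.2))  # ~10.7 GFlop/s
```

Modern GPUs reach multiple TFlop/s on FP32 GEMM, so a result in the low tens of GFlop/s indicates the schedule is far from the hardware roofline.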

The best way to understand how to use TVM to optimize GEMM on GPU, in my opinion, is to read the TOPI scheduling implementation here. You might be interested in the functions schedule_dense_small_batch and schedule_dense_large_batch.


Very cool. Thanks @comaniac