Optimizing matrix multiplication for GPU

I am following the example here: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html to write a form of matrix multiplication and run it on GPU. Here are a few observations that I need help explaining:

  • The input matrices have sizes NxL and LxM. If L is a constant known during compilation, the code is about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn't change.

  • The tutorial suggested using s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner). I understand why that might be useful, but in practice not having the reorder at all works better than this. Also, I don't understand why k comes before i_inner and j_inner rather than after (see the sketch after this list).

  • This other tutorial (https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py) discusses more optimization approaches. I am curious which of them are applicable to GPUs (the tutorial focuses on CPU).
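
For reference, here is a minimal sketch of the split/reorder schedule the first tutorial describes; the sizes and the tile factor bn are my own illustrative choices, not values from the tutorial:

    # Minimal sketch of the tiled matmul schedule discussed above.
    # N, L, M and the tile factor `bn` are illustrative assumptions.
    import tvm
    from tvm import te

    N, L, M = 1024, 1024, 1024
    bn = 32

    A = te.placeholder((N, L), name='A', dtype='float32')
    B = te.placeholder((L, M), name='B', dtype='float32')
    k = te.reduce_axis((0, L), name='k')
    Z = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name='Z')

    s = te.create_schedule(Z.op)
    i_outer, i_inner = s[Z].split(Z.op.axis[0], factor=bn)
    j_outer, j_inner = s[Z].split(Z.op.axis[1], factor=bn)
    # k sits between the outer and inner tile loops, so one bn x bn output tile
    # is accumulated across the whole reduction before moving to the next tile.
    s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner)
    print(tvm.lower(s, [A, B, Z], simple_mode=True))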

Thanks

I think that tutorial doesn't really push the performance, since it mainly focuses on how to create a tuning space. For example, the throughput shown in the log is just 10+ GFlop/s, which is far from what a GEMM should achieve. Maybe that's also why the constant shape doesn't bring much speedup.

The best way to understand how to use TVM to optimize GEMM on GPU, in my opinion, is to read the TOPI scheduling implementation here. You might be interested in the functions schedule_dense_small_batch and schedule_dense_large_batch.

Very cool. Thanks @comaniac

@comaniac, so I followed schedule_dense_small_batch to implement batched matrix multiplication, and it gave a nice speedup. However, it is still 10x-15x slower than PyTorch’s torch.bmm depending on the GPU.

Here’s the full code:

Any suggestions on what I might be missing?

Thanks

Did you compare the performance with the small-batch schedule? I remember the optimizations applied to that schedule were also limited. If yours is better than that schedule in TOPI, then you are eligible to help us improve the quality of the TOPI schedules :slight_smile:

How would you do that? I tried the following:

    # Imports assumed for this snippet (TVM with the standalone `topi` package).
    import tvm
    from tvm import te
    import topi

    bsz = te.var('bsz')
    d1 = te.var('d1')
    d2 = te.var('d2')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.cuda():
        s = topi.cuda.schedule_reduce(R)

but got

Traceback (most recent call last):

  File "scripts/band_mm_minimal.py", line 94, in <module>
    mm_fun_topi = _compile_function_topi()

  File "scripts/band_mm_minimal.py", line 63, in _compile_function_topi
    s = topi.cuda.schedule_reduce(R)

  File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 147, in schedule_reduce
    traverse_after_reduce(outs[0].op)

  File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 143, in traverse_after_reduce
    raise RuntimeError("Unsupported operator: %s" % operator.tag)

RuntimeError: Unsupported operator: batch_matmul

Which schedule should I use?

And s = topi.cuda.batch_matmul.schedule_batch_matmul(R) is 280x slower than PyTorch.

    bsz = te.var('bsz')
    d1 = te.var('d1')
    d2 = te.var('d2')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.cuda():
        s = topi.cuda.batch_matmul.schedule_batch_matmul(R)

Most of the ops so far are specialized for constant dimensions, so in this case we will need to put in constant dimensions (instead of te.var). Because batch_matmul itself is compute-intensive, we might also benefit from AutoTVM instead of using the default schedule.

Note that specialization is necessary for this type of kernel; it is also what cuDNN (which PyTorch calls under the hood) does. The easiest way to make use of such kernels is to generate a few candidates and fall back to a normal kernel when the shape does not match. There are also ongoing efforts in the community to work on automatic bucketing etc.
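
As a rough Python-side illustration of that candidate-plus-fallback idea (compile_bmm and fallback_bmm are hypothetical helpers, and the "key" shapes are made up):

    # Sketch only: pre-build a few shape-specialized kernels and fall back otherwise.
    # `compile_bmm` and `fallback_bmm` are hypothetical helpers, not TVM APIs.
    KEY_SHAPES = [(24, 16384), (24, 8192)]          # made-up (bsz, d1) pairs

    kernels = {(bsz, d1): compile_bmm(bsz, d1, d2=512, d3=64)
               for bsz, d1 in KEY_SHAPES}

    def bmm(a, b, out):
        """Use a specialized kernel when the shape matches, otherwise fall back."""
        key = (a.shape[0], a.shape[1])
        kernels.get(key, fallback_bmm)(a, b, out)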

Thanks @tqchen, this was helpful. Using constants as follows:

    bsz = 24
    d1 = 16384
    d2 = 512
    d3 = 64
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    s = topi.cuda.batch_matmul.schedule_batch_matmul(R)

makes it 5x faster, which is still 3x slower than PyTorch, but much better. So,

  1. How do I improve this further? Use AutoTVM? I have tried it before for custom schedules that I implemented, but I am not sure how to do that for schedule_batch_matmul.

  2. More importantly, in my application d2 and d3 are constants, but bsz and d1 are not; bsz has 16 possible values and d1 has 32 possible values, for a total of 512 configurations. Does that mean I need to compile 512 different versions of the code? There must be a better solution. Maybe the schedule can be updated to support a range of values instead of constants?

Others might be able to jump in to comment on running AutoTVM, which will certainly help.

The easiest way to make use of such kernels is to generate a few candidates and fall back to a normal kernel when the shape does not match. There are also ongoing efforts in the community to work on automatic bucketing etc.

The above comment (about generating a few key ones and falling back to a normal kernel) is a quick workaround for the constant-shape constraint. I agree that we will need better solutions for dynamic workloads; there are some ongoing efforts, see the dynamic workload presentations at https://tvmconf.org/#about-tvmconf

generating a few key ones and falling back to a normal kernel

The 16 values for bsz and the 32 values for d1 are the key ones; e.g., d1 ranges from 1 to 16384, but I can pad it to multiples of 512.
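
For concreteness, the padding I have in mind is something like this (a numpy sketch; only d1 is padded with zero rows, and the extra output rows are sliced off afterwards):

    import numpy as np

    def pad_d1(a, bucket=512):
        """Pad the d1 axis of a (bsz, d1, d3) array up to the next multiple of `bucket`."""
        bsz, d1, d3 = a.shape
        d1_padded = -(-d1 // bucket) * bucket   # round d1 up to a multiple of `bucket`
        padded = np.zeros((bsz, d1_padded, d3), dtype=a.dtype)
        padded[:, :d1, :] = a
        return padded

    # The kernel compiled for the padded size produces (bsz, d1_padded, d2);
    # the real result is just the slice [:, :d1, :] of that output.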

@comaniac, do you know about this? How would you optimize topi.cuda.batch_matmul.schedule_batch_matmul(R) using AutoTVM?

I just checked it out, and unfortunately the batch_matmul schedule for CUDA does not have a tuning space (https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/batch_matmul.py). A schedule with a tuning space should have a set of knobs defined in the schedule, such as this. You are very welcome to add a tuning space to topi.cuda.batch_matmul.schedule_batch_matmul.
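
For illustration only (not the actual TOPI code), a tuning space is usually added by turning hard-coded split factors into knobs, roughly like this; the knob names and the schedule skeleton are assumptions, and GPU thread binding is omitted:

    # Rough sketch of defining AutoTVM knobs for a batch_matmul schedule.
    # Not the real TOPI schedule: knob names are made up and thread binding
    # is omitted for brevity.  `cfg` is the AutoTVM config passed in.
    from tvm import te

    def _schedule_with_knobs(cfg, C):
        s = te.create_schedule(C.op)
        b, y, x = s[C].op.axis
        (k,) = s[C].op.reduce_axis
        cfg.define_split("tile_y", y, num_outputs=2)   # replaces a hard-coded factor
        cfg.define_split("tile_x", x, num_outputs=2)
        cfg.define_split("tile_k", k, num_outputs=2)
        yo, yi = cfg["tile_y"].apply(s, C, y)
        xo, xi = cfg["tile_x"].apply(s, C, x)
        ko, ki = cfg["tile_k"].apply(s, C, k)
        s[C].reorder(b, yo, xo, ko, ki, yi, xi)
        return s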

After a tuning space has been created, you could follow this tutorial to register the schedule function as an AutoTVM task and tune it to see the performance improvement.
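
Roughly, the tuning flow from that tutorial looks like this; my_batch_matmul_template is a hypothetical template you would have registered, and the exact task-creation signature depends on your TVM version:

    # Hedged sketch of the AutoTVM tuning loop from the tune_simple_template
    # tutorial.  `my_batch_matmul_template` is hypothetical; the task-creation
    # call may differ slightly between TVM versions.
    import logging, sys
    import tvm
    from tvm import autotvm

    logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

    task = autotvm.task.create(my_batch_matmul_template,
                               args=(24, 16384, 512, 64, 'float32'),
                               target='cuda')

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=3, min_repeat_ms=100))

    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('batch_matmul.log')])

    # Compile with the best configuration found during tuning.
    with autotvm.apply_history_best('batch_matmul.log'):
        with tvm.target.cuda():
            s, args = my_batch_matmul_template(24, 16384, 512, 64, 'float32')
            mm = tvm.build(s, args)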

Thanks @comaniac.

Do you know which params in schedule_batch_matmul need tuning? Probably the 16s and the 64s?

Also, do you have a high-level description of that schedule and how it splits the workload?

Sorry, I didn't dive into the batch_matmul schedule function, so I have no idea. You could do some experiments and file a PR once you have identified a tuning space.

I didn’t quite get your second question. Are you asking how define_split works?

Sorry, the question wasn't clear. I was wondering if there is documentation or a high-level description of the schedule (there are no comments in the code), but I guess the answer is no.

Sorry, as you expected, we don't have such documentation. All schedules were developed and committed by contributors, and since end users are not supposed to read the schedule functions, how a schedule function was developed is not documented. The closest thing you can find is a tutorial like this one.

@comaniac, hopefully one last question: do you know how to save multiple modules into one .so file? I want to save multiple versions of the function as @tqchen suggested, but tvm.runtime.export_library saves only one, and tvm.cc.create_shared doesn't link the host code with the target code the way tvm.runtime.export_library does. I am sure this can be done with gcc or something, but I was wondering if TVM already has a solution, given that this is a common use case.

AFAIK you can only keep one module in one .so file. If you want to keep multiple versions to deal with dynamic workloads, it's better to wait for the corresponding support from @haichen. Otherwise, you can only integrate all versions into a single schedule function for now.

Given that it is not easy to get multiple versions of the function compiled together, I tried to see if schedule_batch_matmul can be changed to work with dynamic input sizes (at the expense of being slightly less efficient). It seems that this line and the one above can be changed to the constants y_bn = x_bn = 64. However, when I do so, I get the following error:

TVMError: Check failed: condition_counter() == 0 (1 vs. 0) : Cannot insert syncs inside condition

The full codegen and the error are here. Any suggestions on what could be wrong? My guess is that some of the scheduling primitives still rely on constant input sizes, but I couldn't pinpoint which one. It is probably compute_at, but I am sure I have used compute_at with variable input sizes before and it worked nicely. Any ideas?