Optimizing matrix multiplication for GPU

I am following the example here: https://docs.tvm.ai/tutorials/autotvm/tune_simple_template.html to write a form of matrix multiplication and run it on GPU. Here are a few observations that I need help explaining:

  • The input matrices have sizes NxL and LxM. If L is a constant known during compilation, the code is about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn't change.

  • The tutorial suggested using s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner). I understand why that might be useful, but in practice not having the reorder at all works better than this. Also, I don't understand why k comes before i_inner and j_inner rather than after (see the sketch after this list).

  • This other tutorial (https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py) discusses more optimization approaches. I am curious which of them are applicable to GPUs (the tutorial focuses on CPU).
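
For reference, here is a minimal sketch of the split/reorder schedule the first tutorial describes; the sizes and the tile factor bn are my own illustrative choices, not values from the tutorial:

    # Minimal sketch of the tiled matmul schedule discussed above.
    # N, L, M and the tile factor `bn` are illustrative assumptions.
    import tvm
    from tvm import te

    N, L, M = 1024, 1024, 1024
    bn = 32

    A = te.placeholder((N, L), name='A', dtype='float32')
    B = te.placeholder((L, M), name='B', dtype='float32')
    k = te.reduce_axis((0, L), name='k')
    Z = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name='Z')

    s = te.create_schedule(Z.op)
    i_outer, i_inner = s[Z].split(Z.op.axis[0], factor=bn)
    j_outer, j_inner = s[Z].split(Z.op.axis[1], factor=bn)
    # k sits between the outer and inner tile loops, so one bn x bn output tile
    # is accumulated across the whole reduction before moving to the next tile.
    s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner)
    print(tvm.lower(s, [A, B, Z], simple_mode=True))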

Thanks

I think that tutorial doesn't really push the performance, since it mainly focuses on how to create a tuning space. For example, the throughput shown in the log is just 10+ GFlop/s, which is far from what a GEMM should achieve. Maybe that's also why the constant shape doesn't bring much speedup.

The best way to understand how to use TVM to optimize GEMM on GPU, in my opinion, is to read the TOPI scheduling implementation here. You might be interested in the functions schedule_dense_small_batch and schedule_dense_large_batch.

Very cool. Thanks @comaniac

@comaniac, so I followed schedule_dense_small_batch to implement batched matrix multiplication, and it gave a nice speedup. However, it is still 10x-15x slower than PyTorch’s torch.bmm depending on the GPU.

Here’s the full code:

Any suggestions on what I might be missing?

Thanks

Did you compare the performance with the small-batch schedule? I remember the optimizations applied to that schedule were also limited. If yours is better than that schedule in TOPI, then you are eligible to help us improve the quality of the TOPI schedules :slight_smile:

How would you do that? I tried the following:

    # Imports assumed for this snippet (TVM with the standalone `topi` package).
    import tvm
    from tvm import te
    import topi

    bsz = te.var('bsz')
    d1 = te.var('d1')
    d2 = te.var('d2')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.cuda():
        s = topi.cuda.schedule_reduce(R)

but got

Traceback (most recent call last):

  File "scripts/band_mm_minimal.py", line 94, in <module>
    mm_fun_topi = _compile_function_topi()

  File "scripts/band_mm_minimal.py", line 63, in _compile_function_topi
    s = topi.cuda.schedule_reduce(R)

  File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 147, in schedule_reduce
    traverse_after_reduce(outs[0].op)

  File "/usr/tvm/topi/python/topi/cuda/reduction.py", line 143, in traverse_after_reduce
    raise RuntimeError("Unsupported operator: %s" % operator.tag)

RuntimeError: Unsupported operator: batch_matmul

Which schedule should I use?

And s = topi.cuda.batch_matmul.schedule_batch_matmul(R) is 280x slower than PyTorch.

    bsz = te.var('bsz')
    d1 = te.var('d1')
    d2 = te.var('d2')
    d3 = te.var('d3')
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    with tvm.target.cuda():
        s = topi.cuda.batch_matmul.schedule_batch_matmul(R)

Most of the ops so far are specialized for constant dimensions, so in this case we will need to put in constant dimensions (instead of te.var). Because batch_matmul itself is compute-intensive, we might also benefit from AutoTVM instead of using the default schedule.

Note that specialization is necessary for this type of kernel; it is also what cuDNN (which PyTorch calls under the hood) does. The easiest way to make use of such kernels is to generate a few candidates and fall back to a normal kernel when the shape does not match. There are also ongoing efforts in the community to work on automatic bucketing etc.
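
As a rough Python-side illustration of that candidate-plus-fallback idea (compile_bmm and fallback_bmm are hypothetical helpers, and the "key" shapes are made up):

    # Sketch only: pre-build a few shape-specialized kernels and fall back otherwise.
    # `compile_bmm` and `fallback_bmm` are hypothetical helpers, not TVM APIs.
    KEY_SHAPES = [(24, 16384), (24, 8192)]          # made-up (bsz, d1) pairs

    kernels = {(bsz, d1): compile_bmm(bsz, d1, d2=512, d3=64)
               for bsz, d1 in KEY_SHAPES}

    def bmm(a, b, out):
        """Use a specialized kernel when the shape matches, otherwise fall back."""
        key = (a.shape[0], a.shape[1])
        kernels.get(key, fallback_bmm)(a, b, out)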

Thanks @tqchen, this was helpful. Using constants as follows:

    bsz = 24
    d1 = 16384
    d2 = 512
    d3 = 64
    A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
    B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
    R = topi.nn.batch_matmul(A, B)
    s = topi.cuda.batch_matmul.schedule_batch_matmul(R)

makes it 5x faster, which is still 3x slower than PyTorch, but much better. So,

  1. How do I improve this further? Use AutoTVM? I have tried it before for custom schedules that I implemented, but I am not sure how to do that for schedule_batch_matmul.

  2. More importantly, in my application d2 and d3 are constants, but bsz and d1 are not; bsz has 16 possible values and d1 has 32 possible values, for a total of 512 configurations. Does that mean I need to compile 512 different versions of the code? There must be a better solution. Maybe the schedule can be updated to support a range of values instead of constants?

Others might be able to jump in to comment on running AutoTVM, which will certainly help.

The easiest way to make use of such kernels is to generate a few candidates and fall back to a normal kernel when the shape does not match. There are also ongoing efforts in the community to work on automatic bucketing etc.

The above comment (about generating a few key ones and falling back to a normal kernel) is a quick workaround for the constant-shape constraint. I agree that we will need better solutions for dynamic workloads; there are some ongoing efforts, see the dynamic workload presentations at https://tvmconf.org/#about-tvmconf

generating a few key ones and falling back to a normal kernel

The 16 values for bsz and the 32 values for d1 are the key ones; e.g., d1 ranges from 1 to 16384, but I can pad it to multiples of 512.
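
For concreteness, the padding I have in mind is something like this (a numpy sketch; only d1 is padded with zero rows, and the extra output rows are sliced off afterwards):

    import numpy as np

    def pad_d1(a, bucket=512):
        """Pad the d1 axis of a (bsz, d1, d3) array up to the next multiple of `bucket`."""
        bsz, d1, d3 = a.shape
        d1_padded = -(-d1 // bucket) * bucket   # round d1 up to a multiple of `bucket`
        padded = np.zeros((bsz, d1_padded, d3), dtype=a.dtype)
        padded[:, :d1, :] = a
        return padded

    # The kernel compiled for the padded size produces (bsz, d1_padded, d2);
    # the real result is just the slice [:, :d1, :] of that output.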

@comaniac, do you know about this? How would you optimize topi.cuda.batch_matmul.schedule_batch_matmul(R) using AutoTVM?

I just checked it out, and unfortunately the batch_matmul schedule for CUDA does not have a tuning space (https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/batch_matmul.py). A schedule with a tuning space should have a set of knobs defined in the schedule, such as this. You are very welcome to add a tuning space to topi.cuda.batch_matmul.schedule_batch_matmul.
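
For illustration only (not the actual TOPI code), a tuning space is usually added by turning hard-coded split factors into knobs, roughly like this; the knob names and the schedule skeleton are assumptions, and GPU thread binding is omitted:

    # Rough sketch of defining AutoTVM knobs for a batch_matmul schedule.
    # Not the real TOPI schedule: knob names are made up and thread binding
    # is omitted for brevity.  `cfg` is the AutoTVM config passed in.
    from tvm import te

    def _schedule_with_knobs(cfg, C):
        s = te.create_schedule(C.op)
        b, y, x = s[C].op.axis
        (k,) = s[C].op.reduce_axis
        cfg.define_split("tile_y", y, num_outputs=2)   # replaces a hard-coded factor
        cfg.define_split("tile_x", x, num_outputs=2)
        cfg.define_split("tile_k", k, num_outputs=2)
        yo, yi = cfg["tile_y"].apply(s, C, y)
        xo, xi = cfg["tile_x"].apply(s, C, x)
        ko, ki = cfg["tile_k"].apply(s, C, k)
        s[C].reorder(b, yo, xo, ko, ki, yi, xi)
        return s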

After a tuning space has been created, you could follow this tutorial to register the schedule function as an AutoTVM task and tune it to see the performance improvement.
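
Roughly, the tuning flow from that tutorial looks like this; my_batch_matmul_template is a hypothetical template you would have registered, and the exact task-creation signature depends on your TVM version:

    # Hedged sketch of the AutoTVM tuning loop from the tune_simple_template
    # tutorial.  `my_batch_matmul_template` is hypothetical; the task-creation
    # call may differ slightly between TVM versions.
    import logging, sys
    import tvm
    from tvm import autotvm

    logging.getLogger('autotvm').addHandler(logging.StreamHandler(sys.stdout))

    task = autotvm.task.create(my_batch_matmul_template,
                               args=(24, 16384, 512, 64, 'float32'),
                               target='cuda')

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=3, min_repeat_ms=100))

    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000,
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('batch_matmul.log')])

    # Compile with the best configuration found during tuning.
    with autotvm.apply_history_best('batch_matmul.log'):
        with tvm.target.cuda():
            s, args = my_batch_matmul_template(24, 16384, 512, 64, 'float32')
            mm = tvm.build(s, args)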

Thanks @comaniac.

Do you know which params in schedule_batch_matmul need tuning? Probably the 16s and the 64s?

Also, do you have a high-level description of that schedule and how it splits the workload?

Sorry, I didn't dive into the batch_matmul schedule function, so I have no idea. You could do some experiments and file a PR once you have identified a tuning space.

I didn’t quite get your second question. Are you asking how define_split works?

Sorry, the question wasn't clear. I was wondering if there is documentation or a high-level description of the schedule (there are no comments in the code), but I guess the answer is no.

Sorry, as you expected, we don't have such documentation. All schedules were developed and committed by contributors, and since end users are not supposed to read the schedule functions, how a schedule function was developed is not documented. The closest thing you can find is a tutorial like this one.

@comaniac, hopefully one last question: do you know how to save multiple modules into one .so file? I want to save multiple versions of the function as @tqchen suggested, but tvm.runtime.export_library saves only one, and tvm.cc.create_shared doesn't link the host code with the target code the way tvm.runtime.export_library does. I am sure this can be done with gcc or something, but I was wondering if TVM already has a solution, given that this is a common use case.

AFAIK you can only keep one module in one .so file. If you want to keep multiple versions to deal with dynamic workloads, it's better to wait for the corresponding support from @haichen. Otherwise, you can only integrate all versions into a single schedule function for now.

Given that it is not easy to get multiple versions of the function compiled together, I tried to see if schedule_batch_matmul can be changed to work with dynamic input sizes (at the expense of being slightly less efficient). It seems that this line and the one above can be changed to the constants y_bn = x_bn = 64. However, when I do so, I get the following error:

TVMError: Check failed: condition_counter() == 0 (1 vs. 0) : Cannot insert syncs inside condition

The full codegen and the error are here. Any suggestions on what could be wrong? My guess is that some of the scheduling primitives still rely on constant input sizes, but I couldn't pinpoint which one. It is probably compute_at, but I am sure I have used compute_at with variable input sizes before and it worked nicely. Any ideas?