[TE][CUDA] Issues with high perf GEMM api

@tqchen following up from https://github.com/apache/incubator-tvm/pull/4270, the following gist is the example that motivated this. The error I get from it is: "TVMError: Not all Vars are passed in api_args: 'threadIdx.x' does not appear in api_args".

There seem to be situations where bound inference will "pass back" a domain of [threadIdx.x, 1], which causes problems when the corresponding attr statement comes later in the IR. This can actually surface as two errors: the first is the one above; if that is silenced, a use-before-define error pops up later.

The schedule in the gist has the basic structure of a high-performing GEMM: we want to do vector loads from the input matrices as we reduce along K, reduce that vector, then do a thread all-reduce. The other thing I haven't been able to get working is caching the block tile in shared memory before starting the computation. There is quite a bit of opportunity for data reuse, so it is quite a waste not to be able to cache it directly in shared memory; you can see the last few cache_read statements.
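For concreteness, here is a stripped-down sketch of the kind of shared-memory staging I mean, written against the tvm.te API. The names, shapes, and tile sizes are made up for illustration (they are not the ones from the gist), and it leaves out the rfactor/all-reduce part of the real schedule; it only shows cache_read of the block tiles into shared memory with reuse across a chunked k loop.

```python
import tvm
from tvm import te

# Illustrative sketch only: both inputs are stored [rows, k] and the
# reduction runs along k; the block tiles of A and B are staged in
# shared memory one k-chunk at a time.
M = N = K = 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((N, K), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[j, k], axis=k), name="C")

s = te.create_schedule(C.op)
CC = s.cache_write(C, "local")         # per-thread accumulator
AA = s.cache_read(A, "shared", [CC])   # block tile of A in shared memory
BB = s.cache_read(B, "shared", [CC])   # block tile of B in shared memory

tx = te.thread_axis("threadIdx.x")
ty = te.thread_axis("threadIdx.y")

# One 32x32 output tile per block, one output element per thread.
bi, ti = s[C].split(C.op.axis[0], factor=32)
bj, tj = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(bi, bj, ti, tj)
s[C].bind(bi, te.thread_axis("blockIdx.y"))
s[C].bind(bj, te.thread_axis("blockIdx.x"))
s[C].bind(ti, ty)
s[C].bind(tj, tx)
s[CC].compute_at(s[C], tj)

# Chunk the reduction; each k-chunk of the A/B block tiles is loaded into
# shared memory once per iteration and reused by every thread in the block.
ko, ki = s[CC].split(CC.op.reduce_axis[0], factor=32)
for load in (AA, BB):
    s[load].compute_at(s[CC], ko)
    y, x = s[load].op.axis
    yo, _ = s[load].split(y, nparts=32)   # cooperative fetch: spread the
    xo, _ = s[load].split(x, nparts=32)   # copy across the 32x32 threads
    s[load].bind(yo, ty)
    s[load].bind(xo, tx)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```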

I am able to get these examples working with some pretty invasive modifications that I'm not terribly happy with, but if anyone is curious, you can find them here: https://github.com/csarofeen/tvm/tree/bounds_changes

Thanks @csarofeen! It would also be great if you could paste the generated CUDA code for reference. cc @vinx13 @Hzfengsy can you also take a look at whether the desired pattern can be achieved in an alternative way?

For reference, here is an implementation of an efficient transposed GEMM: https://github.com/apache/incubator-tvm/blob/master/topi/recipe/gemm/cuda_gemm_square.py

I'm sorry, I don't quite follow your idea. Is it an improvement over the current GEMM schedule shown above? It would help a lot if you could provide your CUDA code (if you have trouble generating it in TVM, could you please write it the way you want it to look?). Thank you!

I'm afraid the CUDA code would make the conversation more complicated rather than easier, as it includes other concepts that aren't expressible today in TVM. However, if you want to hear about GEMM design, we could do a VC and go through a few concepts.

There are two big differences between the code you referenced and what I was trying to achieve. First, you're working on tensors where the reduced dimension is the outermost one, which makes loading data more straightforward, since the dimension of high data reuse is then the innermost dimension. When a matrix instead comes in as [m, k], where k is the reduction dimension, we would like to load efficiently along the k dimension with a minimal load size. Also, when we store that [m, k] matrix in shared memory, we would like it transposed to [k, m], since our threads will walk across the m dimension and we don't want shared-memory bank conflicts.
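As an illustration only, one way to express that transposed staging in TE is an explicit transpose stage placed in shared memory via set_scope. The names and shapes below are made up, and the block/thread bindings and tiling a real schedule would need are omitted for brevity; this is just a sketch of the layout idea, not the schedule from the gist.

```python
import tvm
from tvm import te

# A comes in as [m, k]; an explicit stage holds it re-laid-out as [k, m],
# and set_scope places that stage in shared memory so threads walking m
# read from different banks.
M, N, K = 256, 256, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((N, K), name="B")

AT = te.compute((K, M), lambda kk, m: A[m, kk], name="A_shared_T")

k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(AT[k, i] * B[j, k], axis=k), name="C")

s = te.create_schedule(C.op)
s[AT].set_scope("shared")
# In a real schedule AT would be compute_at a k-chunk of the block tile and
# its copy loops bound to the block's threads; that is left out here.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```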

The other big difference is what we refer to as split-k. If m and n are small enough that it is difficult to get good GPU parallelization from those dimensions alone, we also parallelize the k dimension within the block. Think of the degenerate case of a dot product: we wouldn't want a single thread reducing that entire dimension.
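A minimal sketch of split-k on that degenerate dot-product case (sizes and the factor of 64 are illustrative, not from the gist): binding part of the reduction axis to threadIdx.x makes TVM emit a cross-thread all-reduce, so the block's threads cooperate on the reduction.

```python
import tvm
from tvm import te

# Illustrative dot product: threads split the reduction among themselves,
# each accumulating a partial sum over ko, then all-reducing across ki.
K = 4096
a = te.placeholder((K,), name="a")
b = te.placeholder((K,), name="b")
k = te.reduce_axis((0, K), name="k")
c = te.compute((1,), lambda i: te.sum(a[k] * b[k], axis=k), name="c")

s = te.create_schedule(c.op)
ko, ki = s[c].split(k, factor=64)
s[c].bind(c.op.axis[0], te.thread_axis("blockIdx.x"))
# Binding a reduction axis to a thread axis triggers the cross-thread
# all-reduce lowering; ko stays as a serial per-thread accumulation loop.
s[c].bind(ki, te.thread_axis("threadIdx.x"))

print(tvm.lower(s, [a, b, c], simple_mode=True))
```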

The last thing we would ideally like is vectorized loads/stores to/from shared and global memory, e.g. float4, half8, etc. This is another reason I used rfactor the way I did: it lets us do partial accumulation over the vectorized values and then reduce them at a later stage.
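A sketch of that rfactor idea on a simple 1-D row reduction (shapes and the vector width of 4 are illustrative, and this is a simplification of what the gist does): k is split by the vector width, rfactor turns the lane index into a spatial axis that can be vectorized, and the per-lane partial sums are combined in a second, small reduction afterwards.

```python
import tvm
from tvm import te

# Illustrative vectorized partial accumulation via rfactor.
N, K = 128, 1024
A = te.placeholder((N, K), name="A")
k = te.reduce_axis((0, K), name="k")
B = te.compute((N,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(k, factor=4)            # ki indexes the 4 vector lanes
BF = s.rfactor(B, ki, factor_axis=1)        # BF[i, lane]: per-lane partials
s[BF].vectorize(s[BF].op.axis[1])           # accumulate with vector ops
s[BF].compute_at(s[B], s[B].op.axis[0])     # keep the partials close to B

print(tvm.lower(s, [A, B], simple_mode=True))
```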

I hope this helps bring some clarity, though I’d also be happy to chat over VC if that would help.