Hijacking CUDA kernel launch parameters from within Graph Runtime

Hi

I am working with torchvision models that are compiled for the cuda (device) and llvm (host) targets using relay.build() in Python.

I am trying to attach 3 additional arguments to the void_args block of memory that is passed to cuLaunchKernel in tvm/runtime/cuda_module.cc for each of the kernels.

I have already modified the kernels by compiling them with clang++ and a custom pass from within tvm_callback_cuda_compile, so by the time the relay.build() stage finishes, the kernels have 3 additional arguments plus a small block of code added to them.

I figured the next step would be to pass the values of the arguments to these kernels. I added some code that handles that in GraphRuntime::CreateTVMOp, which in essence push_back’s the values onto arg_ptr->arg_values. What I didn’t account for is that these kernel launches are first invoked from the “llvm” host code, which configures the launch and passes the arguments.

Whenever I try to run the GraphRuntime ops, I get an assertion error: TVMError: Check failed: ret == 0 (-1 vs. 0) : Assert fail: (num_args == 4), fused_nn_conv2d_add_nn_relu_3: num_args should be 4. That makes sense, since the number of args passed to each op is checked in the LLVM host code.

My question is: is there a way to do this without massively modifying the LLVM host code generation phase in llvm_module.cc etc.? Ideally, I’d like to pass some arguments from Python to GraphRuntime in C++ and have them added to the kernel launches.

I cannot simply append new arguments to void_args, as I don’t know their sizes for each kernel and so cannot rebuild the block; that’s out of the question.

Any ideas would be greatly appreciated!

It would be great if you could elaborate on what additional parameters you want to pass. If you have additional parameters to pass to the kernel, ideally they should already be part of the parameter list of the TIR PrimFunc.

It is too late to do this kind of modification in the CUDA compilation phase. What you could do instead is rewrite the TIR PrimFunc at a late stage of lowering (before the host/device split) to append the parameters you want to the function, plus an additional fragment as an external function call (with these parameters).

Then the compilation flow should automatically take care of the rest of the things.

Yes! This worked for me. I also modified GraphRuntime a little to force-inject the parameters as the ops were constructed. The parameter I wanted to pass to each kernel was basically a buffer for collecting some data as the kernel progressed.

Not really my field of expertise, but could this be a feature in a future release, i.e. implemented as a pass just before SplitHostDevice? As far as I saw in the pass stack, nothing later on modifies the number or datatype of a function’s args, but I might be wrong. I guess as long as the modification is “sound” CUDA/PTX code, it should work for any kernel?

It is harder to do this at a later stage, as the caller/callee split happens in SplitHostDevice, and the additional parameters would need to be threaded through the host function, the call from the host to the device, and the device function interface.

So inserting the code before SplitHostDevice is the recommended approach here. If there is something we can formalize, e.g. performance counter collection, we can consider implementing some of these passes as optional ones.
