I am learning how to use tensorize from the tutorial "Use Tensorize to Leverage Hardware Intrinsics". The tutorial provides an example of how to write handcrafted C code and compile it so that TVM codegen can use it.
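For reference, the CPU flow I mean looks roughly like the sketch below. I reproduced it from the tutorial as I remember it, so treat the details as assumptions: the module path `tvm.contrib.utils` (older releases call it `util`) and the exact kernel signature may differ between TVM versions, and `CC_CODE` is just a name I picked for the source string.

```python
# Handcrafted C kernel, kept as a plain string so TVM can compile it later.
# CC_CODE is an illustrative name, not something defined by TVM.
CC_CODE = """
extern "C" int gemv_update(float *cc, float *aa, float *bb,
                           int m, int l, int stride) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < l; ++j) {
            cc[i] += aa[j] * bb[i * stride + j];
        }
    }
    return 0;
}
"""


def gemv_impl():
    """Compile the C source to LLVM IR text, as the tutorial does for CPU."""
    # Imported lazily so the module loads even where TVM is not installed.
    # Assumption: recent TVM exposes tvm.contrib.utils; older ones use util.
    from tvm.contrib import clang, utils

    temp = utils.tempdir()
    ll_path = temp.relpath("temp.ll")
    # clang.create_llvm requires a clang binary to be available on the system.
    return clang.create_llvm(CC_CODE, output=ll_path)
```

The returned LLVM IR is what gets passed to the `import_llvm` pragma on the CPU side, which is the step I cannot find an equivalent of for CUDA.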
I am wondering whether there is a similar approach for the CUDA backend. Do we have a similar primitive for achieving this? E.g., something similar to
s[C].pragma(x, "import_llvm", gemv_impl()) # do we have corresponding annotation(import_llvm) for CUDA ?
I tried a stupid approach. I created a new CUDA file and asked codegen_cuda to always have
Is there a smarter solution for this?
Thank you very much! : ) @yzhliu