Tensorize with handcrafted CUDA kernels

Hello all:

I am learning how to use tensorize from the tutorial "Use Tensorize to Leverage Hardware Intrinsics". The tutorial gives an example of how to write handcrafted C code and compile it so that TVM codegen can use it.

I am wondering whether there is a similar approach for the CUDA backend. Do we have a similar primitive for this, e.g. something like:

s[C].pragma(x, "import_llvm", gemv_impl())  # is there a corresponding annotation (like import_llvm) for CUDA?

I tried a naive approach: I created a new CUDA file and changed codegen_cuda so that the generated code always contains

#include <myfile.cu>
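
To make the workaround concrete, it amounts to roughly the following transformation of the generated source. This is only a sketch in plain Python, not the actual change inside codegen_cuda: the function name `prepend_include` and the stand-in kernel string are hypothetical, and `myfile.cu` is the handcrafted file mentioned above.

```python
def prepend_include(cuda_source: str, header: str = "myfile.cu") -> str:
    """Return the generated CUDA source with an #include for the handcrafted file."""
    include_line = f"#include <{header}>"
    # Skip the insertion if the include is already present, so that running
    # the step twice does not duplicate the line.
    if include_line in cuda_source.splitlines():
        return cuda_source
    return include_line + "\n" + cuda_source


# Hypothetical stand-in for the source text produced by the CUDA codegen.
generated = 'extern "C" __global__ void default_function_kernel0(float* C) {}\n'
patched = prepend_include(generated)
```

The drawback is that every generated kernel gets the include, whether or not it uses the handcrafted functions, which is why I am looking for something scoped to a schedule like the pragma above.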

Is there a smarter solution for this?

Thank you very much! : ) @yzhliu