Implement conv2d with Tensor Core: 2nd question

In my previous RFC (Implement Conv2d using Tensor Core), I described my plan to implement a convolution layer with Tensor Core in TVM. By adding new intrinsics for the wmma APIs, I have now removed all of those APIs from my own header file. (I used to run my program by hacking codegen_cuda.cc and adding one C++ header file.) However, I still need to integrate three more functions into TVM. They are used to efficiently:

**1. move data from global memory to shared memory**
**2. perform the im2col transformation**
**3. move the computation result back from shared memory to global memory**

These three functions are very complicated and are hard to express with schedules or IR, but they are efficient: my single-layer conv2d reaches 90%-130% of the performance of TensorRT's convolution (which also uses Tensor Core).
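For readers unfamiliar with the im2col step mentioned above, here is a plain CPU reference of what im2col computes. It is only a sketch, not the device function discussed in this thread (that code is not shown here). It assumes a single image in CHW layout with the same stride and padding in both dimensions, and the name `im2col_chw` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference im2col for one image stored in CHW order (illustrative only).
// Returns a row-major matrix with (channels * kernel_h * kernel_w) rows and
// (out_h * out_w) columns; positions that fall into the padding stay 0.
std::vector<float> im2col_chw(const std::vector<float>& input,
                              int channels, int height, int width,
                              int kernel_h, int kernel_w,
                              int stride, int pad) {
  assert(stride > 0);
  assert(input.size() ==
         static_cast<std::size_t>(channels) * height * width);
  const int out_h = (height + 2 * pad - kernel_h) / stride + 1;
  const int out_w = (width + 2 * pad - kernel_w) / stride + 1;
  std::vector<float> cols(
      static_cast<std::size_t>(channels) * kernel_h * kernel_w * out_h * out_w,
      0.0f);
  for (int c = 0; c < channels; ++c) {
    for (int kh = 0; kh < kernel_h; ++kh) {
      for (int kw = 0; kw < kernel_w; ++kw) {
        // One output row per (channel, kernel row, kernel column) triple.
        const int row = (c * kernel_h + kh) * kernel_w + kw;
        for (int oh = 0; oh < out_h; ++oh) {
          for (int ow = 0; ow < out_w; ++ow) {
            const int ih = oh * stride - pad + kh;
            const int iw = ow * stride - pad + kw;
            if (ih >= 0 && ih < height && iw >= 0 && iw < width) {
              cols[(static_cast<std::size_t>(row) * out_h + oh) * out_w + ow] =
                  input[(static_cast<std::size_t>(c) * height + ih) * width + iw];
            }
          }
        }
      }
    }
  }
  return cols;
}
```

After this transformation the convolution becomes a matrix multiplication between the flattened filters and `cols`, which is the shape of work that wmma-style matrix instructions operate on.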

However, these functions are all device functions and are called from the CUDA kernel that performs the convolution, so I cannot simply wrap them as packed functions (they are not host functions). Intrinsic functions may not be a good idea either, because that would put several hundred lines of string literals into the codegen_c file. The best approach I can think of is to add a flag to the builder and put this header file in a specific folder inside TVM. Once the flag is set to true, codegen_cuda.cc will include my header file in the generated code. (This may also be a very dumb idea, I guess.)
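The flag idea can be sketched roughly as follows. This is a standalone illustration, not TVM's actual code generator: the class `SketchCodeGen`, its methods, and the header name `tensorcore_helpers.h` are all hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a CUDA code generator (not TVM's real class).
class SketchCodeGen {
 public:
  // Hypothetical flag setter: request the helper header in the output.
  void EnableTensorCoreHelpers() { need_tensorcore_helpers_ = true; }

  // Append generated kernel source text.
  void AddKernel(const std::string& body) { body_stream_ << body << "\n"; }

  // Emit the final source; the include line appears only if the flag is set.
  std::string Finish() const {
    std::ostringstream out;
    if (need_tensorcore_helpers_) {
      out << "#include \"tensorcore_helpers.h\"\n";  // hypothetical header name
    }
    out << body_stream_.str();
    return out.str();
  }

 private:
  bool need_tensorcore_helpers_ = false;
  std::ostringstream body_stream_;
};
```

With this shape, kernels that never call the helper device functions get unchanged output, and only kernels built with the flag pay for the extra include.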
@vinx13 @tqchen

cc @merrymercy @masahi

I think this method is okay.

We do something similar for int8 and fp16 header files.