Hello TVM developers:
I have recently been working on CUDA-related optimization and code generation on the TVM stack.
As we know, for GPU applications, memory access is one of the most common and important performance constraints.
TVM already provides a CUDA code generator and scheduler for us, which is great.
For some applications, however, I think the CUDA code generated by TVM is not optimal, especially the data-transfer part, so I would like to replace the memory-access-related code with a handcrafted CUDA device function. Handcrafted CUDA code gives us more freedom: we can change the order in which threads access memory and make sure sequential accesses are adjacent, i.e.
shared_mem[thread_x + offset_local] = global_mem[thread_x + offset_global];
This handcrafted coalesced access pattern should be more efficient.
For instance, suppose we have a CUDA kernel generated by TVM:
extern "C" __global__ void kernel(void* __restrict__ Input, float* __restrict__ Output){
__shared__ float shared_mem[A];
....
// code generated by TVM
// load data to shared memory
shared_mem[thread_x] = Input[y]
//do computation based on data in shared_mem, and store back global memory
Output = shared_mem[thread_x] * 123
}
And what I want to do is replace the automatically generated part
shared_mem[thread_x] = Input[y];
with
handcrafted_read(shared_mem, Input);
where handcrafted_read is defined as
__device__ void handcrafted_read(float* shared_buffer, const float* data_to_be_read)
{
    // handcrafted code here
}
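To make the intent concrete, here is a minimal sketch of what the body could look like; the offset_global and num_elems parameters are hypothetical extras I would pass in, just to illustrate the coalesced pattern:

// Minimal sketch, not a final implementation: each block copies
// num_elems consecutive floats starting at offset_global, so
// consecutive threads always touch consecutive global addresses.
__device__ void handcrafted_read(float* shared_buffer,
                                 const float* data_to_be_read,
                                 int offset_global,
                                 int num_elems)
{
    for (int i = threadIdx.x; i < num_elems; i += blockDim.x) {
        shared_buffer[i] = data_to_be_read[offset_global + i];
    }
    __syncthreads(); // make the loaded tile visible to the whole block
}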
Could you give me some tips/directions on how to implement this?
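One direction I have found so far is the tvm_callback_cuda_postproc hook, which, if I understand correctly, lets a registered Python function rewrite the generated CUDA source before it is compiled. A rough sketch of the idea (the plain string replacement is only illustrative and obviously fragile):

import tvm

HANDCRAFTED = r"""
__device__ void handcrafted_read(float* shared_buffer, const float* data_to_be_read) {
    // handcrafted code here
}
"""

@tvm.register_func("tvm_callback_cuda_postproc")
def postproc(code):
    # Prepend the handcrafted device function, then swap the generated
    # load for a call to it; the exact generated line will vary.
    code = HANDCRAFTED + code
    code = code.replace("shared_mem[thread_x] = Input[y];",
                        "handcrafted_read(shared_mem, Input);")
    return code

Is this kind of source post-processing the right way to go, or is tensorize with an extern call the intended mechanism?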
Thanks a lot!