Implement Conv2d using Tensor Core


#1

Motivation

Volta architecture graphics cards are equipped with Tensor Cores, which greatly increase compute power compared with Pascal architecture cards. The peak performance of the Tensor Cores can be as high as 112 TFLOPS (125 TFLOPS with NVLink), which is nearly 8-9 times the performance of Pascal architecture cards. Thus I propose adding the CUDA Tensor Core API to TVM. What's more, to implement a convolution layer with Tensor Cores, I believe we should give TVM the ability to include our own header files when generating a CUDA kernel.

Action Needed

1. Add new Python APIs so that one can declare and call the CUDA Tensor Core related data structures and APIs, namely: wmma::fragment, wmma::load_matrix_sync, wmma::fill_fragment, wmma::mma_sync, and wmma::store_matrix_sync.

2. Add a Python API that enables one to include specific C/C++ header files. In my case, I need to handle loading matrices from global memory to shared memory in an elaborately designed order.
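For context, the wmma calls listed above always follow the same declare / fill / load / compute / store sequence on the CUDA side. The kernel below is a minimal illustrative sketch of that sequence for a single 16x16x16 half/float tile; the kernel name, tile shape, and layouts are assumptions for illustration, not the code TVM would eventually generate.

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: one warp computes a single 16x16 tile of D = A * B.
// a, b hold one 16x16 half-precision tile each (leading dimension 16);
// d receives the float accumulator tile.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *d) {
    // wmma::fragment is a template class; the fragment must be declared
    // first, then filled/loaded by the other warp-level APIs.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);        // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);      // load A tile, ld = 16
    wmma::load_matrix_sync(b_frag, b, 16);      // load B tile, ld = 16
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc += A * B
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

All five APIs from the list above appear here; exposing them from Python essentially means teaching the CUDA codegen to emit a sequence like this.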


Implement conv2d with tensor core (2nd question)
#2

cc @vinx13 @merrymercy


#3

Calling wmma APIs can be difficult. wmma::fragment is a template class: we declare a fragment first and then call the other APIs to load data into it. Therefore call_packed doesn't work here. I think we need a way to generate / insert C++ code snippets into the generated CUDA kernel.


#4

From a high level, this should still be a fit for tensorization, right? Can we use tensorize to emit a C++ code snippet here?


#5

From a high level, we can define intrinsics and then modify the CUDA code generator to generate code that makes use of wmma::fragment etc.


#7

This problem is a little bit tricky and cannot simply use tensorization. Each wmma API requires 32 threads (a warp) to execute with the same address. Tensorizing will cause trouble either in the number of threads or in the value of the address.
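To make the warp-uniformity point above concrete, here is an illustrative sketch (names and indexing are assumptions, not TVM output) in which the tile address is derived from the warp index, so all 32 lanes of a warp pass the identical pointer to the wmma calls. A per-thread address computed from threadIdx.x, as a naive tensorize pattern would produce, violates this contract.

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch: each warp owns one 16-row tile of D = A * B.
// n is the leading dimension of the row-major A and D matrices.
__global__ void warp_tiled_gemm(const half *a, const half *b, float *d, int n) {
    // The wmma APIs are warp-collective: all 32 lanes must agree on the
    // tile address, so it is computed from the warp index, not the lane.
    int warp_id  = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int tile_row = warp_id * 16;   // identical value in every lane of the warp

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // Every lane passes the same pointer a + tile_row * n; an address that
    // varied per thread would break the warp-level contract.
    wmma::load_matrix_sync(a_frag, a + tile_row * n, n);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(d + tile_row * n, acc_frag, n, wmma::mem_row_major);
}
```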


#8

Adding intrinsics seems to be a solution both for the wmma APIs and for user-defined C/C++ header files.