[RFC][Tensor Core] Optimization of CNNs on Tensor Core

Hi,

I have tried to tune my conv2d workload [‘NHWC’, (32, 300, 300, 64)], but it failed because cuLaunchKernel was given grid_dim = (2, 4, 90000), and 90000 exceeds the 65535 limit on the z dimension of the grid:

    bz = s[output].fuse(hi, wi)   # fused H*W axis: 300 * 300 = 90000
    s[output].bind(bz, block_z)   # bound directly to blockIdx.z, whose limit is 65535

It seems like there should be an H/W-direction tiling config to support all shapes.
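Something along these lines is what I have in mind (a minimal, self-contained sketch; the elementwise compute is only a stand-in for the real conv2d schedule, and the split factor would become a tunable knob):

    import tvm
    from tvm import te

    H, W, C = 300, 300, 64
    A = te.placeholder((H, W, C), name="A", dtype="float32")
    output = te.compute((H, W, C), lambda h, w, c: A[h, w, c] * 2.0, name="output")

    s = te.create_schedule(output.op)
    hi, wi, ci = s[output].op.axis

    hw = s[output].fuse(hi, wi)               # 300 * 300 = 90000 > 65535
    hwo, hwi = s[output].split(hw, factor=4)  # tile so the outer extent fits the grid limit
    s[output].bind(hwo, te.thread_axis("blockIdx.z"))  # 22500 blocks, within the limit
    s[output].bind(hwi, te.thread_axis("blockIdx.y"))
    s[output].bind(ci, te.thread_axis("threadIdx.x"))

    print(tvm.lower(s, [A, output], simple_mode=True))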

You are right. Thank you for figuring out the bug.

That would be my fault: I focused on the classical workloads (e.g. ResNet) but forgot to test large shapes. It’s easy to fix. Can you please create a PR?


Hi, @Hzfengsy @Shawn_Inspur :slightly_smiling_face:

Thanks for your efforts on supporting Tensor Cores in TVM.

I have tuned Tensor Core schedules on classical networks such as ResNet-50 and VGG-16 (batch size 32). The tensor_precision_fu_utilization metric reported by nvprof shows that I only got Mid/Low utilization of the Tensor Cores:

   Kernel: fused_nn_conv2d_add_nn_relu_2_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (4)     Mid (4)     Mid (4)          
   Kernel: fused_nn_softmax_kernel3                                                                                                            
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)          
   Kernel: fused_nn_conv2d_add_nn_relu_3_kernel0                                                                                               
         4           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (4)     Mid (4)     Mid (4)          
   Kernel: fused_nn_conv2d_add_nn_relu_4_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (4)     Mid (4)     Mid (4)          
   Kernel: fused_nn_batch_flatten_kernel0                                                                                                      
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)          
   Kernel: fused_nn_conv2d_add_nn_relu_5_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (4)     Mid (4)     Mid (4)          
   Kernel: fused_nn_conv2d_add_nn_relu_6_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (4)     Mid (4)     Mid (4)          
   Kernel: fused_nn_dense_add_kernel0                                                                                                          
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Low (2)     Low (2)     Low (2)          
   Kernel: fused_nn_conv2d_add_nn_relu_7_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Low (3)     Low (3)     Low (3)          
   Kernel: fused_nn_conv2d_add_nn_relu_8_kernel0                                                                                               
         2           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)          
   Kernel: fused_nn_conv2d_add_nn_relu_kernel0                                                                                                 

But when I use cuDNN as the backend, the utilization is always High.
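For reference, this is roughly how I select cuDNN as the backend (a minimal sketch; the single conv2d below is only a placeholder for the real network):

    import tvm
    from tvm import relay

    # A single fp16 conv2d, just to illustrate the target string.
    data = relay.var("data", shape=(32, 64, 56, 56), dtype="float16")
    weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="float16")
    conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=64, padding=(1, 1))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

    # "-libs=cudnn" makes the CUDA backend offload supported operators
    # (conv2d among them) to cuDNN instead of using TVM's own schedules.
    target = "cuda -libs=cudnn"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)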

It seems that there is still a lot of room for further optimization. Do you have any idea how to get higher utilization of the Tensor Cores?

Hi @Novice ,

Yes, I agree that TVM on Tensor Core GPUs does have a lot of room for optimization. Currently we are optimizing the data path between global memory and registers, which we think is a major bottleneck. We are experimenting with different layouts for both feature maps and weights, and we have found that weights in the ‘HWOI’ layout, as suggested by @Hzfengsy, do improve performance for int8 inference on Tensor Cores.
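For illustration, layout experiments like this can also be expressed at the Relay level with the ConvertLayout pass. The sketch below is not our actual code path; it uses the documented ‘NHWC’/‘HWIO’ pair, and whether ‘HWOI’ is accepted here depends on which conv2d implementations your TVM build registers:

    import tvm
    from tvm import relay

    # Start from a plain NCHW/OIHW conv2d (what most frontends import by default).
    data = relay.var("data", shape=(32, 64, 56, 56), dtype="int8")
    weight = relay.var("weight", shape=(64, 64, 3, 3), dtype="int8")
    conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=64,
                           padding=(1, 1), out_dtype="int32")
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

    # Ask for NHWC activations and HWIO weights; substitute the kernel layout
    # you want to experiment with, if your TVM version supports it.
    desired_layouts = {"nn.conv2d": ["NHWC", "HWIO"]}
    seq = tvm.transform.Sequential([
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ])
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)
    print(mod)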

Thanks,
Shawn Wu

I am not sure whether my post here is meaningful, but here are the conclusions I drew from my recent tests:

  1. fp16 is not always faster than fp32. With some models fp16 is faster, but there are also models where fp32 runs faster than fp16 (both after tuning, 2000 trials per task).

  2. TVM fp16 inference is slower than TensorRT fp16 inference. On my platform, my model achieves about 35 fps with TVM but 58 fps with TensorRT.

I do not know what I missed. I am waiting to see tutorials about using TVM for fp16 inference.

By the way, I am willing to share my model and my test code if people would like to spend time looking at them.

What’s your platform? I have tested this on an NVIDIA GTX 1660 Super and got a similar conclusion. But on a Tesla T4, the models are faster because the Tensor Cores are used.

I am using a T4, but I do not know how to use the Tensor Cores. Is there any option that I can explicitly set to true to use them?

If you use TVM to compile operators whose shapes support Tensor Cores, then the Tensor Core schedules should be picked automatically on the T4. So I guess you used some operators whose shapes don’t support Tensor Cores (like batch=1)?

Yes, I used batch size 1, and I compiled my whole model through ONNX rather than a single operator. I compiled the same model with TensorRT, and the TensorRT speed is much faster than TVM (fp16 mode). Maybe we could wait for more TVM updates and optimizations.

OK, in the case of batch=1, due to the current TVM TOPI implementation, the operator cannot be optimized with Tensor Cores. However, other libraries (such as TensorRT) can still use Tensor Cores in this case through im2col.
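To make the im2col point concrete, here is a small NumPy-only sketch (illustrative only, not TVM or TensorRT code). With batch=1, the convolution becomes a single GEMM whose M dimension is OH * OW, which is still large enough to fill Tensor Core tiles:

    import numpy as np

    def conv2d_nhwc_im2col(data, weight, stride=1):
        # data:   (1, H, W, C)   single-image NHWC input
        # weight: (KH, KW, C, O) HWIO kernel
        n, h, w, c = data.shape
        kh, kw, _, o = weight.shape
        oh = (h - kh) // stride + 1
        ow = (w - kw) // stride + 1

        # Gather every receptive field into one row: (OH*OW, KH*KW*C)
        cols = np.empty((oh * ow, kh * kw * c), dtype=data.dtype)
        for i in range(oh):
            for j in range(ow):
                patch = data[0, i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                cols[i * ow + j] = patch.reshape(-1)

        # One big GEMM: (OH*OW, KH*KW*C) x (KH*KW*C, O)
        out = cols @ weight.reshape(kh * kw * c, o)
        return out.reshape(1, oh, ow, o)

    # batch=1, 56x56x64 input, 3x3x64x64 kernel -> a (54*54) x 576 x 64 GEMM
    y = conv2d_nhwc_im2col(np.ones((1, 56, 56, 64), np.float16),
                           np.ones((3, 3, 64, 64), np.float16))
    print(y.shape)  # (1, 54, 54, 64)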

You can make your conv2d operators meet these shape conditions (roughly sketched below) to get the Tensor Core optimization in TVM.
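As far as I remember, the conditions checked by the CUDA strategy before it selects the conv2d_nhwc_tensorcore schedule are roughly the following. This is a sketch from memory, so please double-check python/tvm/relay/op/strategy/cuda.py in your TVM version:

    def can_use_tensorcore(batch, in_channels, out_channels):
        # The three alternatives correspond to the supported wmma fragment
        # shapes (16x16x16, 8x32x16, 32x8x16) for fp16 NHWC conv2d.
        return (
            (batch % 16 == 0 and in_channels % 16 == 0 and out_channels % 16 == 0)
            or (batch % 8 == 0 and in_channels % 16 == 0 and out_channels % 32 == 0)
            or (batch % 32 == 0 and in_channels % 16 == 0 and out_channels % 8 == 0)
        )

    print(can_use_tensorcore(1, 64, 64))   # False -> falls back to a non-tensorcore schedule
    print(can_use_tensorcore(16, 64, 64))  # True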

Is it possible that I compile my model with batch size 16 but call the compiled lib with a different batch size? Could I see a speed improvement with this method?

I feel it might not work. :disappointed_relieved:

The runtime will expect the input to match the compiled lib. Also, the compiler requires a constant batch size value.

Not sure if using the Relay VM may be of help to you? I have not used it myself, and I have the impression that it may not address your problem, but it’s worth looking into.
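If you want to try it, this is roughly what I had in mind, as a sketch only since I have not used it myself: import the model with a symbolic batch dimension (relay.Any()) and compile with the Relay VM, which handles dynamic shapes that the graph executor cannot. Whether the CUDA backend can compile every operator with a dynamic batch depends on your TVM version, and dynamic-batch kernels will most likely not use the Tensor Core schedules discussed above; the file name and input name below are placeholders.

    import onnx
    import tvm
    from tvm import relay
    from tvm.runtime.vm import VirtualMachine

    onnx_model = onnx.load("my_model.onnx")             # placeholder path
    shape_dict = {"input": (relay.Any(), 3, 224, 224)}  # symbolic batch dimension
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    with tvm.transform.PassContext(opt_level=3):
        vm_exec = relay.vm.compile(mod, target="cuda", params=params)

    dev = tvm.cuda(0)  # tvm.gpu(0) on older TVM versions
    vm = VirtualMachine(vm_exec, dev)
    # result = vm.invoke("main", some_input_ndarray)    # can be called with any batch size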

Thanks @alopez_13 and @reku for telling me this!! I have read the relevant tutorials more carefully.

Hi, I am interested in TVM’s performance for the conv2d operator on Tensor Cores. I experimented on V100 and T4 platforms using the schedule template in ‘topi/cuda/conv2d_nhwc_tensorcore’. The results show that AutoTVM never performs better than cuDNN on six commonly used shapes in float16 mode. In some cases (like conv2d_nhwc_32_56_56_256_3_3_64_1_0), AutoTVM’s tuned results only achieve about 50% of cuDNN’s performance. I wonder whether there exist some cases (shapes or data layouts) in which AutoTVM performs better than cuDNN? Or can the template be further optimized?
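For anyone who wants to reproduce a single-operator run, below is roughly how I create and tune one conv2d task with this template. This is a sketch: the registered task name and argument order may differ between TVM versions, and the shape is just an example fp16 NHWC workload.

    import tvm
    from tvm import autotvm, te, topi  # importing topi registers the tensorcore templates

    target = tvm.target.Target("cuda")

    # data (N, H, W, CI), kernel (KH, KW, CI, CO)
    data = te.placeholder((32, 56, 56, 64), name="data", dtype="float16")
    kernel = te.placeholder((3, 3, 64, 256), name="kernel", dtype="float16")

    task = autotvm.task.create(
        "conv2d_nhwc_tensorcore.cuda",
        args=(data, kernel, (1, 1), (1, 1), (1, 1), "float16"),  # strides, padding, dilation, out_dtype
        target=target,
    )

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=3, timeout=10),
    )
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=1000,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("conv2d_tensorcore.log")],
    )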

Can you tell me how to tune a network with Tensor Cores on TVM? Thanks.