I am trying to measure TVM's performance for the conv2d operator on the following 9 input sizes, as used in ResNet-50:
[batch, in_height, in_width, in_channels], [filter_height, filter_width, in_channels, out_channels]
[128 16 16 32] [3 3 32 32]
[128 16 16 32] [3 3 32 32]
[128 8 8 64] [3 3 64 64]
[128 32 32 3] [3 3 3 16]
[128 34 34 16] [3 3 16 32]
[128 32 32 16] [1 1 16 16]
[128 18 18 32] [3 3 32 64]
[128 32 32 16] [1 1 16 32]
[128 16 16 32] [1 1 32 64]
I found the following tutorial on how to implement and tune conv2d for NVIDIA GPUs: https://docs.tvm.ai/tutorials/autotvm/tune_conv2d_cuda.html. Unfortunately, the tutorial code does not support batch sizes larger than 1, and it stores the input and filter buffers in the layout
[batch, in_channels, in_height, in_width], [out_channels, in_channels, filter_height, filter_width]
instead of
[batch, in_height, in_width, in_channels], [filter_height, filter_width, in_channels, out_channels]
Is it possible to adapt the tutorial code to match ResNet-50’s batch size and buffer layout? Additionally, is the tutorial's conv2d the best available implementation to use when comparing against TVM?
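
To make the layout difference concrete, here is a minimal sketch in plain Python (not TVM code) of the axis permutations between the two layouts. The helper names `layout_perm` and `permute_shape` and the layout strings "HWIO"/"OIHW" for the filter are my own labels for illustration, not part of the tutorial or of any TVM API.

```python
def layout_perm(src, dst):
    """Return the axis order that reorders a `src`-layout buffer into `dst`.

    Entry i of the result is the index in `src` of the axis that ends up
    at position i in `dst`.
    """
    if sorted(src) != sorted(dst):
        raise ValueError("layouts must name the same axes")
    return tuple(src.index(axis) for axis in dst)


def permute_shape(shape, src, dst):
    """Apply the src -> dst axis permutation to a shape tuple."""
    return tuple(shape[i] for i in layout_perm(src, dst))


# Shapes from the list above: (input in NHWC, filter in HWIO).
shapes = [
    ((128, 16, 16, 32), (3, 3, 32, 32)),
    ((128, 8, 8, 64), (3, 3, 64, 64)),
    ((128, 32, 32, 3), (3, 3, 3, 16)),
    ((128, 34, 34, 16), (3, 3, 16, 32)),
]

for data_shape, filter_shape in shapes:
    # Print each pair as it would look in the tutorial's layout.
    print(
        permute_shape(data_shape, "NHWC", "NCHW"),
        permute_shape(filter_shape, "HWIO", "OIHW"),
    )
```

The tuples returned by `layout_perm` ((0, 3, 1, 2) for the input and (3, 2, 0, 1) for the filter) are in the form that `numpy.transpose` expects for its `axes` argument, so the same permutation could be used to convert actual buffers before feeding them to NCHW code. This only addresses the layout mismatch, not the batch size limitation.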