How to implement Conv2d for ResNet-50 in TVM

I am trying to measure TVM's performance for the conv2d operator on the following 9 input sizes, as used in ResNet-50:

[batch, in_height, in_width, in_channels], [filter_height, filter_width, in_channels, out_channels]
[128    16         16        32]           [3              3             32           32]
[128    16         16        32]           [3              3             32           32]
[128    8          8         64]           [3              3             64           64]
[128    32         32        3]            [3              3             3            16]
[128    34         34        16]           [3              3             16           32]
[128    32         32        16]           [1              1             16           16]
[128    18         18        32]           [3              3             32           64]
[128    32         32        16]           [1              1             16           32]
[128    16         16        32]           [1              1             32           64]
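For convenience, the same nine workloads written as plain Python tuples, so they can be looped over when benchmarking. The field order (N, H, W, CI, KH, KW, CO) is my own choice; strides and padding are not included because they are not listed above.

```python
resnet50_workloads = [
    # (N,  H,  W, CI, KH, KW, CO)
    (128, 16, 16, 32, 3, 3, 32),
    (128, 16, 16, 32, 3, 3, 32),
    (128,  8,  8, 64, 3, 3, 64),
    (128, 32, 32,  3, 3, 3, 16),
    (128, 34, 34, 16, 3, 3, 32),
    (128, 32, 32, 16, 1, 1, 16),
    (128, 18, 18, 32, 3, 3, 64),
    (128, 32, 32, 16, 1, 1, 32),
    (128, 16, 16, 32, 1, 1, 64),
]
```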

I found the following tutorial on how to implement and tune conv2d for NVIDIA GPUs: https://docs.tvm.ai/tutorials/autotvm/tune_conv2d_cuda.html. Unfortunately, the tutorial code does not support batch sizes larger than 1, and it stores the input and filter buffers in the layout

[batch, in_channels, in_height, in_width], [out_channels, in_channels, filter_height, filter_width]

instead of

[batch, in_height, in_width, in_channels], [filter_height, filter_width, in_channels, out_channels]

Is it possible to adapt the tutorial code to match ResNet-50's batch size and buffer layout? Additionally, is this the best conv2d implementation TVM offers, i.e. the one that should be used when comparing other frameworks against TVM?
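For context, here is a minimal sketch of the NHWC compute definition I have in mind, written in TVM's tensor expression language with the batch dimension kept as a parameter. The stride and padding values and the helper name `conv2d_nhwc` are my own placeholders; each ResNet-50 layer would need its real stride/padding. (topi also ships an NHWC compute, `topi.nn.conv2d_nhwc`, which may be a better starting point.)

```python
import tvm
import topi

def conv2d_nhwc(N, H, W, CI, KH, KW, CO, stride=1, padding=1, dtype="float32"):
    """Direct NHWC conv2d compute definition (stride/padding are placeholders)."""
    data = tvm.placeholder((N, H, W, CI), name="data", dtype=dtype)
    kernel = tvm.placeholder((KH, KW, CI, CO), name="kernel", dtype=dtype)

    # Zero-pad only the two spatial dimensions (H and W).
    padded = topi.nn.pad(data, [0, padding, padding, 0], name="padded")

    OH = (H + 2 * padding - KH) // stride + 1
    OW = (W + 2 * padding - KW) // stride + 1

    rh = tvm.reduce_axis((0, KH), name="rh")
    rw = tvm.reduce_axis((0, KW), name="rw")
    rc = tvm.reduce_axis((0, CI), name="rc")

    conv = tvm.compute(
        (N, OH, OW, CO),
        lambda n, oh, ow, co: tvm.sum(
            padded[n, oh * stride + rh, ow * stride + rw, rc]
            * kernel[rh, rw, rc, co],
            axis=[rh, rw, rc]),
        name="conv2d_nhwc")
    return data, kernel, conv

# Example: the first workload from the table above, batch size 128.
data, kernel, conv = conv2d_nhwc(128, 16, 16, 32, 3, 3, 32)
```

The tutorial's `@autotvm.template` function would then need to split and bind the n/oh/ow/co axes of this compute instead of the NCHW ones, and `module.time_evaluator` could be used to time the tuned kernel for each of the nine shapes.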

@tqchen @thierry Do you have any ideas on how to solve this issue?

The best way to run a comparison would still be to run the end-to-end benchmarks; see e.g. https://github.com/dmlc/tvm/blob/master/apps/benchmark/gpu_imagenet_bench.py

@tqchen: thank you, but right now we are interested in TVM's performance at the operator level. Any suggestions?

The end-to-end code does use operator code generated by autotvm, so if you trace the code that is being run, you can likely find these operators.
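For example, here is a rough sketch of how the conv2d tasks (and thus the exact operator workloads) could be recovered from a network, assuming the Relay frontend and `relay.testing` are available in your TVM build. The argument order of `extract_from_program` has changed between TVM versions, so check your local docs; also note that `relay.testing`'s ResNet-50 uses NCHW layout and ImageNet-sized inputs, so its workloads will not necessarily match the table above.

```python
from tvm import relay, autotvm
from tvm.relay import testing

# Build a ResNet-50 workload; batch size 128 is taken from the table above.
mod, params = testing.resnet.get_workload(num_layers=50, batch_size=128)

# Extract one autotvm task per conv2d workload in the network; each task's
# `args` field records the concrete shapes, strides and padding.
tasks = autotvm.task.extract_from_program(
    mod, params=params, target="cuda", ops=(relay.op.nn.conv2d,))

for t in tasks:
    print(t.args)
```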