I’m trying to add TVM as an inference runtime for SimpleDet.
Unlike what Wuwei Lin has done, I have implemented all the needed operators, such as proposal and decode bbox, as pure nnvm operators, so that TVM developers do not have to pay special attention to detection tasks.
Faster R-CNN, Mask R-CNN, FPN, and RetinaNet now all run correctly on both the cuda and llvm backends (with APs almost identical to those in MXNet).
But there are some performance issues I’d like to discuss with the TVM community before releasing this publicly.
1. The 7x7 conv for an input of shape 1x3x800x1200 accounts for 9 GFLOPs, and after autotvm tuning I get a kernel that runs at 7 TFLOPS. I assumed this conv should take on the order of 1 ms, but quite surprisingly it takes 20 ms. This accounts for almost half of the backbone runtime. Running the whole module twice gives an even worse 70 ms.
2. Argsort takes 455 ms to run on a 56k-element input. Is anyone looking at this? If not, I am willing to contribute.
3. Convs with a large batch size and small spatial dims, such as (1000, 2048, 7, 7), take about 2x as long as cuDNN: TVM takes 20 ms while cuDNN takes 9 ms.
Looking forward to some advice and suggestions.
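To make the gap in item 1 concrete, here is a minimal back-of-the-envelope sketch in Python. It only uses figures from this post (9 GFLOPs, 7 TFLOPS, and the 21353.2 us measured for `fused_nn_conv2d_add_nn_relu_10` in the profile below). The helper names are made up for illustration and are not part of TVM.

```python
def ideal_time_ms(flops, throughput_flops_per_s):
    """Time in ms a kernel would need at the given sustained throughput."""
    return flops / throughput_flops_per_s * 1e3


def effective_gflops(flops, measured_ms):
    """Throughput in GFLOPS implied by a measured runtime."""
    return flops / (measured_ms * 1e-3) / 1e9


# Figures quoted in the post (not re-derived here).
conv_flops = 9e9            # 7x7 conv, as stated in item 1
tuned_throughput = 7e12     # throughput reported after tuning
measured_ms = 21353.2 / 1e3  # profile entry, converted from us to ms

expected_ms = ideal_time_ms(conv_flops, tuned_throughput)
achieved = effective_gflops(conv_flops, measured_ms)
slowdown = measured_ms / expected_ms

print(f"expected at tuned throughput: {expected_ms:.2f} ms")
print(f"achieved in the full graph:   {achieved:.0f} GFLOPS")
print(f"slowdown vs. expectation:     {slowdown:.1f}x")
```

With these inputs the kernel would need roughly 1.3 ms at the tuned throughput, so the measured time is more than an order of magnitude off, which suggests the kernel running inside the graph may not be the one that was benchmarked during tuning.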
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
fused_nn_conv2d_add_nn_relu_10 fused_nn_conv2d_add_nn_relu_10 21353.2 2.964 (1, 64, 600, 400) 3 1
fused_nn_max_pool2d fused_nn_max_pool2d 225.366 0.031 (1, 64, 300, 200) 1 1
fused_nn_conv2d_add_nn_relu_9 fused_nn_conv2d_add_nn_relu_9 130.447 0.018 (1, 64, 300, 200) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3 824.072 0.114 (1, 64, 300, 200) 3 1
fused_nn_conv2d_add fused_nn_conv2d_add 433.473 0.06 (1, 256, 300, 200) 3 1
fused_nn_conv2d_add_add_nn_relu_3 fused_nn_conv2d_add_add_nn_relu_3 660.886 0.092 (1, 256, 300, 200) 4 1
fused_nn_conv2d_add_nn_relu_8 fused_nn_conv2d_add_nn_relu_8 352.184 0.049 (1, 64, 300, 200) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_31 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3 825.523 0.115 (1, 64, 300, 200) 3 1
fused_nn_conv2d_add_add_nn_relu_31 fused_nn_conv2d_add_add_nn_relu_3 665.566 0.092 (1, 256, 300, 200) 4 1
fused_nn_conv2d_add_nn_relu_81 fused_nn_conv2d_add_nn_relu_8 350.198 0.049 (1, 64, 300, 200) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_32 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3 822.122 0.114 (1, 64, 300, 200) 3 1
fused_nn_conv2d_add_add_nn_relu_32 fused_nn_conv2d_add_add_nn_relu_3 667.212 0.093 (1, 256, 300, 200) 4 1
fused_nn_conv2d_add_nn_relu_7 fused_nn_conv2d_add_nn_relu_7 612.872 0.085 (1, 128, 300, 200) 3 1
fused_nn_conv2d_add_nn_relu_6 fused_nn_conv2d_add_nn_relu_6 721.025 0.1 (1, 128, 150, 100) 3 1
fused_nn_conv2d_add_1 fused_nn_conv2d_add_1 782.087 0.109 (1, 512, 150, 100) 3 1
fused_nn_conv2d_add_add_nn_relu_2 fused_nn_conv2d_add_add_nn_relu_2 544.516 0.076 (1, 512, 150, 100) 4 1
fused_nn_conv2d_add_nn_relu_5 fused_nn_conv2d_add_nn_relu_5 378.919 0.053 (1, 128, 150, 100) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2 586.879 0.081 (1, 128, 150, 100) 3 1
fused_nn_conv2d_add_add_nn_relu_21 fused_nn_conv2d_add_add_nn_relu_2 545.552 0.076 (1, 512, 150, 100) 4 1
fused_nn_conv2d_add_nn_relu_51 fused_nn_conv2d_add_nn_relu_5 378.946 0.053 (1, 128, 150, 100) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_21 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2 584.278 0.081 (1, 128, 150, 100) 3 1
fused_nn_conv2d_add_add_nn_relu_22 fused_nn_conv2d_add_add_nn_relu_2 538.901 0.075 (1, 512, 150, 100) 4 1
fused_nn_conv2d_add_nn_relu_52 fused_nn_conv2d_add_nn_relu_5 377.892 0.052 (1, 128, 150, 100) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_22 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2 583.464 0.081 (1, 128, 150, 100) 3 1
fused_nn_conv2d_add_add_nn_relu_23 fused_nn_conv2d_add_add_nn_relu_2 535.548 0.074 (1, 512, 150, 100) 4 1
fused_nn_conv2d_add_nn_relu_4 fused_nn_conv2d_add_nn_relu_4 651.389 0.09 (1, 256, 150, 100) 3 1
fused_nn_conv2d_add_nn_relu_3 fused_nn_conv2d_add_nn_relu_3 765.693 0.106 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_2 fused_nn_conv2d_add_2 837.883 0.116 (1, 1024, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_1 fused_nn_conv2d_add_add_nn_relu_1 411.432 0.057 (1, 1024, 75, 50) 4 1
fused_nn_conv2d_add_nn_relu_2 fused_nn_conv2d_add_nn_relu_2 429.464 0.06 (1, 256, 75, 50) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 441.932 0.061 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_11 fused_nn_conv2d_add_add_nn_relu_1 409.714 0.057 (1, 1024, 75, 50) 4 1
fused_nn_conv2d_add_nn_relu_21 fused_nn_conv2d_add_nn_relu_2 427.215 0.059 (1, 256, 75, 50) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_11 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 437.224 0.061 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_12 fused_nn_conv2d_add_add_nn_relu_1 409.383 0.057 (1, 1024, 75, 50) 4 1
fused_nn_conv2d_add_nn_relu_22 fused_nn_conv2d_add_nn_relu_2 426.843 0.059 (1, 256, 75, 50) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_12 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 439.42 0.061 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_13 fused_nn_conv2d_add_add_nn_relu_1 409.185 0.057 (1, 1024, 75, 50) 4 1
fused_nn_conv2d_add_nn_relu_23 fused_nn_conv2d_add_nn_relu_2 427.26 0.059 (1, 256, 75, 50) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_13 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 438.023 0.061 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_14 fused_nn_conv2d_add_add_nn_relu_1 409.543 0.057 (1, 1024, 75, 50) 4 1
fused_nn_conv2d_add_nn_relu_24 fused_nn_conv2d_add_nn_relu_2 427.47 0.059 (1, 256, 75, 50) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_14 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1 438.185 0.061 (1, 256, 75, 50) 3 1
fused_nn_conv2d_add_add_nn_relu_15 fused_nn_conv2d_add_add_nn_relu_1 408.131 0.057 (1, 1024, 75, 50) 4 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_4 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_4 2584.37 0.359 (1, 512, 75, 50) 3 1
fused_nn_conv2d_add_3 fused_nn_conv2d_add_3 73.911 0.01 (1, 30, 75, 50) 3 1
fused_reshape_3 fused_reshape_3 12.871 0.002 (1, 2, 15, 75, 50) 1 1
fused_nn_softmax_1 fused_nn_softmax_1 23.772 0.003 (1, 2, 15, 75, 50) 1 1
fused_reshape_2 fused_reshape_2 11.936 0.002 (1, 30, 75, 50) 1 1
fused_arange_reshape_broadcast_to_reshape_slice_like_broadcast_to_reshape_stride_10214900708559545960_ fused_arange_reshape_broadcast_to_reshape_slice_like_broadcast_to_reshape_stride_10214900708559545960_ 11.777 0.002 (1, 56250) 2 1
fused_argsort fused_argsort 445590.0 61.855 (1, 56250) 1 1
fused_nn_conv2d_add_4 fused_nn_conv2d_add_4 79.326 0.011 (1, 60, 75, 50) 3 1
fused_expand_dims_split fused_expand_dims_split 23.175 0.003 (1, 1, 1) 1 3
fused_expand_dims_split fused_expand_dims_split 23.175 0.003 (1, 1, 1) 1 3
fused_expand_dims_split fused_expand_dims_split 23.175 0.003 (1, 1, 1) 1 3
fused_arange_reshape_broadcast_to_reshape_strided_slice_reshape_stack_gather_nd__1662028809018444607_ fused_arange_reshape_broadcast_to_reshape_strided_slice_reshape_stack_gather_nd__1662028809018444607_ 18.633 0.003 (1, 6000, 6) 8 1
fused_vision_get_valid_counts fused_vision_get_valid_counts 37.117 0.005 (1,) 1 2
fused_vision_get_valid_counts fused_vision_get_valid_counts 37.117 0.005 (1, 6000, 6) 1 2
fused_vision_non_max_suppression fused_vision_non_max_suppression 6980.58 0.969 (1, 6000, 6) 2 1
fused_strided_slice_strided_slice fused_strided_slice_strided_slice 13.944 0.002 (1, 1000, 4) 1 1
fused_vision_roi_align fused_vision_roi_align 3449.59 0.479 (1, 1000, 1024, 7, 7) 2 1
fused_reshape_1 fused_reshape_1 1143.54 0.159 (1000, 1024, 7, 7) 1 1
fused_nn_conv2d_add_nn_relu_1 fused_nn_conv2d_add_nn_relu_1 9089.74 1.262 (1000, 512, 7, 7) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu 20138.2 2.796 (1000, 512, 7, 7) 3 1
fused_nn_conv2d_add_5 fused_nn_conv2d_add_5 36522.4 5.07 (1000, 2048, 7, 7) 3 1
fused_nn_conv2d_add_add_nn_relu fused_nn_conv2d_add_add_nn_relu 20964.4 2.91 (1000, 2048, 7, 7) 4 1
fused_nn_conv2d_add_nn_relu fused_nn_conv2d_add_nn_relu 17931.1 2.489 (1000, 512, 7, 7) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu1 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu 21757.3 3.02 (1000, 512, 7, 7) 3 1
fused_nn_conv2d_add_add_nn_relu1 fused_nn_conv2d_add_add_nn_relu 21441.5 2.976 (1000, 2048, 7, 7) 4 1
fused_nn_conv2d_add_nn_relu1 fused_nn_conv2d_add_nn_relu 17789.9 2.47 (1000, 512, 7, 7) 3 1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu2 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu 21733.7 3.017 (1000, 512, 7, 7) 3 1
fused_nn_conv2d_add_add_nn_relu_cast fused_nn_conv2d_add_add_nn_relu_cast 21234.0 2.948 (1000, 2048, 7, 7) 4 1
fused_nn_global_avg_pool2d fused_nn_global_avg_pool2d 6932.16 0.962 (1000, 2048, 1, 1) 1 1
fused_nn_batch_flatten fused_nn_batch_flatten 53.207 0.007 (1000, 2048) 1 1
fused_nn_dense_add fused_nn_dense_add 1023.42 0.142 (1000, 81) 3 1
fused_nn_softmax fused_nn_softmax 27.016 0.004 (1000, 81) 1 1
fused_reshape fused_reshape 12.118 0.002 (1, 1000, 81) 1 1
fused_nn_dense_add_1 fused_nn_dense_add_1 127.332 0.018 (1000, 8) 3 1
fused_reshape_strided_slice_split_multiply_add_strided_slice_split_subtract_add__8298732747436919636_ fused_reshape_strided_slice_split_multiply_add_strided_slice_split_subtract_add__8298732747436919636_ 14.025 0.002 (1, 1000, 4) 5 1
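For anyone who wants to rank the operators in a dump like the one above, here is a small hedged sketch in Python. It assumes each row has the whitespace-separated layout shown in this post (node name, op name, time in us, time in %, shape, inputs, outputs). The `ROW` pattern and `parse_profile` are hypothetical helpers, not a TVM API, and `SAMPLE` holds just a few rows copied from the profile.

```python
import re

# One profile row: node, op, time(us), time(%), shape tuple, #inputs, #outputs.
ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<op>\S+)\s+(?P<time_us>[\d.]+)\s+(?P<pct>[\d.]+)\s+"
    r"(?P<shape>\(.*\))\s+(?P<inputs>\d+)\s+(?P<outputs>\d+)$"
)


def parse_profile(text):
    """Return (node, time_us, pct) for every line that looks like a data row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # header and separator lines do not match and are skipped
            rows.append((m["node"], float(m["time_us"]), float(m["pct"])))
    return rows


SAMPLE = """\
Node Name Ops Time(us) Time(%) Shape Inputs Outputs
--------- --- -------- ------- ----- ------ -------
fused_nn_conv2d_add_nn_relu_10 fused_nn_conv2d_add_nn_relu_10 21353.2 2.964 (1, 64, 600, 400) 3 1
fused_argsort fused_argsort 445590.0 61.855 (1, 56250) 1 1
fused_vision_non_max_suppression fused_vision_non_max_suppression 6980.58 0.969 (1, 6000, 6) 2 1
fused_nn_conv2d_add_5 fused_nn_conv2d_add_5 36522.4 5.07 (1000, 2048, 7, 7) 3 1
"""

rows = parse_profile(SAMPLE)
top = sorted(rows, key=lambda r: r[1], reverse=True)

# The whole-graph time can be estimated from any row as time / percentage.
total_us = top[0][1] / top[0][2] * 100

for node, time_us, pct in top:
    print(f"{time_us / 1e3:9.2f} ms  {pct:6.2f}%  {node}")
print(f"estimated total: {total_us / 1e3:.0f} ms")
```

On these sample rows `fused_argsort` comes out on top, and the time-to-percentage ratio implies a total of about 720 ms per run.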
Tuning...
[Task 1/25] Current/Best: 4431.72/5716.17 GFLOPS | Progress: (1960/2000) | 2297.28 s Done.
[Task 2/25] Current/Best: 3701.48/4152.80 GFLOPS | Progress: (1288/2000) | 1271.70 s Done.
[Task 3/25] Current/Best: 10083.22/14979.71 GFLOPS | Progress: (1960/2000) | 2623.48 s Done.
[Task 4/25] Current/Best: 2030.68/2479.63 GFLOPS | Progress: (1120/2000) | 1155.58 s Done.
[Task 5/25] Current/Best: 1665.80/5162.81 GFLOPS | Progress: (1008/2000) | 1047.14 s Done.
[Task 6/25] Current/Best: 2229.39/5461.95 GFLOPS | Progress: (1400/2000) | 1565.77 s Done.
[Task 7/25] Current/Best: 5482.10/7710.09 GFLOPS | Progress: (1288/2000) | 1476.95 s Done.
[Task 8/25] Current/Best: 4041.71/4630.37 GFLOPS | Progress: (1120/2000) | 1236.02 s Done.
[Task 9/25] Current/Best: 4941.29/6309.03 GFLOPS | Progress: (840/2000) | 959.14 s Done.
[Task 10/25] Current/Best: 5063.09/5526.73 GFLOPS | Progress: (1176/2000) | 1629.97 s Done.
[Task 11/25] Current/Best: 4385.40/5186.10 GFLOPS | Progress: (1064/2000) | 1202.06 s Done.
[Task 12/25] Current/Best: 6342.40/7178.71 GFLOPS | Progress: (784/2000) | 891.74 s Done.
[Task 13/25] Current/Best: 5164.69/6910.44 GFLOPS | Progress: (1736/2000) | 2001.48 s Done.
[Task 14/25] Current/Best: 4362.72/6038.71 GFLOPS | Progress: (952/2000) | 1018.48 s Done.
[Task 15/25] Current/Best: 7109.94/8156.93 GFLOPS | Progress: (840/2000) | 1088.19 s Done.
[Task 16/25] Current/Best: 1502.97/5858.54 GFLOPS | Progress: (1008/2000) | 1096.25 s Done.
[Task 17/25] Current/Best: 4842.55/6618.93 GFLOPS | Progress: (616/2000) | 670.73 s Done.
[Task 18/25] Current/Best: 3586.62/6484.18 GFLOPS | Progress: (1288/2000) | 1459.45 s Done.
[Task 19/25] Current/Best: 3652.60/5096.29 GFLOPS | Progress: (840/2000) | 896.95 s Done.
[Task 20/25] Current/Best: 7877.72/10661.84 GFLOPS | Progress: (672/2000) | 890.28 s Done.
[Task 21/25] Current/Best: 4341.54/5744.97 GFLOPS | Progress: (672/2000) | 725.49 s Done.
[Task 22/25] Current/Best: 4031.10/5697.96 GFLOPS | Progress: (840/2000) | 849.97 s Done.
[Task 23/25] Current/Best: 5242.28/6344.03 GFLOPS | Progress: (1064/2000) | 1133.34 s Done.
[Task 24/25] Current/Best: 9742.05/11307.57 GFLOPS | Progress: (1064/2000) | 1567.20 s Done.
[Task 25/25] Current/Best: 4408.99/5533.57 GFLOPS | Progress: (672/2000) | 672.80 s Done.
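The tuning log can be summarized in the same way. The sketch below is a hedged Python example that assumes the line format printed above. `LINE` and `parse_tuning_log` are illustrative names rather than autotvm functions, and `SAMPLE_LOG` contains three lines copied from the log.

```python
import re

# Matches lines such as:
# [Task 1/25] Current/Best: 4431.72/5716.17 GFLOPS | Progress: (1960/2000) | 2297.28 s Done.
LINE = re.compile(
    r"\[Task\s+(\d+)/(\d+)\]\s+Current/Best:\s+([\d.]+)/([\d.]+)\s+GFLOPS"
    r"\s+\|\s+Progress:\s+\((\d+)/(\d+)\)\s+\|\s+([\d.]+)\s+s"
)


def parse_tuning_log(text):
    """Extract task index, best GFLOPS, trials run, and elapsed seconds."""
    tasks = []
    for m in LINE.finditer(text):
        tasks.append({
            "task": int(m.group(1)),
            "best_gflops": float(m.group(4)),
            "trials": int(m.group(5)),
            "seconds": float(m.group(7)),
        })
    return tasks


SAMPLE_LOG = """\
[Task 1/25] Current/Best: 4431.72/5716.17 GFLOPS | Progress: (1960/2000) | 2297.28 s Done.
[Task 3/25] Current/Best: 10083.22/14979.71 GFLOPS | Progress: (1960/2000) | 2623.48 s Done.
[Task 4/25] Current/Best: 2030.68/2479.63 GFLOPS | Progress: (1120/2000) | 1155.58 s Done.
"""

tasks = parse_tuning_log(SAMPLE_LOG)
slowest = min(tasks, key=lambda t: t["best_gflops"])
total_seconds = sum(t["seconds"] for t in tasks)
total_trials = sum(t["trials"] for t in tasks)

print(f"lowest best throughput: task {slowest['task']} "
      f"at {slowest['best_gflops']:.0f} GFLOPS")
print(f"tuning time for these tasks: {total_seconds / 60:.0f} min "
      f"over {total_trials} trials")
```

Sorting tasks by best throughput is a quick way to spot workloads that may deserve a closer look, although the log alone does not say which task corresponds to which layer.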