Conv with 7x7 kernel runs considerably slower than the tuned FLOPS suggest

I’m trying to add TVM as an inference runtime for SimpleDet.
Unlike what Wuwei Lin has done, I have managed to implement all the needed operators, such as proposal and decode bbox, in pure nnvm operators, so that TVM developers are freed from paying special attention to detection tasks.
Now Faster R-CNN, Mask R-CNN, FPN, and RetinaNet all run correctly on both the cuda and llvm backends (almost identical APs to MXNet).

But there are some performance issues I’d like to discuss with the TVM community before rolling it out publicly.

1. The 7x7 conv for an input of shape 1x3x800x1200 amounts to about 9 GFLOPs, and after autotvm I get a kernel that runs at about 7 TFLOPS. I would expect this conv to take roughly 1 ms, but quite surprisingly it takes 20 ms, which accounts for almost half of the backbone's runtime. Running the whole module twice gives an even worse 70 ms.
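
The back-of-envelope math behind that expectation (a rough estimate only, assuming the tuned throughput is sustained inside the full graph):

```python
# Rough estimate: FLOP count and throughput as quoted above.
conv_flops = 9e9            # ~9 GFLOPs for the 7x7 stem conv
tuned_flops = 7e12          # ~7 TFLOPS reported by autotvm
print("expected: %.2f ms" % (conv_flops / tuned_flops * 1e3))  # ~1.3 ms
print("observed: ~21 ms for fused_nn_conv2d_add_nn_relu_10 in the profile below")
```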

2. Argsort takes 455 ms to run on a 56k-element input. Is anyone looking at this? If not, I am willing to contribute.
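
For scale, a plain single-core numpy argsort over the same 56k scores is on the order of a few milliseconds, so 455 ms for the CUDA kernel looks far from optimal. A quick sanity check (shape taken from the profile, random scores):

```python
# Time a descending argsort over a (1, 56250) score tensor with numpy.
import time
import numpy as np

scores = np.random.rand(1, 56250).astype('float32')
start = time.time()
for _ in range(100):
    np.argsort(-scores, axis=1)   # descending order, as used for proposal ranking
print("numpy argsort: %.2f ms / run" % ((time.time() - start) / 100 * 1e3))
```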

3. Convs with a large batch size and small spatial dims, like (1000, 2048, 7, 7), take about 2x as long as with cuDNN: TVM takes 20 ms while cuDNN takes 9 ms.
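
A cuDNN baseline for this kind of workload can be timed through MXNet (the framework SimpleDet uses); a minimal sketch along these lines, using the (1000, 512, 7, 7) -> 2048-channel 1x1 conv from the profile (numbers will of course vary with the GPU):

```python
# Hypothetical cuDNN timing via MXNet, not the exact benchmark script.
import time
import mxnet as mx

ctx = mx.gpu(0)
data = mx.nd.random.uniform(shape=(1000, 512, 7, 7), ctx=ctx)
weight = mx.nd.random.uniform(shape=(2048, 512, 1, 1), ctx=ctx)

for _ in range(5):   # warm up and let cuDNN pick its algorithm
    out = mx.nd.Convolution(data=data, weight=weight, kernel=(1, 1),
                            num_filter=2048, no_bias=True)
mx.nd.waitall()

start = time.time()
for _ in range(20):
    out = mx.nd.Convolution(data=data, weight=weight, kernel=(1, 1),
                            num_filter=2048, no_bias=True)
mx.nd.waitall()
print("cuDNN 1x1 conv: %.2f ms / run" % ((time.time() - start) / 20 * 1e3))
```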

Looking forward to some advice and suggestions.

Node Name                                                                                               Ops                                                                                                     Time(us)  Time(%)  Shape                  Inputs  Outputs
---------                                                                                               ---                                                                                                     --------  -------  -----                  ------  -------
fused_nn_conv2d_add_nn_relu_10                                                                          fused_nn_conv2d_add_nn_relu_10                                                                          21353.2   2.964    (1, 64, 600, 400)      3       1
fused_nn_max_pool2d                                                                                     fused_nn_max_pool2d                                                                                     225.366   0.031    (1, 64, 300, 200)      1       1
fused_nn_conv2d_add_nn_relu_9                                                                           fused_nn_conv2d_add_nn_relu_9                                                                           130.447   0.018    (1, 64, 300, 200)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3                                 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3                                 824.072   0.114    (1, 64, 300, 200)      3       1
fused_nn_conv2d_add                                                                                     fused_nn_conv2d_add                                                                                     433.473   0.06     (1, 256, 300, 200)     3       1
fused_nn_conv2d_add_add_nn_relu_3                                                                       fused_nn_conv2d_add_add_nn_relu_3                                                                       660.886   0.092    (1, 256, 300, 200)     4       1
fused_nn_conv2d_add_nn_relu_8                                                                           fused_nn_conv2d_add_nn_relu_8                                                                           352.184   0.049    (1, 64, 300, 200)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_31                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3                                 825.523   0.115    (1, 64, 300, 200)      3       1
fused_nn_conv2d_add_add_nn_relu_31                                                                      fused_nn_conv2d_add_add_nn_relu_3                                                                       665.566   0.092    (1, 256, 300, 200)     4       1
fused_nn_conv2d_add_nn_relu_81                                                                          fused_nn_conv2d_add_nn_relu_8                                                                           350.198   0.049    (1, 64, 300, 200)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_32                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_3                                 822.122   0.114    (1, 64, 300, 200)      3       1
fused_nn_conv2d_add_add_nn_relu_32                                                                      fused_nn_conv2d_add_add_nn_relu_3                                                                       667.212   0.093    (1, 256, 300, 200)     4       1
fused_nn_conv2d_add_nn_relu_7                                                                           fused_nn_conv2d_add_nn_relu_7                                                                           612.872   0.085    (1, 128, 300, 200)     3       1
fused_nn_conv2d_add_nn_relu_6                                                                           fused_nn_conv2d_add_nn_relu_6                                                                           721.025   0.1      (1, 128, 150, 100)     3       1
fused_nn_conv2d_add_1                                                                                   fused_nn_conv2d_add_1                                                                                   782.087   0.109    (1, 512, 150, 100)     3       1
fused_nn_conv2d_add_add_nn_relu_2                                                                       fused_nn_conv2d_add_add_nn_relu_2                                                                       544.516   0.076    (1, 512, 150, 100)     4       1
fused_nn_conv2d_add_nn_relu_5                                                                           fused_nn_conv2d_add_nn_relu_5                                                                           378.919   0.053    (1, 128, 150, 100)     3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2                                 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2                                 586.879   0.081    (1, 128, 150, 100)     3       1
fused_nn_conv2d_add_add_nn_relu_21                                                                      fused_nn_conv2d_add_add_nn_relu_2                                                                       545.552   0.076    (1, 512, 150, 100)     4       1
fused_nn_conv2d_add_nn_relu_51                                                                          fused_nn_conv2d_add_nn_relu_5                                                                           378.946   0.053    (1, 128, 150, 100)     3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_21                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2                                 584.278   0.081    (1, 128, 150, 100)     3       1
fused_nn_conv2d_add_add_nn_relu_22                                                                      fused_nn_conv2d_add_add_nn_relu_2                                                                       538.901   0.075    (1, 512, 150, 100)     4       1
fused_nn_conv2d_add_nn_relu_52                                                                          fused_nn_conv2d_add_nn_relu_5                                                                           377.892   0.052    (1, 128, 150, 100)     3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_22                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_2                                 583.464   0.081    (1, 128, 150, 100)     3       1
fused_nn_conv2d_add_add_nn_relu_23                                                                      fused_nn_conv2d_add_add_nn_relu_2                                                                       535.548   0.074    (1, 512, 150, 100)     4       1
fused_nn_conv2d_add_nn_relu_4                                                                           fused_nn_conv2d_add_nn_relu_4                                                                           651.389   0.09     (1, 256, 150, 100)     3       1
fused_nn_conv2d_add_nn_relu_3                                                                           fused_nn_conv2d_add_nn_relu_3                                                                           765.693   0.106    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_2                                                                                   fused_nn_conv2d_add_2                                                                                   837.883   0.116    (1, 1024, 75, 50)      3       1
fused_nn_conv2d_add_add_nn_relu_1                                                                       fused_nn_conv2d_add_add_nn_relu_1                                                                       411.432   0.057    (1, 1024, 75, 50)      4       1
fused_nn_conv2d_add_nn_relu_2                                                                           fused_nn_conv2d_add_nn_relu_2                                                                           429.464   0.06     (1, 256, 75, 50)       3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 441.932   0.061    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_add_nn_relu_11                                                                      fused_nn_conv2d_add_add_nn_relu_1                                                                       409.714   0.057    (1, 1024, 75, 50)      4       1
fused_nn_conv2d_add_nn_relu_21                                                                          fused_nn_conv2d_add_nn_relu_2                                                                           427.215   0.059    (1, 256, 75, 50)       3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_11                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 437.224   0.061    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_add_nn_relu_12                                                                      fused_nn_conv2d_add_add_nn_relu_1                                                                       409.383   0.057    (1, 1024, 75, 50)      4       1
fused_nn_conv2d_add_nn_relu_22                                                                          fused_nn_conv2d_add_nn_relu_2                                                                           426.843   0.059    (1, 256, 75, 50)       3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_12                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 439.42    0.061    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_add_nn_relu_13                                                                      fused_nn_conv2d_add_add_nn_relu_1                                                                       409.185   0.057    (1, 1024, 75, 50)      4       1
fused_nn_conv2d_add_nn_relu_23                                                                          fused_nn_conv2d_add_nn_relu_2                                                                           427.26    0.059    (1, 256, 75, 50)       3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_13                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 438.023   0.061    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_add_nn_relu_14                                                                      fused_nn_conv2d_add_add_nn_relu_1                                                                       409.543   0.057    (1, 1024, 75, 50)      4       1
fused_nn_conv2d_add_nn_relu_24                                                                          fused_nn_conv2d_add_nn_relu_2                                                                           427.47    0.059    (1, 256, 75, 50)       3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_14                                fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_1                                 438.185   0.061    (1, 256, 75, 50)       3       1
fused_nn_conv2d_add_add_nn_relu_15                                                                      fused_nn_conv2d_add_add_nn_relu_1                                                                       408.131   0.057    (1, 1024, 75, 50)      4       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_4                                 fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu_4                                 2584.37   0.359    (1, 512, 75, 50)       3       1
fused_nn_conv2d_add_3                                                                                   fused_nn_conv2d_add_3                                                                                   73.911    0.01     (1, 30, 75, 50)        3       1
fused_reshape_3                                                                                         fused_reshape_3                                                                                         12.871    0.002    (1, 2, 15, 75, 50)     1       1
fused_nn_softmax_1                                                                                      fused_nn_softmax_1                                                                                      23.772    0.003    (1, 2, 15, 75, 50)     1       1
fused_reshape_2                                                                                         fused_reshape_2                                                                                         11.936    0.002    (1, 30, 75, 50)        1       1
fused_arange_reshape_broadcast_to_reshape_slice_like_broadcast_to_reshape_stride_10214900708559545960_  fused_arange_reshape_broadcast_to_reshape_slice_like_broadcast_to_reshape_stride_10214900708559545960_  11.777    0.002    (1, 56250)             2       1
fused_argsort                                                                                           fused_argsort                                                                                           445590.0  61.855   (1, 56250)             1       1
fused_nn_conv2d_add_4                                                                                   fused_nn_conv2d_add_4                                                                                   79.326    0.011    (1, 60, 75, 50)        3       1
fused_expand_dims_split                                                                                 fused_expand_dims_split                                                                                 23.175    0.003    (1, 1, 1)              1       3
fused_expand_dims_split                                                                                 fused_expand_dims_split                                                                                 23.175    0.003    (1, 1, 1)              1       3
fused_expand_dims_split                                                                                 fused_expand_dims_split                                                                                 23.175    0.003    (1, 1, 1)              1       3
fused_arange_reshape_broadcast_to_reshape_strided_slice_reshape_stack_gather_nd__1662028809018444607_   fused_arange_reshape_broadcast_to_reshape_strided_slice_reshape_stack_gather_nd__1662028809018444607_   18.633    0.003    (1, 6000, 6)           8       1
fused_vision_get_valid_counts                                                                           fused_vision_get_valid_counts                                                                           37.117    0.005    (1,)                   1       2
fused_vision_get_valid_counts                                                                           fused_vision_get_valid_counts                                                                           37.117    0.005    (1, 6000, 6)           1       2
fused_vision_non_max_suppression                                                                        fused_vision_non_max_suppression                                                                        6980.58   0.969    (1, 6000, 6)           2       1
fused_strided_slice_strided_slice                                                                       fused_strided_slice_strided_slice                                                                       13.944    0.002    (1, 1000, 4)           1       1
fused_vision_roi_align                                                                                  fused_vision_roi_align                                                                                  3449.59   0.479    (1, 1000, 1024, 7, 7)  2       1
fused_reshape_1                                                                                         fused_reshape_1                                                                                         1143.54   0.159    (1000, 1024, 7, 7)     1       1
fused_nn_conv2d_add_nn_relu_1                                                                           fused_nn_conv2d_add_nn_relu_1                                                                           9089.74   1.262    (1000, 512, 7, 7)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu                                   fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu                                   20138.2   2.796    (1000, 512, 7, 7)      3       1
fused_nn_conv2d_add_5                                                                                   fused_nn_conv2d_add_5                                                                                   36522.4   5.07     (1000, 2048, 7, 7)     3       1
fused_nn_conv2d_add_add_nn_relu                                                                         fused_nn_conv2d_add_add_nn_relu                                                                         20964.4   2.91     (1000, 2048, 7, 7)     4       1
fused_nn_conv2d_add_nn_relu                                                                             fused_nn_conv2d_add_nn_relu                                                                             17931.1   2.489    (1000, 512, 7, 7)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu1                                  fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu                                   21757.3   3.02     (1000, 512, 7, 7)      3       1
fused_nn_conv2d_add_add_nn_relu1                                                                        fused_nn_conv2d_add_add_nn_relu                                                                         21441.5   2.976    (1000, 2048, 7, 7)     4       1
fused_nn_conv2d_add_nn_relu1                                                                            fused_nn_conv2d_add_nn_relu                                                                             17789.9   2.47     (1000, 512, 7, 7)      3       1
fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu2                                  fused_nn_contrib_conv2d_winograd_without_weight_transform_add_nn_relu                                   21733.7   3.017    (1000, 512, 7, 7)      3       1
fused_nn_conv2d_add_add_nn_relu_cast                                                                    fused_nn_conv2d_add_add_nn_relu_cast                                                                    21234.0   2.948    (1000, 2048, 7, 7)     4       1
fused_nn_global_avg_pool2d                                                                              fused_nn_global_avg_pool2d                                                                              6932.16   0.962    (1000, 2048, 1, 1)     1       1
fused_nn_batch_flatten                                                                                  fused_nn_batch_flatten                                                                                  53.207    0.007    (1000, 2048)           1       1
fused_nn_dense_add                                                                                      fused_nn_dense_add                                                                                      1023.42   0.142    (1000, 81)             3       1
fused_nn_softmax                                                                                        fused_nn_softmax                                                                                        27.016    0.004    (1000, 81)             1       1
fused_reshape                                                                                           fused_reshape                                                                                           12.118    0.002    (1, 1000, 81)          1       1
fused_nn_dense_add_1                                                                                    fused_nn_dense_add_1                                                                                    127.332   0.018    (1000, 8)              3       1
fused_reshape_strided_slice_split_multiply_add_strided_slice_split_subtract_add__8298732747436919636_   fused_reshape_strided_slice_split_multiply_add_strided_slice_split_subtract_add__8298732747436919636_   14.025    0.002    (1, 1000, 4)           5       1


Tuning...
[Task  1/25]  Current/Best: 4431.72/5716.17 GFLOPS | Progress: (1960/2000) | 2297.28 s Done.
[Task  2/25]  Current/Best: 3701.48/4152.80 GFLOPS | Progress: (1288/2000) | 1271.70 s Done.
[Task  3/25]  Current/Best: 10083.22/14979.71 GFLOPS | Progress: (1960/2000) | 2623.48 s Done.
[Task  4/25]  Current/Best: 2030.68/2479.63 GFLOPS | Progress: (1120/2000) | 1155.58 s Done.
[Task  5/25]  Current/Best: 1665.80/5162.81 GFLOPS | Progress: (1008/2000) | 1047.14 s Done.
[Task  6/25]  Current/Best: 2229.39/5461.95 GFLOPS | Progress: (1400/2000) | 1565.77 s Done.
[Task  7/25]  Current/Best: 5482.10/7710.09 GFLOPS | Progress: (1288/2000) | 1476.95 s Done.
[Task  8/25]  Current/Best: 4041.71/4630.37 GFLOPS | Progress: (1120/2000) | 1236.02 s Done.
[Task  9/25]  Current/Best: 4941.29/6309.03 GFLOPS | Progress: (840/2000) | 959.14 s Done.
[Task 10/25]  Current/Best: 5063.09/5526.73 GFLOPS | Progress: (1176/2000) | 1629.97 s Done.
[Task 11/25]  Current/Best: 4385.40/5186.10 GFLOPS | Progress: (1064/2000) | 1202.06 s Done.
[Task 12/25]  Current/Best: 6342.40/7178.71 GFLOPS | Progress: (784/2000) | 891.74 s Done.
[Task 13/25]  Current/Best: 5164.69/6910.44 GFLOPS | Progress: (1736/2000) | 2001.48 s Done.
[Task 14/25]  Current/Best: 4362.72/6038.71 GFLOPS | Progress: (952/2000) | 1018.48 s Done.
[Task 15/25]  Current/Best: 7109.94/8156.93 GFLOPS | Progress: (840/2000) | 1088.19 s Done.
[Task 16/25]  Current/Best: 1502.97/5858.54 GFLOPS | Progress: (1008/2000) | 1096.25 s Done.
[Task 17/25]  Current/Best: 4842.55/6618.93 GFLOPS | Progress: (616/2000) | 670.73 s Done.
[Task 18/25]  Current/Best: 3586.62/6484.18 GFLOPS | Progress: (1288/2000) | 1459.45 s Done.
[Task 19/25]  Current/Best: 3652.60/5096.29 GFLOPS | Progress: (840/2000) | 896.95 s Done.
[Task 20/25]  Current/Best: 7877.72/10661.84 GFLOPS | Progress: (672/2000) | 890.28 s Done.
[Task 21/25]  Current/Best: 4341.54/5744.97 GFLOPS | Progress: (672/2000) | 725.49 s Done.
[Task 22/25]  Current/Best: 4031.10/5697.96 GFLOPS | Progress: (840/2000) | 849.97 s Done.
[Task 23/25]  Current/Best: 5242.28/6344.03 GFLOPS | Progress: (1064/2000) | 1133.34 s Done.
[Task 24/25]  Current/Best: 9742.05/11307.57 GFLOPS | Progress: (1064/2000) | 1567.20 s Done.
[Task 25/25]  Current/Best: 4408.99/5533.57 GFLOPS | Progress: (672/2000) | 672.80 s Done.

Only the best picks are provided below due to the post length limit.

{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1000, 1024, 7, 7], "float32"], ["TENSOR", [2048, 1024, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1000, 1024, 7, 7, "float32"], [2048, 1024, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 148912, "t": "direct", "c": null, "e": [["tile_f", "sp", [32, 8, 8, 1]], ["tile_y", "sp", [1, 7, 1, 1]], ["tile_x", "sp", [1, 1, 7, 1]], ["tile_rc", "sp", [128, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}], "r": [[0.0359543202], 0, 14.21947169303894, 1560760506.9757748], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 75, 50], "float32"], ["TENSOR", [60, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 75, 50, "float32"], [60, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 13651485, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 1, 12, 5]], ["tile_y", "sp", [25, 3, 1, 1]], ["tile_x", "sp", [5, 1, 10, 1]], ["tile_rc", "sp", [64, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[5.548067600741084e-05], 0, 28.924411296844482, 1560761905.8601863], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 1024, 75, 50], "float32"], ["TENSOR", [512, 1024, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 75, 50, "float32"], [512, 1024, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {"i": 2047667, "t": "winograd", "c": null, "e": [["tile_b", "sp", [16, 1, 1, 1]], ["tile_y", "sp", [4, 2, 16, 4]], ["tile_x", "sp", [19, 5, 10, 1]], ["tile_rc", "sp", [128, 8]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.002362492046875], 0, 9.483803510665894, 1560764386.924387], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 75, 50], "float32"], ["TENSOR", [30, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 75, 50, "float32"], [30, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 3514933, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 1, 6, 5]], ["tile_y", "sp", [75, 1, 1, 1]], ["tile_x", "sp", [5, 1, 10, 1]], ["tile_rc", "sp", [32, 16]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}], "r": [[4.645853228924981e-05], 0, 18.03250002861023, 1560765712.690999], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 150, 100], "float32"], ["TENSOR", [1024, 512, 1, 1], "float32"], [2, 2], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 150, 100, "float32"], [1024, 512, 1, 1, "float32"], [2, 2], [0, 0], [1, 1], "NCHW", "float32"], {"i": 14803462, "t": "direct", "c": null, "e": [["tile_f", "sp", [8, 4, 16, 2]], ["tile_y", "sp", [75, 1, 1, 1]], ["tile_x", "sp", [1, 5, 10, 1]], ["tile_rc", "sp", [128, 4]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}], "r": [[0.0007616319305993691], 0, 17.055360555648804, 1560766821.2114425], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 300, 200], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [2, 2], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 300, 200, "float32"], [512, 256, 1, 1, "float32"], [2, 2], [0, 0], [1, 1], "NCHW", "float32"], {"i": 39295172, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 4, 32, 2]], ["tile_y", "sp", [75, 1, 1, 2]], ["tile_x", "sp", [5, 5, 4, 1]], ["tile_rc", "sp", [64, 4]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.0007199190717703349], 0, 26.579092741012573, 1560768454.0396535], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 3, 1200, 800], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 3, 1200, 800, "float32"], [64, 3, 7, 7, "float32"], [2, 2], [3, 3], [1, 1], "NCHW", "float32"], {"i": 991939279, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 1, 4, 16]], ["tile_y", "sp", [600, 1, 1, 1]], ["tile_x", "sp", [5, 5, 16, 1]], ["tile_rc", "sp", [3, 1]], ["tile_ry", "sp", [1, 7]], ["tile_rx", "sp", [7, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.0005857053385214007], 0, 28.493815183639526, 1560769961.7680717], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 300, 200], "float32"], ["TENSOR", [64, 64, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 300, 200, "float32"], [64, 64, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 209531185, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 1, 8, 8]], ["tile_y", "sp", [150, 1, 2, 1]], ["tile_x", "sp", [5, 5, 8, 1]], ["tile_rc", "sp", [8, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}], "r": [[0.00010615139364518976], 0, 4.5457189083099365, 1560771399.4905434], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 300, 200], "float32"], ["TENSOR", [64, 256, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 300, 200, "float32"], [64, 256, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 323771157, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 4, 8, 2]], ["tile_y", "sp", [150, 1, 2, 1]], ["tile_x", "sp", [5, 5, 8, 1]], ["tile_rc", "sp", [32, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.00031162955076923074], 0, 3.72472882270813, 1560772410.7948742], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 300, 200], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 300, 200, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {"i": 7343275, "t": "winograd", "c": null, "e": [["tile_b", "sp", [16, 1, 1, 1]], ["tile_y", "sp", [1, 1, 4, 16]], ["tile_x", "sp", [125, 2, 30, 2]], ["tile_rc", "sp", [8, 8]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]]}], "r": [[0.0008004157946127947], 0, 5.334130764007568, 1560773916.4439282], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 64, 300, 200], "float32"], ["TENSOR", [256, 64, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 300, 200, "float32"], [256, 64, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 134376073, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 4, 16, 2]], ["tile_y", "sp", [300, 1, 1, 1]], ["tile_x", "sp", [5, 5, 8, 1]], ["tile_rc", "sp", [8, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.0003791056136363636], 0, 8.464695930480957, 1560775466.169432], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 300, 200], "float32"], ["TENSOR", [128, 256, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 300, 200, "float32"], [128, 256, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 464450220, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 4, 16, 2]], ["tile_y", "sp", [150, 1, 2, 1]], ["tile_x", "sp", [5, 5, 4, 2]], ["tile_rc", "sp", [32, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.0005477527278911564], 0, 9.310503721237183, 1560776460.44932], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 300, 200], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 300, 200, "float32"], [128, 128, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {"i": 293114428, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 4, 16, 1]], ["tile_y", "sp", [25, 1, 3, 2]], ["tile_x", "sp", [10, 1, 2, 5]], ["tile_rc", "sp", [128, 1]], ["tile_ry", "sp", [1, 3]], ["tile_rx", "sp", [1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}], "r": [[0.0006401443386243387], 0, 13.907016515731812, 1560778504.3489592], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 150, 100], "float32"], ["TENSOR", [128, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 150, 100, "float32"], [128, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 102049740, "t": "direct", "c": null, "e": [["tile_f", "sp", [1, 4, 16, 2]], ["tile_y", "sp", [15, 5, 2, 1]], ["tile_x", "sp", [25, 1, 4, 1]], ["tile_rc", "sp", [64, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.00032557958351409977], 0, 25.70450258255005, 1560779795.5078504], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 150, 100], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 150, 100, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {"i": 2964459, "t": "winograd", "c": null, "e": [["tile_b", "sp", [16, 1, 1, 1]], ["tile_y", "sp", [1, 1, 16, 8]], ["tile_x", "sp", [75, 5, 10, 1]], ["tile_rc", "sp", [8, 16]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.0005423218050541516], 0, 9.853091955184937, 1560780822.980578], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 128, 150, 100], "float32"], ["TENSOR", [512, 128, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 150, 100, "float32"], [512, 128, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 123779082, "t": "direct", "c": null, "e": [["tile_f", "sp", [4, 4, 32, 1]], ["tile_y", "sp", [50, 1, 1, 3]], ["tile_x", "sp", [5, 5, 4, 1]], ["tile_rc", "sp", [16, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}], "r": [[0.00033559198044692736], 0, 10.219220399856567, 1560782324.7898612], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 512, 150, 100], "float32"], ["TENSOR", [256, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 150, 100, "float32"], [256, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 34744453, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 4, 16, 2]], ["tile_y", "sp", [75, 1, 2, 1]], ["tile_x", "sp", [5, 5, 4, 1]], ["tile_rc", "sp", [64, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.0005940778228228228], 0, 4.457234621047974, 1560783219.5559137], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 150, 100], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 150, 100, "float32"], [256, 256, 3, 3, "float32"], [2, 2], [1, 1], [1, 1], "NCHW", "float32"], {"i": 45171496, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 1, 32, 4]], ["tile_y", "sp", [25, 1, 3, 1]], ["tile_x", "sp", [2, 25, 1, 1]], ["tile_rc", "sp", [256, 1]], ["tile_ry", "sp", [1, 3]], ["tile_rx", "sp", [1, 3]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}], "r": [[0.0006822266363636363], 0, 10.08177924156189, 1560784907.6363897], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 1024, 75, 50], "float32"], ["TENSOR", [256, 1024, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 75, 50, "float32"], [256, 1024, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 3481210, "t": "direct", "c": null, "e": [["tile_f", "sp", [2, 2, 64, 1]], ["tile_y", "sp", [25, 1, 1, 3]], ["tile_x", "sp", [5, 5, 2, 1]], ["tile_rc", "sp", [256, 4]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.0003857864], 0, 10.819404602050781, 1560785973.318506], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 75, 50], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 75, 50, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {"i": 1034182, "t": "winograd", "c": null, "e": [["tile_b", "sp", [16, 1, 1, 1]], ["tile_y", "sp", [2, 1, 16, 8]], ["tile_x", "sp", [19, 5, 10, 1]], ["tile_rc", "sp", [32, 8]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]]}], "r": [[0.00041490784806629834], 0, 13.476664781570435, 1560786710.4725316], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1, 256, 75, 50], "float32"], ["TENSOR", [1024, 256, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 75, 50, "float32"], [1024, 256, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 22582601, "t": "direct", "c": null, "e": [["tile_f", "sp", [8, 8, 16, 1]], ["tile_y", "sp", [75, 1, 1, 1]], ["tile_x", "sp", [1, 5, 10, 1]], ["tile_rc", "sp", [16, 16]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.0003422264039548023], 0, 17.08689785003662, 1560787835.9578903], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1000, 1024, 7, 7], "float32"], ["TENSOR", [512, 1024, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1000, 1024, 7, 7, "float32"], [512, 1024, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 199577, "t": "direct", "c": null, "e": [["tile_f", "sp", [4, 8, 16, 1]], ["tile_y", "sp", [1, 1, 1, 7]], ["tile_x", "sp", [1, 1, 7, 1]], ["tile_rc", "sp", [512, 2]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.009017299650000001], 0, 9.88717007637024, 1560789096.0973315], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1000, 2048, 7, 7], "float32"], ["TENSOR", [512, 2048, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1000, 2048, 7, 7, "float32"], [512, 2048, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 182021, "t": "direct", "c": null, "e": [["tile_f", "sp", [8, 4, 8, 2]], ["tile_y", "sp", [1, 1, 1, 7]], ["tile_x", "sp", [1, 1, 7, 1]], ["tile_rc", "sp", [256, 8]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 1]]}], "r": [[0.016197972550000002], 0, 9.636692762374878, 1560790470.9960752], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1000, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1000, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1], [1, 1], "NCHW", "float32"], {"i": 28017343, "t": "winograd", "c": null, "e": [["tile_b", "sp", [16, 1, 1, 1]], ["tile_y", "sp", [4, 4, 8, 4]], ["tile_x", "sp", [250, 4, 16, 1]], ["tile_rc", "sp", [64, 8]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}], "r": [[0.020447460049999998], 0, 9.400633096694946, 1560791892.5671005], "v": 0.1}
{"i": ["cuda -model=unknown", "topi_nn_conv2d", [["TENSOR", [1000, 512, 7, 7], "float32"], ["TENSOR", [2048, 512, 1, 1], "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1000, 512, 7, 7, "float32"], [2048, 512, 1, 1, "float32"], [1, 1], [0, 0], [1, 1], "NCHW", "float32"], {"i": 68113, "t": "direct", "c": null, "e": [["tile_f", "sp", [16, 8, 16, 1]], ["tile_y", "sp", [1, 1, 1, 7]], ["tile_x", "sp", [1, 1, 7, 1]], ["tile_rc", "sp", [256, 2]], ["tile_ry", "sp", [1, 1]], ["tile_rx", "sp", [1, 1]], ["auto_unroll_max_step", "ot", 512], ["unroll_explicit", "ot", 0]]}], "r": [[0.018570371250000002], 0, 20.369359016418457, 1560792975.1450012], "v": 0.1}

In my experience, larger batch sizes can be difficult to tune. You can try setting a larger n_trial, early_stopping, and n_iter in the SA optimizer.
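
For reference, roughly what that looks like with the autotvm API of this era (`task` is assumed to be one of the large-batch conv2d tasks extracted from the graph; the values are placeholders):

```python
# Tune one conv2d task with more trials, no early stopping, and a larger
# n_iter for the simulated-annealing search inside the XGB tuner.
from tvm import autotvm
from tvm.autotvm.tuner.sa_model_optimizer import SimulatedAnnealingOptimizer

# tasks = autotvm.task.extract_from_graph(...)   # usual tuning flow
# task = tasks[i]                                # e.g. a (1000, ...) conv2d task

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3, timeout=20))

optimizer = SimulatedAnnealingOptimizer(task, n_iter=1000)   # larger than the default
tuner = autotvm.tuner.XGBTuner(task, loss_type='rank', optimizer=optimizer)
tuner.tune(n_trial=6000,                # more measured configs per task
           early_stopping=None,         # disable early stopping
           measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('conv2d_batch1000.log')])
```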

cc @Laurawly @merrymercy @eqy

@vinx13 After tuning the large-batch convs with 6000 iterations and no early stopping, there is no significant improvement.

With a batch size of 1000, you can also try tiling over the batch dimension.

You may see some improvement by changing the conv2d template (a simple experiment would be to just substitute the batch dimension in place of the filter dimension or to fuse them before tiling).
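
To illustrate the idea, a toy direct schedule (not the actual topi conv2d template) that fuses the batch and output-channel axes before tiling, so the batch dimension also feeds the GPU parallelism:

```python
# Toy 1x1 conv in NCHW to sketch "fuse batch and filter, then tile".
import tvm

N, C, H, W, K = 1000, 1024, 7, 7, 2048
data = tvm.placeholder((N, C, H, W), name='data')
kernel = tvm.placeholder((K, C, 1, 1), name='kernel')
rc = tvm.reduce_axis((0, C), name='rc')
conv = tvm.compute(
    (N, K, H, W),
    lambda n, f, y, x: tvm.sum(data[n, rc, y, x] * kernel[f, rc, 0, 0], axis=rc),
    name='conv')

s = tvm.create_schedule(conv.op)
n, f, y, x = s[conv].op.axis
nf = s[conv].fuse(n, f)                  # treat (batch, out_channel) as one axis
bo, bi = s[conv].split(nf, factor=64)    # then tile it as the template tiles the filter axis
s[conv].bind(bo, tvm.thread_axis('blockIdx.x'))
s[conv].bind(bi, tvm.thread_axis('threadIdx.x'))
print(tvm.lower(s, [data, kernel, conv], simple_mode=True))
```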