Speed: why can't all networks be faster?

Several colleagues and I have been working with TVM for several weeks. We found that some networks, after tuning, are no faster than the original models (for example, PyTorch-based ones), while other models do become faster.

GPU memory usage drops noticeably in every case, and drops even more when running from C++, which is very good.

As for speed, why can't every network become faster?

We've also seen a similar issue. Some models show a performance regression while others become faster. Have you tried rerunning autotuning?

We tried tuning with AutoTVM. It seems that some networks just won't get any faster, which is really frustrating.

Which workloads cause the performance bottleneck in the slow network?

CenterNet is based on the DLA network. The log after 500 AutoTVM iterations is below. One 7*7 convolution layer has poor GFLOPS (the part marked in bold below). The libtorch model takes 30 ms, while TVM takes 108 ms after AutoTVM tuning.

{“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 152, 152], “float32”], [“TENSOR”, [2, 256, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 953954, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 2, 1]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 2, 19, 1]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0001107884], 0, 2.192681074142456, 1589206637.480167], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 19, 19], “float32”], [“TENSOR”, [512, 256, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 139965, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 64, 1]], [“tile_y”, “sp”, [-1, 1, 1, 1]], [“tile_x”, “sp”, [-1, 1, 1, 19]], [“tile_rc”, “sp”, [-1, 8]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[6.30488e-05], 0, 2.2756543159484863, 1589207242.555895], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 38, 38], “float32”], [“TENSOR”, [512, 256, 3, 3], “float32”], [2, 2], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 698757, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 8, 16, 1]], [“tile_y”, “sp”, [-1, 1, 1, 1]], [“tile_x”, “sp”, [-1, 1, 19, 1]], [“tile_rc”, “sp”, [-1, 1]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0007544763999999999], 0, 0.7088687419891357, 1589207308.542092], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 512, 19, 19], “float32”], [“TENSOR”, [512, 512, 3, 3], “float32”], [1, 1], [1, 1, 1, 
1], [1, 1], “float32”], {}], “config”: {“index”: 513603, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 4, 8, 4]], [“tile_x”, “sp”, [-1, 2, 50, 1]], [“tile_rc”, “sp”, [-1, 8]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0002702084], 0, 1.519629716873169, 1589210244.0758305], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 512, 19, 19], “float32”], [“TENSOR”, [512, 512, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 357402, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 8, 4]], [“tile_y”, “sp”, [-1, 1, 1, 1]], [“tile_x”, “sp”, [-1, 1, 19, 1]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0011266126], 0, 1.110652208328247, 1589210979.4589293], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 1280, 19, 19], “float32”], [“TENSOR”, [512, 1280, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 20285, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 64, 1]], [“tile_y”, “sp”, [-1, 1, 1, 1]], [“tile_x”, “sp”, [-1, 1, 1, 19]], [“tile_rc”, “sp”, [-1, 10]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.00024457659999999997], 0, 1.0802662372589111, 1589211900.5497859], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 512, 19, 19], “float32”], [“TENSOR”, [256, 512, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 115535, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 32, 1]], [“tile_y”, “sp”, [-1, 1, 1, 1]], [“tile_x”, “sp”, [-1, 1, 1, 19]], [“tile_rc”, “sp”, [-1, 8]], 
[“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[6.33934e-05], 0, 1.5696065425872803, 1589212281.4407008], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 512, 38, 38], “float32”], [“TENSOR”, [256, 512, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 84390, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 1, 32, 2]], [“tile_x”, “sp”, [-1, 19, 1, 1]], [“tile_rc”, “sp”, [-1, 2]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0006047687999999999], 0, 2.7428460121154785, 1589213641.0142717], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 512, 38, 38], “float32”], [“TENSOR”, [256, 512, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 3034055, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 32, 1]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 2, 1, 19]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0005977954], 0, 2.14514422416687, 1589214995.6112628], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 38, 38], “float32”], [“TENSOR”, [256, 128, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 1110671, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 8, 2, 2]], [“tile_y”, “sp”, [-1, 1, 19, 2]], [“tile_x”, “sp”, [-1, 1, 2, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 1]]}, “result”: [[5.4277e-05], 0, 0.8081059455871582, 1589252464.3522549], “version”: 0.2, 
“tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 76, 76], “float32”], [“TENSOR”, [256, 128, 3, 3], “float32”], [2, 2], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 5478688, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 16, 8, 1]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 1, 19, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0004979601999999999], 0, 0.714576244354248, 1589252609.129768], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 512, 38, 38], “float32”], [“TENSOR”, [256, 512, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 948510, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 2, 4]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 2, 19, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0001393604], 0, 1.0768027305603027, 1589253805.4516308], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 256, 38, 38], “float32”], [“TENSOR”, [256, 256, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 79827, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 4, 1, 16]], [“tile_x”, “sp”, [-1, 1, 19, 1]], [“tile_rc”, “sp”, [-1, 8]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.00049337], 0, 1.4300451278686523, 1589254372.1657565], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 38, 38], “float32”], [“TENSOR”, [256, 256, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], 
“config”: {“index”: 438525, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 4, 8]], [“tile_y”, “sp”, [-1, 2, 1, 1]], [“tile_x”, “sp”, [-1, 1, 19, 1]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0008059686], 0, 0.8656125068664551, 1589255813.2522068], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 896, 38, 38], “float32”], [“TENSOR”, [256, 896, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 3538335, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 32, 2]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 1, 1, 19]], [“tile_rc”, “sp”, [-1, 7]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0002424996], 0, 0.9676973819732666, 1589256572.6785383], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 38, 38], “float32”], [“TENSOR”, [128, 256, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 1519856, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 8, 2]], [“tile_y”, “sp”, [-1, 1, 1, 2]], [“tile_x”, “sp”, [-1, 2, 19, 1]], [“tile_rc”, “sp”, [-1, 16]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[4.5609200000000004e-05], 0, 1.233633279800415, 1589257275.8399825], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 256, 76, 76], “float32”], [“TENSOR”, [128, 256, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 470002, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 1, 16, 4]], [“tile_x”, “sp”, [-1, 19, 
4, 1]], [“tile_rc”, “sp”, [-1, 8]], [“auto_unroll_max_step”, “ot”, 128], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0005055452], 0, 1.1845457553863525, 1589258209.1853924], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 76, 76], “float32”], [“TENSOR”, [128, 256, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 39841258, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 16, 2]], [“tile_y”, “sp”, [-1, 1, 4, 1]], [“tile_x”, “sp”, [-1, 19, 1, 2]], [“tile_rc”, “sp”, [-1, 1]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0006683692], 0, 1.4335763454437256, 1589259195.5899866], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 64, 76, 76], “float32”], [“TENSOR”, [128, 64, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 5843867, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 4, 16]], [“tile_y”, “sp”, [-1, 1, 1, 2]], [“tile_x”, “sp”, [-1, 1, 76, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[4.7828200000000004e-05], 0, 0.8872570991516113, 1589259202.0296946], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 64, 152, 152], “float32”], [“TENSOR”, [128, 64, 3, 3], “float32”], [2, 2], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 20458347, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 16, 1]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 1, 2, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0005543442], 0, 1.0131065845489502, 
1589260730.2821236], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 76, 76], “float32”], [“TENSOR”, [128, 256, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 3914212, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 8, 4, 2]], [“tile_y”, “sp”, [-1, 1, 1, 2]], [“tile_x”, “sp”, [-1, 1, 38, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0001424364], 0, 0.8473877906799316, 1589261373.4617639], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 128, 76, 76], “float32”], [“TENSOR”, [128, 128, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 434340, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 4, 16, 2]], [“tile_x”, “sp”, [-1, 2, 19, 1]], [“tile_rc”, “sp”, [-1, 16]], [“auto_unroll_max_step”, “ot”, 128], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.000381914], 0, 0.7877893447875977, 1589261389.6940064], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 76, 76], “float32”], [“TENSOR”, [128, 128, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 4839175, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 8, 2]], [“tile_y”, “sp”, [-1, 1, 2, 1]], [“tile_x”, “sp”, [-1, 19, 2, 1]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0003446886], 0, 1.233992576599121, 1589265095.6358092], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 448, 76, 76], “float32”], [“TENSOR”, [128, 448, 1, 1], “float32”], [1, 1], [0, 0, 0, 
0], [1, 1], “float32”], {}], “config”: {“index”: 11033615, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 4, 8]], [“tile_y”, “sp”, [-1, 1, 38, 2]], [“tile_x”, “sp”, [-1, 1, 1, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0002916576], 0, 0.8454580307006836, 1589265708.6270657], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 76, 76], “float32”], [“TENSOR”, [64, 128, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 4604417, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 4, 2]], [“tile_y”, “sp”, [-1, 4, 19, 1]], [“tile_x”, “sp”, [-1, 1, 4, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[5.4847200000000006e-05], 0, 0.9606702327728271, 1589266682.609816], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 32, 152, 152], “float32”], [“TENSOR”, [64, 32, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 11386728, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 16, 1]], [“tile_y”, “sp”, [-1, 1, 2, 2]], [“tile_x”, “sp”, [-1, 1, 4, 1]], [“tile_rc”, “sp”, [-1, 8]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 1]]}, “result”: [[6.228479999999999e-05], 0, 0.7982416152954102, 1589267265.0845234], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 3, 608, 608], “float32”], [“TENSOR”, [16, 3, 7, 7], “float32”], [1, 1], [3, 3, 3, 3], [1, 1], “float32”], {}], “config”: {“index”: 331287, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 1, 8, 
1]], [“tile_x”, “sp”, [-1, 38, 16, 2]], [“tile_rc”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, "result": [[0.07371293322794117], 0, 251.75588297843933, 1589281670.814984], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 3, 608, 608], “float32”], [“TENSOR”, [16, 3, 7, 7], “float32”], [1, 1], [3, 3, 3, 3], [1, 1], “float32”], {}], “config”: {“index”: 45898438, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 2, 8]], [“tile_y”, “sp”, [-1, 1, 8, 2]], [“tile_x”, “sp”, [-1, 1, 8, 1]], [“tile_rc”, “sp”, [-1, 1]], [“tile_ry”, “sp”, [-1, 7]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0006671148], 0, 0.9300704002380371, 1589286987.3254607], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 16, 608, 608], “float32”], [“TENSOR”, [16, 16, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 92131, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 4, 4, 1]], [“tile_x”, “sp”, [-1, 4, 76, 1]], [“tile_rc”, “sp”, [-1, 8]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0011753858], 0, 2.777233600616455, 1589288302.142365], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 16, 608, 608], “float32”], [“TENSOR”, [16, 16, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 202675911, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 4, 1]], [“tile_y”, “sp”, [-1, 1, 2, 4]], [“tile_x”, “sp”, [-1, 1, 16, 2]], [“tile_rc”, “sp”, [-1, 1]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0004754108], 0, 0.9961037635803223, 1589292253.6476617], “version”: 
0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 16, 608, 608], “float32”], [“TENSOR”, [32, 16, 3, 3], “float32”], [2, 2], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 105832821, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 8, 4]], [“tile_y”, “sp”, [-1, 2, 2, 1]], [“tile_x”, “sp”, [-1, 1, 4, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 512], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.00031980699999999997], 0, 0.9922704696655273, 1589294893.7663195], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 32, 304, 304], “float32”], [“TENSOR”, [64, 32, 3, 3], “float32”], [2, 2], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 74996003, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 16, 2]], [“tile_y”, “sp”, [-1, 2, 2, 1]], [“tile_x”, “sp”, [-1, 1, 4, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0002477788], 0, 0.7339715957641602, 1589297709.4476457], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 64, 152, 152], “float32”], [“TENSOR”, [64, 64, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 35742, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 8, 4, 2]], [“tile_x”, “sp”, [-1, 2, 38, 1]], [“tile_rc”, “sp”, [-1, 16]], [“auto_unroll_max_step”, “ot”, 0], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0003424414], 0, 2.269379138946533, 1589301561.3000855], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 64, 152, 152], “float32”], [“TENSOR”, [64, 64, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], 
{}], “config”: {“index”: 84138815, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 16, 2]], [“tile_y”, “sp”, [-1, 1, 2, 4]], [“tile_x”, “sp”, [-1, 1, 4, 2]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0003803212], 0, 2.121492385864258, 1589307541.1549618], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 152, 152], “float32”], [“TENSOR”, [64, 128, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 23926248, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 4, 16, 1]], [“tile_y”, “sp”, [-1, 1, 2, 2]], [“tile_x”, “sp”, [-1, 1, 4, 2]], [“tile_rc”, “sp”, [-1, 16]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.000132805], 0, 0.9597022533416748, 1589309691.048028], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 128, 152, 152], “float32”], [“TENSOR”, [64, 128, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 102942, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 8, 4, 2]], [“tile_x”, “sp”, [-1, 2, 38, 1]], [“tile_rc”, “sp”, [-1, 16]], [“auto_unroll_max_step”, “ot”, 128], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0005243506], 0, 2.801717758178711, 1589311534.292469], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 128, 152, 152], “float32”], [“TENSOR”, [64, 128, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 48198555, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 64, 1]], [“tile_y”, “sp”, [-1, 4, 1, 2]], [“tile_x”, “sp”, [-1, 1, 2, 4]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, 
[-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0005996624], 0, 1.1435577869415283, 1589316195.7178435], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw_winograd.cuda”, [[“TENSOR”, [1, 64, 152, 152], “float32”], [“TENSOR”, [256, 64, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 532189, “code_hash”: null, “entity”: [[“tile_b”, “sp”, [-1, 1, 1, 1]], [“tile_y”, “sp”, [-1, 16, 4, 2]], [“tile_x”, “sp”, [-1, 2, 38, 1]], [“tile_rc”, “sp”, [-1, 16]], [“auto_unroll_max_step”, “ot”, 128], [“unroll_explicit”, “ot”, 1]]}, “result”: [[0.0008888456], 0, 2.8196074962615967, 1589320086.2292197], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 64, 152, 152], “float32”], [“TENSOR”, [256, 64, 3, 3], “float32”], [1, 1], [1, 1, 1, 1], [1, 1], “float32”], {}], “config”: {“index”: 82904121, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 2, 32, 1]], [“tile_y”, “sp”, [-1, 2, 1, 4]], [“tile_x”, “sp”, [-1, 1, 4, 2]], [“tile_rc”, “sp”, [-1, 2]], [“tile_ry”, “sp”, [-1, 3]], [“tile_rx”, “sp”, [-1, 3]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.001153614], 0, 1.1703996658325195, 1589324333.4225006], “version”: 0.2, “tvm_version”: “0.7.dev1”} {“input”: [“cuda -model=unknown”, “conv2d_nchw.cuda”, [[“TENSOR”, [1, 256, 152, 152], “float32”], [“TENSOR”, [7, 256, 1, 1], “float32”], [1, 1], [0, 0, 0, 0], [1, 1], “float32”], {}], “config”: {“index”: 521403, “code_hash”: null, “entity”: [[“tile_f”, “sp”, [-1, 1, 1, 7]], [“tile_y”, “sp”, [-1, 1, 1, 2]], [“tile_x”, “sp”, [-1, 1, 152, 1]], [“tile_rc”, “sp”, [-1, 4]], [“tile_ry”, “sp”, [-1, 1]], [“tile_rx”, “sp”, [-1, 1]], [“auto_unroll_max_step”, “ot”, 1500], [“unroll_explicit”, “ot”, 0]]}, “result”: [[0.0001050018], 0, 0.783576488494873, 1589327811.3064663], “version”: 0.2, “tvm_version”: 
“0.7.dev1”}

  1. 500 trials is insufficient for GPU. You should first try 3,000 or even 4,000.

  2. The workload you marked ([["TENSOR", [1, 3, 608, 608], "float32"], ["TENSOR", [16, 3, 7, 7], "float32"], [1, 1], [3, 3, 3, 3], [1, 1], "float32"]) actually has two implementations:

  • Winograd conv2d. Latency 0.07 sec.

{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [1, 3, 608, 608], "float32"], ["TENSOR", [16, 3, 7, 7], "float32"], [1, 1], [3, 3, 3, 3], [1, 1], "float32"], {}], "config": {"index": 331287, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 1]], ["tile_x", "sp", [-1, 38, 16, 2]], ["tile_rc", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.07371293322794117], 0, 251.75588297843933, 1589281670.814984], "version": 0.2, "tvm_version": "0.7.dev1"}

  • Direct conv2d. Latency 0.00067 sec.

{"input": ["cuda -model=unknown", "conv2d_nchw.cuda", [["TENSOR", [1, 3, 608, 608], "float32"], ["TENSOR", [16, 3, 7, 7], "float32"], [1, 1], [3, 3, 3, 3], [1, 1], "float32"], {}], "config": {"index": 45898438, "code_hash": null, "entity": [["tile_f", "sp", [-1, 1, 2, 8]], ["tile_y", "sp", [-1, 1, 8, 2]], ["tile_x", "sp", [-1, 1, 8, 1]], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 7]], ["tile_rx", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 1]]}, "result": [[0.0006671148], 0, 0.9300704002380371, 1589286987.3254607], "version": 0.2, "tvm_version": "0.7.dev1"}

This means direct conv2d suits your workload better than Winograd. Note that when you build the model with this log, AutoTVM picks the best-performing record for each workload. As a result, you can just ignore the Winograd one; this workload should take only about 0.00067 seconds in your model.
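To put the two latencies on the GFLOPS scale mentioned earlier in the thread, here is a rough back-of-the-envelope sketch. It assumes an NCHW conv2d with dilation 1 and counts one multiply-add as 2 FLOPs; `conv2d_flops` is a hypothetical helper written for this post, not a TVM function, and AutoTVM's own GFLOPS figure may be computed slightly differently.

```python
def conv2d_flops(data_shape, kernel_shape, strides, padding):
    """Approximate FLOPs of an NCHW conv2d (dilation 1, multiply-add = 2 FLOPs)."""
    n, ci, h, w = data_shape
    co, _, kh, kw = kernel_shape
    pad_top, pad_left, pad_bottom, pad_right = padding
    # Output spatial size from input size, padding, kernel size and stride.
    oh = (h + pad_top + pad_bottom - kh) // strides[0] + 1
    ow = (w + pad_left + pad_right - kw) // strides[1] + 1
    return 2 * n * co * oh * ow * ci * kh * kw


# Shapes, strides and padding are copied from the 7x7 workload in the log.
flops = conv2d_flops([1, 3, 608, 608], [16, 3, 7, 7], [1, 1], [3, 3, 3, 3])

# Mean latencies (seconds) are the "result" values of the two records.
latencies = {"winograd": 0.07371293322794117, "direct": 0.0006671148}
gflops = {name: flops / seconds / 1e9 for name, seconds in latencies.items()}

for name, value in gflops.items():
    print(f"{name}: {value:.1f} GFLOPS")
```

Under these assumptions the direct schedule comes out roughly two orders of magnitude ahead of the Winograd one for this layer, which matches the latency gap in the two records.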

“As a result, you can just ignore the Winograd one”

What should I do to ignore the Winograd record? This log is the final best-config log generated for the 43 layers. Shouldn't the two entries you mentioned above be two different layers?

They are the same layer because they have the same workload. You found two lines in the optimal log because they are different “tuning tasks”.

When you build the model, you should have the following:

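    from tvm import autotvm, relay

    # log_file, mod, target and params come from your own tuning script.
    # Inside this context, the best record in log_file is used for each workload.
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with relay.build_config(opt_level=3):
            graph, lib, params = relay.build_module.build(
                mod, target=target, params=params)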

autotvm.apply_history_best(log_file) selects the best record for a workload when the log file contains several. In other words, you don't have to do anything. You can also try manually removing the Winograd line and building the model again. It should give you the same end-to-end performance.
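To make the selection idea concrete, here is a simplified pure-Python sketch. It is not TVM's actual implementation: `best_records` is a hypothetical helper that assumes the on-disk log holds one JSON object per line, and it groups records by target plus the tensor/stride/padding arguments so that Winograd and direct records for the same layer compete with each other.

```python
import json


def best_records(lines):
    """Map each workload to (task_name, mean_seconds) of its fastest valid record."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        target, task_name, args = rec["input"][0], rec["input"][1], rec["input"][2]
        costs, error_no = rec["result"][0], rec["result"][1]
        if error_no != 0:
            continue  # skip measurements that failed
        mean_cost = sum(costs) / len(costs)
        # Same target, tensors, strides and padding means the same layer.
        key = (target, json.dumps(args))
        if key not in best or mean_cost < best[key][1]:
            best[key] = (task_name, mean_cost)
    return best


# The two records for the 7x7 layer, reduced to the fields this sketch reads.
args = [["TENSOR", [1, 3, 608, 608], "float32"], ["TENSOR", [16, 3, 7, 7], "float32"],
        [1, 1], [3, 3, 3, 3], [1, 1], "float32"]
log = [
    json.dumps({"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", args, {}],
                "result": [[0.07371293322794117], 0, 251.75588297843933, 1589281670.814984]}),
    json.dumps({"input": ["cuda -model=unknown", "conv2d_nchw.cuda", args, {}],
                "result": [[0.0006671148], 0, 0.9300704002380371, 1589286987.3254607]}),
]

selected = best_records(log)
for task_name, seconds in selected.values():
    print(task_name, seconds)
```

With the two records from this thread, only the `conv2d_nchw.cuda` entry survives, which is why removing the Winograd line by hand should not change the result.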

Thus, if the end-to-end model performance is not good enough, it might not be due to the layer you pointed out. You should try to use debug runtime to locate the performance bottleneck first.
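A minimal sketch of that step, assuming the TVM 0.7-era API where the debug runtime lives in `tvm.contrib.debugger.debug_runtime` (later releases renamed it) and assuming `graph`, `lib` and `params` are the values returned by the build call above. `profile_layers`, `input_name`, `input_data` and the `dump_root` path are names made up for this example.

```python
def profile_layers(graph, lib, params, input_name, input_data, dump_root="/tmp/tvmdbg"):
    """Run the compiled model once under the debug runtime to get per-op timings."""
    # Imported here so the sketch can be read and loaded without TVM installed.
    import tvm
    from tvm.contrib.debugger import debug_runtime

    ctx = tvm.gpu(0)  # assumption: the model was built for the first CUDA device
    module = debug_runtime.create(graph, lib, ctx, dump_root=dump_root)
    module.set_input(input_name, input_data)  # input_data: a NumPy array
    module.set_input(**params)
    module.run()  # the debug runtime prints a per-node timing table
    return module
```

The slowest rows of the printed table are the place to look first; they may turn out to be operators other than the 7x7 convolution.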

“You should try to use debug runtime to locate the performance bottleneck first.”

I am already building the model with the code you showed above. What should I do next? I'm completely lost.