[AutoTVM] Resnet50 and MobileNetv2 after AutoTVM tuning are much slower than our optimized assembly code on ARM Cortex-A53

Hi. When I use AutoTVM to tune Resnet50 and MobileNetv2 on an ARM CPU (a single A53 core on RK3399), I find that the inference time is no better than that of the code we optimized with assembly ourselves: Resnet50 inference on the ARM Cortex-A53 is 1.036x slower and MobileNetv2 inference is 1.613x slower. After profiling, I find two main reasons:

  1. TVM conv2d time cost: conv1(1) is 1.3x slower and depthwise conv3(1) is 2.9x slower than our code. The workloads are attached at the end.
  2. I take res.costs in the tuning log as the time cost of conv2d. The conv2d time ratio in TVM is 72% for Resnet50 and 69% for MobileNetv2, while the conv time ratio in our code is 96.3% and 85.3% respectively. I wonder whether the rest includes the cost of data transformation and so on. Why is the conv2d time ratio in TVM so much lower? Am I doing something wrong when profiling?
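For reference, this is how I read the cost out of each log record. A minimal sketch in plain Python, assuming the AutoTVM log format shown later in this post, where the `r` field is `[costs, error_no, all_cost, timestamp]` (the sample record below is trimmed to just the fields used):

```python
import json

def mean_cost_ms(log_line):
    """Return the mean measured cost (in ms) stored in the 'r' field of one
    AutoTVM log record; r = [costs, error_no, all_cost, timestamp]."""
    record = json.loads(log_line)
    costs, error_no, _, _ = record["r"]
    if error_no != 0:  # failed measurement, no valid cost
        return None
    return sum(costs) / len(costs) * 1000.0

# one record trimmed down to the fields used here (hypothetical sample)
sample = '{"i": [], "r": [[0.010285772], 0, 0.333, 1539141548.85], "v": 0.1}'
print(mean_cost_ms(sample))  # ~10.29 ms
```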

How can I improve the performance? Could you give some advice? Thanks in advance.

Test Environment: single A53 on rk3399, frequency is fixed to be 1.008 GHz
Model: Resnet50
Our code optimized with hand-written assembly (ms): 2462
TVM autotune (n_trial=2000, early_stopping=1000, opt_level=3, timeout=1e9) (ms): 2549.65
Speedup: 0.9656x

Model: MobileNetv2
Our code optimized with hand-written assembly (ms): 244.82
TVM autotune (n_trial=2000, early_stopping=1000, opt_level=3, timeout=1e9) (ms): 394.89
Speedup: 0.62x
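The speedup numbers above are just the assembly time divided by the TVM time (so values below 1 mean TVM is slower); a quick check:

```python
# speedup = hand-written assembly time / TVM time (>1 would mean TVM is faster)
asm_ms = {"resnet50": 2462.0, "mobilenetv2": 244.82}
tvm_ms = {"resnet50": 2549.65, "mobilenetv2": 394.89}

for model in asm_ms:
    speedup = asm_ms[model] / tvm_ms[model]
    print(f"{model}: {speedup:.4f}x")
```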

Resnet50 tune log (after pick_best):

{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 7, 7], "float32"], [2, 2], [3, 3], "NCHW", "float32"], {}, ["conv2d", [1, 3, 224, 224, "float32"], [64, 3, 7, 7, "float32"], [2, 2], [3, 3], "NCHW", "float32"], {"i": 77282, "c": null, "e": [["tile_co", "sp", [16, 4]], ["tile_oh", "sp", [112, 1]], ["tile_ow", "sp", [14, 8]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]]], "t": "direct"}], "r": [[0.074890018], 0, 1.7477071285247803, 1539139066.319467], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 20001, "c": null, "e": [["tile_co", "sp", [16, 4]], ["tile_oh", "sp", [28, 2]], ["tile_ow", "sp", [4, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "none"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.010285772], 0, 0.3331129550933838, 1539141548.854048], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [64, 64, 3, 3], "float32"], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [64, 64, 3, 3, "float32"], [1, 1], [1, 1], "NCHW", "float32"], {"i": 712, "c": null, "e": [["tile_p", "sp", [49, 4]], ["tile_k", "sp", [16, 4]], ["tile_c", "sp", [8, 8]], ["ann_reduce", "an", ["unroll"]], ["ann_spatial", "an", ["none", "vec"]]], "t": "winograd"}], "r": [[0.04315077525], 0, 1.4104440212249756, 1539142745.342937], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 64, 56, 56], "float32"], ["TENSOR", [256, 64, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 64, 56, 56, "float32"], [256, 64, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 24483, "c": null, "e": [["tile_co", "sp", [32, 8]], ["tile_oh", "sp", [56, 1]], ["tile_ow", "sp", [7, 8]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "unroll"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.03853981625], 0, 0.5232648849487305, 1539145301.373015], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 256, 56, 56], "float32"], ["TENSOR", [64, 256, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 256, 56, 56, "float32"], [64, 256, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 36067, "c": null, "e": [["tile_co", "sp", [8, 8]], ["tile_oh", "sp", [56, 1]], ["tile_ow", "sp", [7, 8]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "none"]], ["ann_spatial", "an", ["unroll", "none", "vec"]]], "t": "direct"}], "r": [[0.03775319125], 0, 0.6328468322753906, 1539147075.391883], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 256, 56, 56], "float32"], ["TENSOR", [128, 256, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 256, 56, 56, "float32"], [128, 256, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 30723, "c": null, "e": [["tile_co", "sp", [16, 8]], ["tile_oh", "sp", [28, 1]], ["tile_ow", "sp", [2, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "unroll"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]]], "t": "direct"}], "r": [[0.01797935575], 0, 0.456265926361084, 1539148968.302433], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [128, 128, 3, 3], "float32"], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [128, 128, 3, 3, "float32"], [1, 1], [1, 1], "NCHW", "float32"], {"i": 695, "c": null, "e": [["tile_p", "sp", [7, 7]], ["tile_k", "sp", [32, 4]], ["tile_c", "sp", [8, 16]], ["ann_reduce", "an", ["unroll"]], ["ann_spatial", "an", ["vec", "none"]]], "t": "winograd"}], "r": [[0.035037045], 0, 1.7447781562805176, 1539150881.999021], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 128, 28, 28], "float32"], ["TENSOR", [512, 128, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 128, 28, 28, "float32"], [512, 128, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 38312, "c": null, "e": [["tile_co", "sp", [128, 4]], ["tile_oh", "sp", [4, 7]], ["tile_ow", "sp", [7, 4]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "unroll"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]]], "t": "direct"}], "r": [[0.0323195865], 0, 0.6552119255065918, 1539153410.616602], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 256, 56, 56], "float32"], ["TENSOR", [512, 256, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 256, 56, 56, "float32"], [512, 256, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 16083, "c": null, "e": [["tile_co", "sp", [64, 8]], ["tile_oh", "sp", [28, 1]], ["tile_ow", "sp", [2, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "none"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.068800382], 0, 0.797644853591919, 1539156560.553953], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 512, 28, 28], "float32"], ["TENSOR", [128, 512, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 512, 28, 28, "float32"], [128, 512, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 12109, "c": null, "e": [["tile_co", "sp", [4, 32]], ["tile_oh", "sp", [14, 2]], ["tile_ow", "sp", [28, 1]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "unroll"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.03443890975], 0, 0.6284029483795166, 1539159635.885853], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 512, 28, 28], "float32"], ["TENSOR", [256, 512, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 512, 28, 28, "float32"], [256, 512, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 5871, "c": null, "e": [["tile_co", "sp", [32, 8]], ["tile_oh", "sp", [14, 1]], ["tile_ow", "sp", [1, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "none"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.01643840775], 0, 0.542029857635498, 1539162443.906005], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [256, 256, 3, 3], "float32"], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [256, 256, 3, 3, "float32"], [1, 1], [1, 1], "NCHW", "float32"], {"i": 518, "c": null, "e": [["tile_p", "sp", [2, 8]], ["tile_k", "sp", [32, 8]], ["tile_c", "sp", [256, 1]], ["ann_reduce", "an", ["none"]], ["ann_spatial", "an", ["none", "vec"]]], "t": "winograd"}], "r": [[0.0290548155], 0, 1.66587495803833, 1539165118.931159], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 256, 14, 14], "float32"], ["TENSOR", [1024, 256, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 256, 14, 14, "float32"], [1024, 256, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 18449, "c": null, "e": [["tile_co", "sp", [256, 4]], ["tile_oh", "sp", [7, 2]], ["tile_ow", "sp", [1, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "none"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]]], "t": "direct"}], "r": [[0.03087116975], 0, 0.6397318840026855, 1539169079.866169], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 512, 28, 28], "float32"], ["TENSOR", [1024, 512, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 512, 28, 28, "float32"], [1024, 512, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 4007, "c": null, "e": [["tile_co", "sp", [128, 8]], ["tile_oh", "sp", [14, 1]], ["tile_ow", "sp", [1, 14]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["none", "none", "vec"]]], "t": "direct"}], "r": [[0.06387566275], 0, 0.8418159484863281, 1539171836.658565], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 1024, 14, 14], "float32"], ["TENSOR", [256, 1024, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 14, 14, "float32"], [256, 1024, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 3182, "c": null, "e": [["tile_co", "sp", [8, 32]], ["tile_oh", "sp", [7, 2]], ["tile_ow", "sp", [14, 1]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["none", "none", "vec"]]], "t": "direct"}], "r": [[0.0322977115], 0, 1.2555630207061768, 1539175528.747063], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 1024, 14, 14], "float32"], ["TENSOR", [512, 1024, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 14, 14, "float32"], [512, 1024, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 1863, "c": null, "e": [["tile_co", "sp", [64, 8]], ["tile_oh", "sp", [7, 1]], ["tile_ow", "sp", [1, 7]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.016050637], 0, 0.6193010807037354, 1539178288.041898], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [512, 512, 3, 3], "float32"], [1, 1], [1, 1], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [512, 512, 3, 3, "float32"], [1, 1], [1, 1], "NCHW", "float32"], {"i": 404, "c": null, "e": [["tile_p", "sp", [1, 4]], ["tile_k", "sp", [32, 16]], ["tile_c", "sp", [256, 2]], ["ann_reduce", "an", ["unroll"]], ["ann_spatial", "an", ["none", "vec"]]], "t": "winograd"}], "r": [[0.02539016925], 0, 3.192999839782715, 1539180726.346742], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 512, 7, 7], "float32"], ["TENSOR", [2048, 512, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 512, 7, 7, "float32"], [2048, 512, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 4155, "c": null, "e": [["tile_co", "sp", [256, 8]], ["tile_oh", "sp", [7, 1]], ["tile_ow", "sp", [1, 7]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["unroll", "unroll"]], ["ann_spatial", "an", ["unroll", "none", "vec"]]], "t": "direct"}], "r": [[0.03523997225], 0, 0.71940016746521, 1539184531.843269], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 1024, 14, 14], "float32"], ["TENSOR", [2048, 1024, 1, 1], "float32"], [2, 2], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 1024, 14, 14, "float32"], [2048, 1024, 1, 1, "float32"], [2, 2], [0, 0], "NCHW", "float32"], {"i": 5259, "c": null, "e": [["tile_co", "sp", [256, 8]], ["tile_oh", "sp", [7, 1]], ["tile_ow", "sp", [1, 7]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 9, 7, 8]], ["ann_reduce", "an", ["unroll", "none"]], ["ann_spatial", "an", ["unroll", "unroll", "vec"]]], "t": "direct"}], "r": [[0.06342715225], 0, 0.8785848617553711, 1539187590.13863], "v": 0.1}
{"i": ["llvm -device=arm_cpu -model=rk3399 -target=aarch64-linux-gnu", "topi_nn_conv2d", [["TENSOR", [1, 2048, 7, 7], "float32"], ["TENSOR", [512, 2048, 1, 1], "float32"], [1, 1], [0, 0], "NCHW", "float32"], {}, ["conv2d", [1, 2048, 7, 7, "float32"], [512, 2048, 1, 1, "float32"], [1, 1], [0, 0], "NCHW", "float32"], {"i": 1623, "c": null, "e": [["tile_co", "sp", [64, 8]], ["tile_oh", "sp", [7, 1]], ["tile_ow", "sp", [1, 7]], ["reorder_0", "re", [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], ["ann_reduce", "an", ["none", "none"]], ["ann_spatial", "an", ["none", "unroll", "vec"]]], "t": "direct"}], "r": [[0.03507117025], 0, 0.7908809185028076, 1539191122.870224], "v": 0.1}

Due to the character limit per post, the MobileNetv2 tune log is omitted.


Have you tried comparing the generated code with the manually written code?

I haven't compared the assembly code generated by TVM with our assembly code. I plan to try, but I want to sync with you first and make sure the result meets performance expectations.

I think your data should be accurate, according to this table: https://tvm.ai/2018/10/03/auto-opt-all.html. I wish you could share more information.

I think your result matches our expectation. It seems your assembly code is pretty good. How does it compare to other libraries such as NCNN or TF Lite?

  1. There is still much room for improvement, especially for 1x1 convolution and depthwise convolution, as you pointed out. @FrozenGene may be able to share their optimizations later this week in a blog post.
  2. Some layers appear many times, so the total time cost of conv2d should be a weighted sum over the log. Our interface doesn't currently output the repeat count of an individual layer; you can count it manually.
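That weighted sum can be sketched with `collections.Counter` once you have the list of conv2d calls in the compiled graph (the workload names and per-call costs below are illustrative, not real measurements):

```python
from collections import Counter

# (workload, per-call cost in ms) -- illustrative numbers only
calls = [
    ("conv2d_1x1_64_64", 10.3),
    ("conv2d_3x3_64_64", 43.2),
    ("conv2d_1x1_64_64", 10.3),
    ("conv2d_1x1_64_64", 10.3),
]

repeats = Counter(w for w, _ in calls)          # repeat count per workload
cost_per_call = dict(calls)                     # one cost per unique workload
total_ms = sum(repeats[w] * cost_per_call[w] for w in repeats)
print(repeats["conv2d_1x1_64_64"], total_ms)
```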

Well-optimized assembly is still very hard to beat. If you are interested, you can try inlining your assembly code into TVM functions, so you can benefit from both the TVM framework and the well-optimized assembly. This shows promising results according to https://github.com/dmlc/tvm/issues/1596#issuecomment-415894262

@FrozenGene could you share the test environment for the Firefly RK3399 results posted in https://tvm.ai/2018/10/03/auto-opt-all.html? A53 or A72, multi-core or not, frequency, and autotune config? Then we can compare our results in the same test environment.

@merrymercy It’s better than TF lite according to our early test results.

  1. I look forward to the optimizations for 1x1 convolution and depthwise convolution. If they can achieve performance comparable to well-optimized assembly, they will reduce the overhead of manual optimization.
  2. I have already counted the total time cost of conv2d, taking into account the repeat count of each layer. For example, the total conv2d time in Resnet50 is 1836.68 ms while the total Resnet50 time is 2549.65 ms. I plan to perform layer-wise profiling of Resnet50 and MobileNetv2 constructed using NNVM symbols.

@merrymercy can answer the question about the environment.

See benchmark here https://github.com/dmlc/tvm/wiki/Benchmark#arm-cpu
TVM will download tuned configs to ~/.tvm/tophub/arm_cpu_v0.03.log

For layer-wise profiling, try this debug mode https://docs.tvm.ai/dev/debugger.html?highlight=debug

@merrymercy
I will try the debug mode and sync up if we have further results. Thanks!

@merrymercy
After profiling with the debug mode, the time ratio of conv2d in TVM is above 95% for Resnet50 and MobileNetv2. Now the time cost of conv1(1) and depthwise conv3(1) is the main problem.

@sjtumdlong
Could you share the conv1(1) and depthwise conv3(1) workloads? If it is a heavy workload, the spatial pack's data transformation will take much of the time.

@FrozenGene
The conv1(1) and depthwise conv3(1) workloads are below. What does "heavy workload" mean? In which situations does the data transformation occur?

Resnet-50 workload:
('conv2d', (1, 3, 224, 224, 'float32'), (64, 3, 7, 7, 'float32'), (2, 2), (3, 3), 'NCHW', 'float32')
('conv2d', (1, 64, 56, 56, 'float32'), (64, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 64, 56, 56, 'float32'), (64, 64, 3, 3, 'float32'), (1, 1), (1, 1), 'NCHW', 'float32')
('conv2d', (1, 64, 56, 56, 'float32'), (256, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 256, 56, 56, 'float32'), (64, 256, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 256, 56, 56, 'float32'), (128, 256, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 128, 28, 28, 'float32'), (128, 128, 3, 3, 'float32'), (1, 1), (1, 1), 'NCHW', 'float32')
('conv2d', (1, 128, 28, 28, 'float32'), (512, 128, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 256, 56, 56, 'float32'), (512, 256, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 512, 28, 28, 'float32'), (128, 512, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 512, 28, 28, 'float32'), (256, 512, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 256, 14, 14, 'float32'), (256, 256, 3, 3, 'float32'), (1, 1), (1, 1), 'NCHW', 'float32')
('conv2d', (1, 256, 14, 14, 'float32'), (1024, 256, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 512, 28, 28, 'float32'), (1024, 512, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 1024, 14, 14, 'float32'), (256, 1024, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 1024, 14, 14, 'float32'), (512, 1024, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 512, 7, 7, 'float32'), (512, 512, 3, 3, 'float32'), (1, 1), (1, 1), 'NCHW', 'float32')
('conv2d', (1, 512, 7, 7, 'float32'), (2048, 512, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 1024, 14, 14, 'float32'), (2048, 1024, 1, 1, 'float32'), (2, 2), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 2048, 7, 7, 'float32'), (512, 2048, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')

MobileNetv2 workload:
('conv2d', (1, 3, 224, 224, 'float32'), (32, 3, 3, 3, 'float32'), (2, 2), (1, 1), 'NCHW', 'float32')
('conv2d', (1, 32, 112, 112, 'float32'), (32, 32, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 32, 112, 112, 'float32'), (16, 32, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 16, 112, 112, 'float32'), (96, 16, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 96, 112, 112, 'float32'), (96, 1, 3, 3, 'float32'), (2, 2), (1, 1), 'float32')
('conv2d', (1, 96, 56, 56, 'float32'), (24, 96, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 24, 56, 56, 'float32'), (144, 24, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 144, 56, 56, 'float32'), (144, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 144, 56, 56, 'float32'), (24, 144, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 144, 56, 56, 'float32'), (144, 1, 3, 3, 'float32'), (2, 2), (1, 1), 'float32')
('conv2d', (1, 144, 28, 28, 'float32'), (32, 144, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 32, 28, 28, 'float32'), (192, 32, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 192, 28, 28, 'float32'), (192, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 192, 28, 28, 'float32'), (32, 192, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 192, 28, 28, 'float32'), (192, 1, 3, 3, 'float32'), (2, 2), (1, 1), 'float32')
('conv2d', (1, 192, 14, 14, 'float32'), (64, 192, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 64, 14, 14, 'float32'), (384, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 384, 14, 14, 'float32'), (384, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 384, 14, 14, 'float32'), (64, 384, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 384, 14, 14, 'float32'), (96, 384, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 96, 14, 14, 'float32'), (576, 96, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 576, 14, 14, 'float32'), (576, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 576, 14, 14, 'float32'), (96, 576, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 576, 14, 14, 'float32'), (576, 1, 3, 3, 'float32'), (2, 2), (1, 1), 'float32')
('conv2d', (1, 576, 7, 7, 'float32'), (160, 576, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 160, 7, 7, 'float32'), (960, 160, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('depthwise_conv2d_nchw', (1, 960, 7, 7, 'float32'), (960, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('conv2d', (1, 960, 7, 7, 'float32'), (160, 960, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 960, 7, 7, 'float32'), (320, 960, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 320, 7, 7, 'float32'), (1280, 320, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')

@sjtumdlong You listed the full workload list. What I meant was: could you list the types of workloads where TVM is slower than your assembly? For example, 1x1 convolution and so on.

A heavy workload means the convolution's input data size is big. For example, a 1x32x256x512 FP32 input is 16MB, which takes much time in the spatial pack's transformation. In the spatial pack schedule, we do the data transformation regardless of the input shape.
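A quick way to see which inputs are heavy: an NCHW float32 tensor takes N*C*H*W*4 bytes. A small helper of my own (not TVM code), applied to two input shapes from this thread:

```python
def nchw_bytes(n, c, h, w, dtype_bytes=4):
    """Memory footprint in bytes of an NCHW tensor (float32 by default)."""
    return n * c * h * w * dtype_bytes

# ResNet50 first-conv input and a MobileNetv2 depthwise input from this thread
print(nchw_bytes(1, 3, 224, 224) / 2**20)   # ~0.57 MB
print(nchw_bytes(1, 96, 112, 112) / 2**20)  # ~4.59 MB
```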

@merrymercy I list some workloads where TVM is much slower than our assembly.
conv1(1) workload:
('conv2d', (1, 64, 56, 56, 'float32'), (64, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 64, 56, 56, 'float32'), (256, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 256, 56, 56, 'float32'), (64, 256, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')
('conv2d', (1, 128, 28, 28, 'float32'), (512, 128, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32')

depthwise conv3(1) workload:
('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 144, 56, 56, 'float32'), (144, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 384, 14, 14, 'float32'), (384, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')
('depthwise_conv2d_nchw', (1, 576, 14, 14, 'float32'), (576, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32')

Do you have the time comparison? It seems that 1x1 convolution and depthwise convolution are slower.

@FrozenGene
I compare the time costs of TVM's fused convolution operators and our assembly code below.
conv1(1) workload:
('conv2d', (1, 64, 56, 56, 'float32'), (64, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32') 13.23ms vs. 8.33ms
('conv2d', (1, 64, 56, 56, 'float32'), (256, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32') 41.34ms vs. 33.17ms
('conv2d', (1, 256, 56, 56, 'float32'), (64, 256, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32') 51.32ms vs. 34.81ms

depthwise conv3(1) workload:
('depthwise_conv2d_nchw', (1, 32, 112, 112, 'float32'), (32, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32') 17.75ms vs. 3.03ms
('depthwise_conv2d_nchw', (1, 144, 56, 56, 'float32'), (144, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32') 21.32ms vs. 3.35ms
('depthwise_conv2d_nchw', (1, 384, 14, 14, 'float32'), (384, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32') 3.75ms vs. 0.48ms
('depthwise_conv2d_nchw', (1, 576, 14, 14, 'float32'), (576, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32') 5.64ms vs. 1.02ms
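To make the depthwise gap concrete, here are the slowdown factors implied by the numbers above (TVM time divided by assembly time):

```python
# (input shape CxHxW, tvm_ms, asm_ms) taken from the depthwise measurements above
depthwise = [
    ("32x112x112", 17.75, 3.03),
    ("144x56x56", 21.32, 3.35),
    ("384x14x14", 3.75, 0.48),
    ("576x14x14", 5.64, 1.02),
]

for shape, tvm_ms, asm_ms in depthwise:
    print(f"{shape}: {tvm_ms / asm_ms:.1f}x slower")  # roughly 5.5x to 7.8x
```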

It seems that depthwise conv2d has much room to improve. Have you looked at the related schedule's HalideIR / assembly code? How about your hand-written assembly code?

For conv2d, I discussed with @merrymercy last week. We found there are places we can tune. I think we can match the hand-written assembly code here. I am testing it.


I haven't compared the schedule's HalideIR / assembly code with our hand-written assembly code.
If you achieve better results on conv2d, please let me know. I will reevaluate the time cost of our models. Thanks!

Hi, just one quick update.

I have implemented a spatial_pack schedule (and some more modifications). For ('depthwise_conv2d_nchw', (1, 144, 56, 56, 'float32'), (144, 1, 3, 3, 'float32'), (1, 1), (1, 1), 'float32'), on my single-core A9 800MHz CPU, TVM's time goes from 22ms to 17ms.

For ('conv2d', (1, 64, 56, 56, 'float32'), (256, 64, 1, 1, 'float32'), (1, 1), (0, 0), 'NCHW', 'float32'), which has a heavy data size as I said before, I have quickly compared the NCHWc schedule vs. the spatial pack schedule: NCHWc's GFLOPS is better than the spatial pack's (the current implementation in TVM). I will share performance data when I finish it.
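As a side note on comparing schedules by GFLOPS: for a conv2d the FLOP count follows directly from the workload shape. A quick helper of my own (counting one multiply and one add per MAC), using the 41.34 ms TVM timing reported earlier in the thread for this workload as an example:

```python
def conv2d_gflops(n, cin, h_out, w_out, cout, kh, kw, time_ms):
    """Achieved GFLOPS of a conv2d, counting 2 FLOPs (mul + add) per MAC."""
    flops = 2 * n * cout * h_out * w_out * cin * kh * kw
    return flops / (time_ms / 1000.0) / 1e9

# 1x1 conv, stride 1, no padding: (1, 64, 56, 56) -> (1, 256, 56, 56)
print(conv2d_gflops(1, 64, 56, 56, 256, 1, 1, 41.34))  # ~2.5 GFLOPS
```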
