[AutoTVM] ResNet50 and MobileNetV2 after AutoTVM tuning are much slower than the optimized assembly code on ARM Cortex-A53

We haven't tried the NCHWc schedule on our workload before; I thought it was only supported on x86. Could you share how to apply the NCHWc schedule on an ARM CPU?

No, it is still useful on ARM CPUs too. I don't think you need to do much work for ARM; you can port it from x86 to ARM quite simply. However, if we want to get the most out of NCHWc, we should make the NCHWc schedule support depthwise convolution, which removes the extra data layout transformations (a depthwise convolution's output is NCHW, but conv2d NCHWc requires NCHWc input, so a layout transformation gets inserted between them). I have done this and am writing a blog post to introduce it, but the post will not cover the details, only explain the idea.
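To make the layout issue concrete, here is a small numpy-only sketch (not TVM code, just an illustration) of what the NCHW[x]c layout is: the channel axis C is split into C/c blocks with a small inner factor c, which is what the conv2d NCHWc templates consume and what a plain NCHW depthwise output would have to be transformed into:

import numpy as np

# Illustration only: block the channel axis of an NCHW tensor with an
# inner factor c (here c = 4) to obtain the NCHW[x]c layout.
n, C, h, w, c = 1, 16, 8, 8, 4
nchw = np.random.rand(n, C, h, w).astype('float32')

# (N, C, H, W) -> (N, C//c, c, H, W) -> (N, C//c, H, W, c)
nchwc = nchw.reshape(n, C // c, c, h, w).transpose(0, 1, 3, 4, 2)
print(nchwc.shape)  # (1, 4, 8, 8, 4)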

Since the TVM schedule templates are hardware independent, you can directly reuse all of the x86 code for an ARM CPU by changing only the target. You can run this tutorial (https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_x86.html) on an ARM CPU by changing target and measure_option; this tutorial uses the NCHWc template. A minimal sketch of those two changes is shown below.
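Roughly (a sketch, assuming you already run an RPC tracker and have registered the board under some key; the key, host, and port below are placeholders):

import tvm
from tvm import autotvm

# Cross-compile for a 64-bit ARM CPU instead of the default x86 target.
target = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')

# Build locally (build_func='ndk' for Android, the default for Linux boards)
# and run the measurements on the remote board through the RPC tracker.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(build_func='ndk'),
    runner=autotvm.RPCRunner(
        'my-device-key',   # placeholder: key used when registering the board
        host='127.0.0.1',  # placeholder: RPC tracker host
        port=9190,         # placeholder: RPC tracker port
        number=5,
        timeout=10,
    ),
)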

I think one goal is to merge the code for x86 and ARM CPUs, and let users or the autotuner choose the best implementation.

@sjtumdlong
I have implemented a related depthwise convolution optimization. On my MTK6763 (Cortex-A53) CPU it achieves roughly 2x performance on the MobileNet V1 depthwise convolutions.

Currently:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.

After my optimization:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.


Is the depthwise convolution optimization supported on the master branch of TVM? Could you share how to apply your optimization? Does the NCHWc schedule support depthwise convolution now?

I haven't contributed it back to master yet. I will do it soon, and then you can apply it.

I have also done an NCHWc schedule for depthwise convolution; however, I haven't contributed that back to master yet either. The performance above is not related to the NCHWc schedule.

Hi, I tried to apply the NCHWc schedule by reusing the x86 code for an ARM CPU. I modified target and measure_option in tune_nnvm_x86.py as below.

target = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')

tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func='ndk'),
        runner=autotvm.RPCRunner(
            key, host, port,  # device key, tracker host and port (defined earlier in the script)
            number=5,
            timeout=1e9,
        ),
    ),
}

But when I ran auto-tuning with the NCHWc schedule, it failed to execute on my device. I can run tune_nnvm_arm.py successfully. Error message:

DEBUG:autotvm:No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.684298) [('tile_ic', [1, 3]), ('tile_oc', [2, 32]), ('tile_ow', [4, 28]), ('unroll_kw', True)],None,109
DEBUG:autotvm:No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759403) [('tile_ic', [1, 3]), ('tile_oc', [2, 32]), ('tile_ow', [56, 2]), ('unroll_kw', False)],None,151
DEBUG:autotvm:No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759512) [('tile_ic', [1, 3]), ('tile_oc', [64, 1]), ('tile_ow', [4, 28]), ('unroll_kw', False)],None,225
DEBUG:autotvm:No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759587) [('tile_ic', [1, 3]), ('tile_oc', [64, 1]), ('tile_ow', [2, 56]), ('unroll_kw', False)],None,239
DEBUG:autotvm:No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759663) [('tile_ic', [3, 1]), ('tile_oc', [16, 4]), ('tile_ow', [7, 16]), ('unroll_kw', True)],None,88

Try deleting -device=arm_cpu from the target string?
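That is (just a sketch of the suggestion), keep the aarch64 cross-compilation triple but drop the device tag:

# Without -device=arm_cpu the generic CPU (x86-style) schedule templates are
# dispatched instead of the arm_cpu-specific ones, while the binary is still
# cross-compiled for aarch64.
target = tvm.target.create('llvm -target=aarch64-linux-gnu')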

@sjtumdlong You can use this PR to improve depthwise convolution performance: https://github.com/dmlc/tvm/pull/2028

Note that you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'), in your AutoTVM tuning script.
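In the tuning loop of the tutorials that change looks roughly like this (a sketch; tasks, n_trial, early_stopping, measure_option, and log_filename come from the surrounding tuning script):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

for i, tsk in enumerate(tasks):
    # Use 'knob' features instead of the default itervar features.
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(
        n_trial=min(n_trial, len(tsk.config_space)),
        early_stopping=early_stopping,
        measure_option=measure_option,
        callbacks=[
            autotvm.callback.progress_bar(n_trial),
            autotvm.callback.log_to_file(log_filename),
        ],
    )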

OK, I will try it later. Thanks!

When doing the auto-tuning, what did you set for the ops argument in tasks = autotvm.task.extract_from_program(net, target=target, params=params, ops=?)?
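For reference (not necessarily what was used above, and the exact form depends on the TVM version), the relay-based tutorials of that era passed a tuple of operator handles here, e.g.:

from tvm import relay, autotvm

# Sketch: extract tunable tasks for the convolution operators. Depthwise
# convolutions are also expressed as nn.conv2d (with groups == channels),
# so listing conv2d covers them as well. Newer TVM versions use
# relay.op.get("nn.conv2d") instead of the function handle.
tasks = autotvm.task.extract_from_program(
    net, target=target, params=params,
    ops=(relay.op.nn.conv2d,),
)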