[AutoTVM] ResNet50 and MobileNetV2 after AutoTVM tuning are much slower than the optimized assembly code on ARM Cortex-A53

We haven't tried the NCHWc schedule on our workload before; I thought it was only supported on x86. Could you share how to apply the NCHWc schedule on an ARM CPU?

No, it is still useful on ARM CPUs too. I don't think you need to do much work for ARM; you can port it from x86 to ARM quite simply. However, if we want to get the most out of NCHWc, we should make the NCHWc schedule support depthwise convolution, which removes the extra data layout transformations (a depthwise convolution's output is NCHW, but conv2d NCHWc requires NCHWc input, so a layout transformation gets inserted between them). I have done this and am writing a blog post to introduce it, but the post will not cover the details, only explain the idea.
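To make the layout issue concrete, here is a small numpy-only sketch (not TVM code, just an illustration) of what the NCHW[x]c layout is: the channel axis C is split into C/c blocks with a small inner factor c, which is what the conv2d NCHWc templates consume and what a plain NCHW depthwise output would have to be transformed into:

import numpy as np

# Illustration only: block the channel axis of an NCHW tensor with an
# inner factor c (here c = 4) to obtain the NCHW[x]c layout.
n, C, h, w, c = 1, 16, 8, 8, 4
nchw = np.random.rand(n, C, h, w).astype('float32')

# (N, C, H, W) -> (N, C//c, c, H, W) -> (N, C//c, H, W, c)
nchwc = nchw.reshape(n, C // c, c, h, w).transpose(0, 1, 3, 4, 2)
print(nchwc.shape)  # (1, 4, 8, 8, 4)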

Since the TVM schedule templates are hardware independent, you can directly reuse all of the x86 code for an ARM CPU by changing only the target. You can run this tutorial (https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_x86.html) on an ARM CPU by changing target and measure_option; this tutorial uses the NCHWc template. A minimal sketch of those two changes is shown below.
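Roughly (a sketch, assuming you already run an RPC tracker and have registered the board under some key; the key, host, and port below are placeholders):

import tvm
from tvm import autotvm

# Cross-compile for a 64-bit ARM CPU instead of the default x86 target.
target = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')

# Build locally (build_func='ndk' for Android, the default for Linux boards)
# and run the measurements on the remote board through the RPC tracker.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(build_func='ndk'),
    runner=autotvm.RPCRunner(
        'my-device-key',   # placeholder: key used when registering the board
        host='127.0.0.1',  # placeholder: RPC tracker host
        port=9190,         # placeholder: RPC tracker port
        number=5,
        timeout=10,
    ),
)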

I think one goal is to merge the code for x86 and ARM CPUs, and let users or the autotuner choose the best implementation.

@sjtumdlong
I have implemented a related depthwise convolution optimization. On my MTK6763 (Cortex-A53) CPU it achieves roughly 2x performance on the MobileNet V1 depthwise convolutions.

Currently:
[Task 2/20] Current/Best: 0.98/ 2.32 GFLOPS | Progress: (1427/2000) | 2679.82 s Done.
[Task 4/20] Current/Best: 0.56/ 1.15 GFLOPS | Progress: (1072/2000) | 2461.27 s Done.
[Task 6/20] Current/Best: 1.08/ 2.78 GFLOPS | Progress: (1084/2000) | 1987.91 s Done.
[Task 8/20] Current/Best: 0.39/ 1.19 GFLOPS | Progress: (1815/2000) | 2744.70 s Done.
[Task 10/20] Current/Best: 1.09/ 2.33 GFLOPS | Progress: (1222/2000) | 1866.02 s Done.
[Task 12/20] Current/Best: 0.42/ 0.90 GFLOPS | Progress: (1716/2000) | 2528.94 s Done.
[Task 14/20] Current/Best: 1.89/ 2.63 GFLOPS | Progress: (1284/2000) | 2288.55 s Done.
[Task 16/20] Current/Best: 0.47/ 0.96 GFLOPS | Progress: (1467/2000) | 2282.65 s Done.
[Task 18/20] Current/Best: 1.43/ 2.61 GFLOPS | Progress: (1007/2000) | 1525.76 s Done.

After my optimization:
[Task 2/20] Current/Best: 0.00/ 4.83 GFLOPS | Progress: (1682/2000) | 1470.40 s Done.
[Task 4/20] Current/Best: 1.35/ 3.17 GFLOPS | Progress: (1257/2000) | 1032.80 s Done.
[Task 6/20] Current/Best: 2.04/ 5.49 GFLOPS | Progress: (1904/2000) | 1623.10 s Done.
[Task 8/20] Current/Best: 0.75/ 3.15 GFLOPS | Progress: (1885/2000) | 1546.22 s Done.
[Task 10/20] Current/Best: 2.09/ 6.07 GFLOPS | Progress: (2000/2000) | 1640.41 s Done.
[Task 12/20] Current/Best: 2.99/ 3.80 GFLOPS | Progress: (1853/2000) | 1547.13 s Done.
[Task 14/20] Current/Best: 4.59/ 6.06 GFLOPS | Progress: (1355/2000) | 1091.93 s Done.
[Task 16/20] Current/Best: 1.96/ 4.01 GFLOPS | Progress: (2000/2000) | 1586.18 s Done.
[Task 18/20] Current/Best: 2.33/ 4.63 GFLOPS | Progress: (2000/2000) | 1599.89 s Done.


Is the depthwise convolution optimization supported on the master branch of TVM? Could you share how to apply your optimization? Does the NCHWc schedule support depthwise convolution now?

I haven't contributed it back to master yet. I will do it soon, and then you can apply it.

I have also done an NCHWc schedule for depthwise convolution; however, I haven't contributed that back to master yet either. The performance above is not related to the NCHWc schedule.

Hi, I tried to apply the NCHWc schedule by reusing the x86 code for an ARM CPU. I modified target and measure_option in tune_nnvm_x86.py as below.

target = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')

tuning_option = {
    'log_filename': log_file,
    'tuner': 'random',
    'early_stopping': None,

    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func='ndk'),
        runner=autotvm.RPCRunner(
            key, host, port,  # device key, tracker host and port (defined earlier in the script)
            number=5,
            timeout=1e9,
        ),
    ),
}

But when I ran auto-tuning with the NCHWc schedule, it failed to execute on my device. I can run tune_nnvm_arm.py successfully. Error message:

DEBUG:autotvm:No: 1 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.684298) [('tile_ic', [1, 3]), ('tile_oc', [2, 32]), ('tile_ow', [4, 28]), ('unroll_kw', True)],None,109
DEBUG:autotvm:No: 2 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759403) [('tile_ic', [1, 3]), ('tile_oc', [2, 32]), ('tile_ow', [56, 2]), ('unroll_kw', False)],None,151
DEBUG:autotvm:No: 3 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759512) [('tile_ic', [1, 3]), ('tile_oc', [64, 1]), ('tile_ow', [4, 28]), ('unroll_kw', False)],None,225
DEBUG:autotvm:No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759587) [('tile_ic', [1, 3]), ('tile_oc', [64, 1]), ('tile_ow', [2, 56]), ('unroll_kw', False)],None,239
DEBUG:autotvm:No: 5 GFLOPS: 0.00/0.00 result: MeasureResult(costs=('',), error_no=7, all_cost=1000000000.0, timestamp=1540631710.759663) [('tile_ic', [3, 1]), ('tile_oc', [16, 4]), ('tile_ow', [7, 16]), ('unroll_kw', True)],None,88

Try deleting -device=arm_cpu from the target string?
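That is (just a sketch of the suggestion), keep the aarch64 cross-compilation triple but drop the device tag:

# Without -device=arm_cpu the generic CPU (x86-style) schedule templates are
# dispatched instead of the arm_cpu-specific ones, while the binary is still
# cross-compiled for aarch64.
target = tvm.target.create('llvm -target=aarch64-linux-gnu')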

@sjtumdlong You can use this PR to improve depthwise convolution performance: https://github.com/dmlc/tvm/pull/2028

Note that you MUST set the XGBTuner constructor's feature type argument to feature_type='knob', i.e. XGBTuner(tsk, loss_type='rank', feature_type='knob'), in your AutoTVM tuning script.
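In the tuning loop of the tutorials that change looks roughly like this (a sketch; tasks, n_trial, early_stopping, measure_option, and log_filename come from the surrounding tuning script):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

for i, tsk in enumerate(tasks):
    # Use 'knob' features instead of the default itervar features.
    tuner = XGBTuner(tsk, loss_type='rank', feature_type='knob')
    tuner.tune(
        n_trial=min(n_trial, len(tsk.config_space)),
        early_stopping=early_stopping,
        measure_option=measure_option,
        callbacks=[
            autotvm.callback.progress_bar(n_trial),
            autotvm.callback.log_to_file(log_filename),
        ],
    )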

OK, I will try it later. Thanks!

When doing the auto-tuning, what did you set for the ops argument in tasks = autotvm.task.extract_from_program(net, target=target, params=params, ops=?)?
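For reference (not necessarily what was used above, and the exact form depends on the TVM version), the relay-based tutorials of that era passed a tuple of operator handles here, e.g.:

from tvm import relay, autotvm

# Sketch: extract tunable tasks for the convolution operators. Depthwise
# convolutions are also expressed as nn.conv2d (with groups == channels),
# so listing conv2d covers them as well. Newer TVM versions use
# relay.op.get("nn.conv2d") instead of the function handle.
tasks = autotvm.task.extract_from_program(
    net, target=target, params=params,
    ops=(relay.op.nn.conv2d,),
)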