[TOPI] Using x86 schedules for ARM conv2d

Currently, Intel and ARM have different conv2d FP32 schedules. I tried the Intel x86 schedules on an ARM Raspberry Pi 4 device. My hypothesis was that, as long as we are not using tensorize, the schedules should be reusable, and the one with better data reuse and more prefetcher-friendly accesses should perform better on both Intel and ARM devices (given that LLVM does the right thing for us).

Intel schedules also perform data layout conversion from NCHW to NCHWc at the Relay level and reuse that layout across many conv2d ops, amortizing the conversion cost. The ARM NCHW conv2d spatial pack schedule, on the other hand, converts the data layout inside the conv2d schedule (to NHWChw and not NCHWc) for each conv2d.
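For readers unfamiliar with the NCHWc layout, here is a minimal NumPy sketch (not TVM code) of the packing that the Relay-level conversion performs; the block size of 8 is just an illustrative choice.

```python
import numpy as np

# Example NCHW activation: batch=1, 32 channels, 14x14 spatial.
data_nchw = np.random.rand(1, 32, 14, 14).astype("float32")

# Pack the channel dimension into blocks of 8 -> NCHW8c.
# Shape becomes (N, C//c, H, W, c); the innermost axis is contiguous,
# which is what the x86 schedule vectorizes over.
c_block = 8
n, c, h, w = data_nchw.shape
data_nchwc = data_nchw.reshape(n, c // c_block, c_block, h, w).transpose(0, 1, 3, 4, 2)
print(data_nchwc.shape)  # (1, 4, 14, 14, 8)
```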

Here, I present a performance comparison between the different options. My goal is to bring clarity about which of the many options works best.

Setup

Device - ARM Raspberry Pi4 - 1.5 GHz


Code Changes - https://github.com/apache/incubator-tvm/pull/5334

I used op strategy to enable both the Intel depthwise conv2d and the Intel NCHWc conv2d schedules on the ARM device. The ARM winograd schedule is very powerful, so I let op strategy and AutoTVM choose it whenever it is faster. The networks have been tuned using AutoTVM.
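For context, registering an extra implementation in the op strategy conceptually looks like the following. This is a simplified sketch, not the actual code in the PR; the exact import paths, wrapper arguments, and registration conditions vary across TVM versions.

```python
# Sketch: adding the x86 NCHWc conv2d compute/schedule as an extra option in
# an arm_cpu conv2d strategy, alongside spatial pack and winograd.
# Simplified illustration; see the linked PR for the real registration logic.
from tvm import topi
from tvm.relay.op.strategy.generic import wrap_compute_conv2d, wrap_topi_schedule

def add_x86_nchwc_option(strategy):
    # Hypothetical helper: append the x86 NCHWc implementation so AutoTVM can
    # tune and pick it on ARM as well.
    strategy.add_implementation(
        wrap_compute_conv2d(topi.x86.conv2d_NCHWc, True, True),
        wrap_topi_schedule(topi.x86.schedule_conv2d_NCHWc),
        name="conv2d_NCHWc.x86",
    )
```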


Different Options - I compare the performance between the following three options:

  • ARM sch = ARM conv2d spatial pack + ARM winograd + x86 DWC (depthwise conv2d)

  • x86 sch = x86 conv2d + ARM winograd + x86 DWC

  • Both sch = Choose the best kernel from the full tuning log

  • Best = Choose the best latency amongst the above 3 options


Evaluation

| Network | ARM sch (ms) | x86 sch (ms) | Both sch (ms) | Best sch |
|---|---|---|---|---|
| mobilenet-v1 | 90.72 | 72.46 | 84.79 | x86 sch |
| mobilenet-v2 | 70.56 | 57.39 | 62.1 | x86 sch |
| inception-v3 | 697.03 | 587.59 | 639.4 | x86 sch |
| inception-v4 | 1825.56 | 1271.27 | 1548.91 | x86 sch |
| inception-resnet-v2 | 2163.43 | 1158.03 | 1866.85 | x86 sch |
| squeezenet | 103.84 | 92.24 | 94.44795 | x86 sch |

I also compared against TFLite. TFLite is built from source, and I use 4 threads for measuring performance.

| Network | TVM Best (ms) | TFLite (ms) | Speedup over TFLite |
|---|---|---|---|
| mobilenet-v1 | 72.46 | 157 | 2.16671 |
| mobilenet-v2 | 57.39 | 128 | 2.23035 |
| inception-v3 | 587.59 | 1030 | 1.75292 |
| inception-v4 | 1271.27 | 2055 | 1.61649 |
| inception-resnet-v2 | 1158.03 | 2030 | 1.75298 |
| squeezenet | 92.24 | 200 | 2.16826 |

As @FrozenGene suggested, I am also adding a single-thread comparison:

| Network | TVM Best (ms) | TFLite (ms) | Speedup over TFLite |
|---|---|---|---|
| mobilenet-v1 | 232.55 | 279 | 1.19974 |
| mobilenet-v2 | 151.49 | 179 | 1.1816 |
| inception-v3 | 2020.66 | 2673 | 1.32284 |
| inception-v4 | 4530.12 | 5552 | 1.22557 |
| inception-resnet-v2 | 4075.64 | 5041 | 1.23686 |
| squeezenet | 301.6 | 439 | 1.45557 |
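For reproducibility, the TVM-side latencies above were collected with a standard graph runtime + time_evaluator loop. Below is a hedged sketch of that setup; the runtime-creation API differs slightly across TVM versions, the input name/shape are placeholders, and the TFLite thread count is set separately on the interpreter side (not shown).

```python
import os
import numpy as np
import tvm
from tvm.contrib import graph_runtime

# Pin the TVM runtime thread pool: "4" for the multi-threaded table above,
# "1" for the single-thread comparison. Must be set before the runtime starts.
os.environ["TVM_NUM_THREADS"] = "4"

# `graph`, `lib`, `params` are assumed to come from relay.build() with the
# tuned AutoTVM log applied; input name and shape below are placeholders.
ctx = tvm.cpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))

# time_evaluator runs the whole graph several times and reports per-repeat means.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
print("mean latency: %.2f ms" % (np.mean(np.array(ftimer().results)) * 1000))
```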

Observations

  1. The Intel x86 schedules perform best amongst all the options.
  2. Contrary to expectations, "Both sch", which picks the best kernel for each layer, performs worse than "x86 sch". I think the main reason is that mixing the Intel x86 and ARM spatial pack conv2d schedules leads to a large number of layout transforms.
  3. ARM winograd performance is awesome. Maybe we should try it on Intel x86 in a separate PR. However, it apparently has a high memory footprint: it fails at runtime for ResNet with an Out of Memory error, and removing winograd fixes the failure.

Discussion points/Next Steps

  1. Does it make sense to enable the x86 schedules for ARM and disable the conv2d NCHW spatial pack schedule? If we just add more schedule options without disabling anything, AutoTVM tuning time goes up significantly.

  2. TFLite graphs initially have NHWC data layout. I call ConvertLayout to first convert it to NCHW. Then, AlterOpLayout internally converts it to NCHWc (sketch below). There was an effort to directly improve the performance of the NHWC schedule for ARM some time back, but it seems to have been put on hold. Until we have a performant NHWC schedule, does it make sense to change the AutoTVM TFLite tutorial to use ConvertLayout?
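As a concrete illustration of that pipeline, a minimal sketch of importing a TFLite model and applying ConvertLayout before compilation could look like the following. Treat it as an assumption rather than the exact code in the PR: the shape/dtype dicts are placeholders, and the ConvertLayout signature has changed across TVM versions (older ones take a single layout string, newer ones a per-op dict).

```python
import tvm
from tvm import relay

# `tflite_model` is the parsed TFLite flatbuffer; input name/shape are placeholders.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "float32"},
)

# Convert the NHWC graph coming from TFLite to NCHW so that AlterOpLayout
# can later rewrite conv2d to NCHWc for the x86-style schedules.
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "default"]}),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
```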

@tqchen @jackwish @FrozenGene @merrymercy @thierry @masahi @yzhliu


In this particular case, I think we want to keep both strategies (the one for NCHWc and the original one for NCHW), because NCHWc brings certain restrictions that may not be generally applicable.

Even better, we should make the NCHW schedule global across ARM and Intel when possible.

This is made easy by the strategy design from @haichen

In terms of the tuning cost, while that could be a concern, I think it is better to control the set of schedules in the AutoTVM configuration via a customized search strategy, rather than disabling them in the codepath. Manually disabling certain codepaths in the codebase couples these two perspectives together and may cause surprises in the future.

Rationale: we always want to decouple capabilities (e.g. we can perform the spatial pack schedule) from the optimization strategy (which subset of strategies are more promising). Because the collection of strategies represents a space that provides potential gains in different places, we can implement different optimization strategies to pick the best one; some of them can be more exhaustive, and others can be more nimble and directly avoid the ones we know are not promising.

In terms of the tuning cost, while that could be a concern, I think it is better to control the set of schedules in the configuration, rather than disabling them in the codepath

I agree. Actually, one can easily clean up the tasks in the AutoTVM script by looking at the task name. There might be a better way to control the set of config options from outside to give a better TVM user experience. But, in general, I agree that we can just add more options and not disable anything.


Thank you for bringing this up. I agree we can add the x86 conv2d schedule to the arm_cpu strategy and also keep the current spatial pack schedule.

Please do consider this. Also see What happened to x86 winograd?

The most important advantage of NCHWc is that we split C into an inner block and can vectorize over it. I think an NHWC schedule could achieve the same effect, which would let us avoid ConvertLayout and the layout transforms. This is why TFLite / TF chose the NHWC layout. The ConvertLayout pass is not always good if we have shape-transformation operators in the model, such as reshape / concat / squeeze, which is very common in object detection models.
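To make the channel-vectorization point concrete, here is a minimal TE sketch (a toy elementwise op, not a real conv2d schedule, assuming a TVM version with the te namespace). Both NCHWc and NHWC put a contiguous channel axis innermost, and that is the axis the schedule vectorizes.

```python
import tvm
from tvm import te

# Toy elementwise op in NHWC: the channel axis is innermost and contiguous,
# so it can be vectorized directly, just like the "c" block in NCHWc.
N, H, W, C = 1, 14, 14, 32
A = te.placeholder((N, H, W, C), name="A")
B = te.compute((N, H, W, C), lambda n, h, w, c: A[n, h, w, c] * 2.0, name="B")

s = te.create_schedule(B.op)
n, h, w, c = s[B].op.axis
co, ci = s[B].split(c, factor=8)   # mimic the NCHWc channel block of 8
s[B].vectorize(ci)                 # vectorize the innermost contiguous channel block
print(tvm.lower(s, [A, B], simple_mode=True))
```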

For the performance comparison, please also consider single-threaded results. TFLite's thread pool is very bad, so single-thread performance reflects its real power better.

For ARM winograd, yes, we should port it to x86. Considering we have NCHWc on x86, we should do NCHWc winograd.


For the performance comparison, please also consider single-threaded results. TFLite's thread pool is very bad, so single-thread performance reflects its real power better.

Makes sense. I will add it tomorrow. I have 2 devices at my disposal, and they are busy tuning for now :slight_smile:

The ConvertLayout pass is not always good if we have shape-transformation operators in the model, such as reshape / concat / squeeze, which is very common in object detection models.

We can work on improving ConvertLayout. I think it handles concat already; reshape and squeeze are not supported AFAIK. This is a separate topic, but over time ConvertLayout should support a large number of operators. It is beneficial in general, not just for this discussion.

For the NHWC schedule and NCHWc winograd efforts, we can handle them in separate PRs. Those require considerable effort.

Nice!

This will bring extra layout transform ops and overhead, which is why I brought it up in this post. I suspect an NHWC schedule could achieve the same performance as NCHWc. If so, we don't need the extra layout transforms.

For the NHWC schedule, @jackwish has committed the convolution part (lacking depthwise convolution, as I remember), so we should be able to bring this in easily. NCHWc winograd has been done by @ajtulloch before (Improved Direct + Winograd NCHWc CPU implementation, with ResNet-50 results), but it may need some effort to enable. We could do these in other PRs like you said.

I created an Issue here to track all the ideas - https://github.com/apache/incubator-tvm/issues/5340

I like the overall direction we are going in, i.e. making the schedule templates as generic as possible and allowing easy selection of combinations during strategy declaration. This will come in very handy when adding support for other CPU types.


Thanks all. The PR is merged.

For this discuss post, there is one item remaining: should we add a TFLite AutoTVM tutorial (or extend the existing TFLite compilation tutorial - https://docs.tvm.ai/tutorials/frontend/from_tflite.html#sphx-glr-tutorials-frontend-from-tflite-py) to make it use the changes made in the above PR?

There are 2 action items:

  • Use ConvertLayout to go to NCHW, so that AlterOpLayout can convert to NCHWc.
  • Add an autotvm util function along the lines of autotvm.remove_template(tasks, template_name) that gives TVM users some control over the config options (a rough sketch follows below).
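The proposed util does not exist yet; a rough sketch of what such a helper could look like (hypothetical name and behavior, filtering on the template name that AutoTVM already records on each task) is:

```python
def remove_template(tasks, template_name):
    """Hypothetical helper: drop extracted AutoTVM tasks whose template
    matches `template_name` (e.g. "conv2d_nchw_spatial_pack.arm_cpu"),
    so users can trim the search space before tuning."""
    return [task for task in tasks if task.name != template_name]

# Usage sketch, assuming `tasks` came from autotvm.task.extract_from_program():
# tasks = remove_template(tasks, "conv2d_nchw_spatial_pack.arm_cpu")
```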

@FrozenGene let us know what you think about this.

@anijain2305 @FrozenGene Do we have any comparison between conv2d_spatial_pack_nhwc vs conv2d_spatial_pack_nchw for ARM? One potential issue with the current conv2d_spatial_pack_nhwc and conv2d_spatial_pack_nchw schedules is that, to be prefetch- and register-tiling friendly, we need to pack/unpack the data layout internally in the TOPI compute, which causes quite a lot of overhead. This means that if we just keep the NCHW or NHWC layout at the Relay level, we cannot avoid these overheads. That is why we proposed the NCHWc layout for x86 and use the graph tuner to reduce layout transformations.

I agree we need to further improve the ConvertLayout pass to support more operators. However, if we can already get a performance improvement using ConvertLayout + the NCHWc schedule, it is still worth recommending, or at least mentioning, in the tutorial.

@anijain2305

The schedules comparison looks great.

Could we get the tutorials or scripts to reproduce the results reported here?

I see the committed code, but its usage is a little difficult for users who want to apply TVM's newest features.

I prefer creating another tutorial. from_tflite should be like the other frontend tutorials, which just show how to import a model into TVM and run it.

I prefer adding a util function so that users can choose it by themselves, unless we find this should be the default and we won't consider other options.

I have searched everywhere on my computer for the data. However, I only found one set of quantized performance numbers that I could publish.

[image: quantized performance comparison - spatial pack (NCHW) vs. TVM tensorize (NHWC)]

The spatial pack here is NCHW layout (NHWC is missing). TVM tensorize is NHWC layout. I am sorry, I lost the data for the NHWC spatial pack. For FP32 spatial pack (NHWC), I remember it performed better than NCHW. NCHWc data was not collected.

I understand. I will try to get this into some kind of tutorial if possible. If that takes too much time, I will share a script to get you started.

Sorry, I did not understand this. The current tutorial - https://docs.tvm.ai/tutorials/frontend/from_tflite.html#sphx-glr-tutorials-frontend-from-tflite-py - already shows how to import a TFLite model, compile it, and execute it.

If I understand correctly, you are suggesting to write a new tutorial from scratch that shows how to run AutoTVM on TFlite models, right?

Yes, your understanding is correct. The current TFLite tutorial should just show how to import/compile/execute like the other frontend tutorials.

One possible way to organize this is to add a note to the "Compile tflite model" tutorial that guides users to the corresponding AutoTVM tutorial for optimization. One example is the deploy SSD models tutorial, which contains a note section indicating where to go for optimization. For the optimization part, we can add a TFLite model to the "Auto-tuning a convolutional network for ARM CPU" tutorial and show how to use ConvertLayout + NCHWc to get more performance. Does this sound reasonable?

Ok, I will spend some time writing a new tutorial - Autotuning a TFLite model for ARM.

I will run AutoTVM with the FP32 NHWC schedule. We can then compare ARM NCHW, ARM NHWC, and Intel NCHWc, and decide if we want ConvertLayout in the tutorial.

In either case, the tutorial can discuss all the possible data layout options and set expectations for each. I will try to keep all the options open in the tutorial - ARM NHWC, ARM NCHW, and Intel NCHWc.