Questions about the conv2d schedule template of arm_cpu

XinchengHan · November 28, 2018, 1:29am

I am confused about some design decisions in topi/arm_cpu/conv2d.py:

‘Direct’ and ‘Winograd’ methods are provided for conv, but the popular ‘im2col’ is not. Why?
Is there any theoretical or experimental conclusion on which is better in terms of efficiency?
If I want to try a new method for conv (e.g. im2col) and let autotvm pick the best one, what do I need to do besides providing a schedule template for the method?
I am not good at Python, and the registration mechanism in topi/autotvm seems hard for me. It would be nice if someone could make the process clear or provide some samples.
In the ‘winograd’ method, why is tile_size=4 fixed instead of being tunable? It seems that some other frameworks choose tile_size=6 or 2 for different shapes.

eqy · November 28, 2018, 1:39am

im2col can have some benefits for certain layouts. We will welcome a PR that adds an im2col template to autotvm.
To support another algorithm strategy, such as im2col, a few steps are needed in addition to providing the schedule template.
First, you must register the compute declaration (you can borrow this from old im2col code) that describes the computation in addition to the data layout transformations. The example for the direct case is here (this does not have a data layout transformation step).
Then, hook in your schedule template function here.
Basically the steps are to add im2col so that the correct compute declarations/schedule functions fire when that strategy is chosen.
@merrymercy can provide clarification here.
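
To make the registration idea concrete, here is a minimal pure-Python sketch of decorator-based dispatch keyed by template name. Everything in it (`COMPUTES`, `SCHEDULES`, `register_compute`, `register_schedule`, `build`) is hypothetical and is not the autotvm API; it only illustrates how a strategy name can select a matching compute declaration and schedule function.

```python
# Hypothetical, simplified registry. This is NOT the autotvm/topi API;
# it only mimics the "register by template key, dispatch by template key" idea.
from typing import Callable, Dict

COMPUTES: Dict[str, Callable] = {}
SCHEDULES: Dict[str, Callable] = {}


def register_compute(template_key: str):
    """Return a decorator that stores a compute declaration under template_key."""
    def decorator(func: Callable) -> Callable:
        COMPUTES[template_key] = func
        return func
    return decorator


def register_schedule(template_key: str):
    """Return a decorator that stores a schedule function under template_key."""
    def decorator(func: Callable) -> Callable:
        SCHEDULES[template_key] = func
        return func
    return decorator


@register_compute("direct")
def conv2d_direct_compute(data_shape, kernel_shape):
    # Stand-in for a real compute declaration.
    return f"direct compute(data={data_shape}, kernel={kernel_shape})"


@register_schedule("direct")
def conv2d_direct_schedule(compute):
    # Stand-in for a real schedule template.
    return f"direct schedule for [{compute}]"


@register_compute("im2col")
def conv2d_im2col_compute(data_shape, kernel_shape):
    # A new strategy adds its own compute (including any layout transform).
    return f"im2col compute(data={data_shape}, kernel={kernel_shape})"


@register_schedule("im2col")
def conv2d_im2col_schedule(compute):
    return f"im2col schedule for [{compute}]"


def build(template_key, data_shape, kernel_shape):
    """Fire the compute/schedule pair that matches the chosen strategy."""
    compute = COMPUTES[template_key](data_shape, kernel_shape)
    return SCHEDULES[template_key](compute)


print(build("im2col", (1, 3, 8, 8), (4, 3, 3, 3)))
```

In the real code base the decorators also carry the target and the operator name, so treat this only as a mental model of the dispatch step.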

XinchengHan · November 28, 2018, 3:41am

@eqy Thanks for your reply. It helps a lot!

So you mean all I need to do is complete the compute/schedule process for the new strategy using @autotvm.register_topi_xxx, and autotvm will then list it as a candidate while tuning. Is that correct?

eqy · November 28, 2018, 7:36pm

Yes, you can specify a specific template you want to tune like the tutorial does. I think the default behavior if you want to tune from a pre-specified graph (e.g., model defined in NNVM) is to only use the direct template as that is what task extraction produces.

For now, if you just want to try some experiments with templates, you can look at a standalone example if you want to avoid the declaration and schedule boilerplate.

XinchengHan · November 30, 2018, 7:10am

@eqy Thanks a lot!

@merrymercy Can you help me with the tile_size here?

I tried setting tile_size=6 for better performance. However, the run time increased by about 30%.

And for some input shape, I got the following warning:
.../vectorize_loop.cc:303: Detect vector candition in Vectorized Loop, scalarizing...

Why is the vectorized loop affected? In my understanding, the vectorization happens along the P = N * nW * nH axis, doesn’t it?

merrymercy · November 30, 2018, 6:35pm

Direct with tuning is better than im2col in most cases. The benefit of im2col is that it makes it easy to use BLAS libraries, but we don’t use these libraries in tvm.
The tile_size is chosen based on the benchmark on common networks.
We don’t make it tunable because in the current implementation the tile size affects the size of the tuning space, i.e., tile size 2 and tile size 4 have different tuning spaces. But you can try to eliminate this effect and make it tunable.
However, we also observed some problems related to measurement (see the thread “Improved Direct + Winograd NCHWc CPU implementation, with ResNet-50 results”).
I think some fixes are required.
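
The dependence of the tuning space on the tile size can be illustrated with a small count. This is only a sketch under assumptions: it supposes the template has a two-way split knob over the flattened tile axis P = N * nH * nW, and the helper names (`flattened_tiles`, `count_two_way_splits`) and the 56x56 output size are made up for the example.

```python
import math


def flattened_tiles(batch, out_h, out_w, tile_size):
    """P = N * nH * nW, where nH and nW are the tile counts per dimension."""
    n_h = math.ceil(out_h / tile_size)
    n_w = math.ceil(out_w / tile_size)
    return batch * n_h * n_w


def count_two_way_splits(extent):
    """Number of (outer, inner) factor pairs with outer * inner == extent."""
    return sum(1 for d in range(1, extent + 1) if extent % d == 0)


# Assumed example: batch 1, 56x56 output feature map.
for tile_size in (2, 4, 6):
    p = flattened_tiles(1, 56, 56, tile_size)
    print(f"tile_size={tile_size}: P={p}, split candidates={count_two_way_splits(p)}")
```

Because P changes with the tile size, the set of valid split factors changes too, so a configuration tuned for one tile size does not index into the space of another.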

I think your vectorization problem is due to this line (https://github.com/dmlc/tvm/blob/94acff30e82f9352e1652bd81be260b672419aea/topi/python/topi/arm_cpu/conv2d.py#L370). As we will do vectorization on the last dimension, maybe nW is too small and bb is larger than nW so we cannot eliminate that mod operation.
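
A small arithmetic sketch of that hypothesis, in plain Python rather than TVM: the vector lanes cover consecutive values of the flattened tile index b, and the tile column is b % nW. When a vector block crosses a row boundary, the column index wraps around and the lanes are no longer contiguous. The function names and the parameter `lanes` (standing in for bb) are assumptions for illustration; this is not how TVM's simplifier is implemented.

```python
def lane_columns(block, lanes, n_w):
    """Tile-column index (b % nW) seen by each lane of one vector block."""
    return [(block * lanes + lane) % n_w for lane in range(lanes)]


def is_contiguous(indices):
    """True if every lane reads the column right after the previous lane."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


def all_blocks_contiguous(num_tiles, lanes, n_w):
    """Check every full vector block along the flattened tile axis."""
    return all(
        is_contiguous(lane_columns(block, lanes, n_w))
        for block in range(num_tiles // lanes)
    )


# nW is a multiple of the vector width: no block crosses a row boundary.
print(lane_columns(0, 4, 4), all_blocks_contiguous(16, 4, 4))
# nW smaller than the vector width (e.g. fewer tiles per row with a larger
# tile size): the column index wraps inside a block.
print(lane_columns(0, 4, 3), all_blocks_contiguous(12, 4, 3))
```

With nW = 3 and four lanes, the first block reads columns 0, 1, 2, 0, which is the kind of access pattern that cannot be expressed as a single contiguous vector load.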

FrozenGene · December 3, 2018, 2:26am

I have implemented an im2col autotvm version on ARM CPU, but I didn’t observe better performance than SpatialPack on MobileNet. So I don’t think we need to add it. But you can try it.
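
For readers unfamiliar with im2col, the following pure-Python sketch shows the idea on a single channel: every output pixel's input patch is copied into one row of a matrix, so the convolution becomes a matrix-vector product (a GEMM once there are several output channels). It is a generic illustration with made-up function names, not the implementation mentioned above.

```python
def conv2d_direct(image, kernel):
    """Stride-1, no-padding 2-D convolution (cross-correlation), one channel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(ow)
        ]
        for y in range(oh)
    ]


def im2col(image, kh, kw):
    """One row per output pixel, holding that pixel's flattened kh x kw patch."""
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [image[y + i][x + j] for i in range(kh) for j in range(kw)]
        for y in range(oh)
        for x in range(ow)
    ]


def conv2d_im2col(image, kernel):
    """Same convolution, computed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    ow = len(image[0]) - kw + 1
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    # Reshape the flat result back to (oh, ow).
    return [flat_out[r * ow:(r + 1) * ow] for r in range(len(flat_out) // ow)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(conv2d_direct(image, kernel))
print(conv2d_im2col(image, kernel))
```

The patch matrix duplicates each input element up to kh * kw times, which is the extra memory traffic that im2col pays in exchange for a plain matrix product.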

XinchengHan · December 3, 2018, 3:37am

Thanks for such a clear explanation. I will check the details you mentioned.

The reason I ran into this problem is that I expected Winograd F(6x6, 3x3) to perform better than F(4x4, 3x3) in theory, since the former eliminates more multiplications. But the test data told me a different story.
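
The theoretical saving can be counted directly. F(m x m, r x r) needs (m + r - 1)^2 elementwise multiplications per tile, against m^2 * r^2 for direct convolution. The sketch below (helper names are made up) counts only those multiplications for one input/output channel pair; it ignores the input, kernel, and output transforms, memory traffic, and numerical accuracy, so it is an upper bound on the benefit rather than a performance model.

```python
import math


def winograd_mults_per_tile(m, r=3):
    """Elementwise multiplications for one F(m x m, r x r) tile."""
    return (m + r - 1) ** 2


def direct_mults_per_tile(m, r=3):
    """Multiplications direct convolution needs for the same m x m outputs."""
    return (m * m) * (r * r)


def winograd_mults(out_h, out_w, m, r=3):
    """Total multiplications, including tiles that overhang the output edge."""
    tiles = math.ceil(out_h / m) * math.ceil(out_w / m)
    return tiles * winograd_mults_per_tile(m, r)


for m in (2, 4, 6):
    ratio = direct_mults_per_tile(m) / winograd_mults_per_tile(m)
    print(f"F({m}x{m}, 3x3): per-tile reduction {ratio:.4g}x")

# Assumed square output sizes, chosen only for illustration.
for size in (56, 14, 7):
    counts = {m: winograd_mults(size, size, m) for m in (2, 4, 6)}
    print(f"{size}x{size} output: {counts}")
```

On a 56x56 output, tile size 6 needs fewer multiplications than tile size 4, but on 14x14 the two are equal and on 7x7 tile size 6 needs more, because partially filled edge tiles still cost a full tile.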

Are there any other factors I should take into account when analysing this?

XinchengHan · December 3, 2018, 4:04am

Thanks for the information!

I’ve noticed that you have done a lot of optimization work on arm_cpu, and now I am on the same road.
I would appreciate it if you could provide some guidance on how to optimize conv.

I followed your suggestion here (by the way, it helps a lot, thanks). Now I roughly understand how the compute/schedule process works for each operator.
But it’s hard for me to go deeper (e.g. into the lowering process) since I lack compiler knowledge, which makes the work that follows difficult.

FrozenGene · December 3, 2018, 6:26am

I think it is a good start: https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py

Try to understand the concepts in the link above, which are the foundation of convolution optimization. The link also shows the lowered IR you want to learn about.

After this, try to understand the existing spatial pack schedule. Then try to modify it and implement your own schedule.