[NNVM] FoldScaleAxis for Winograd and NCHWc convolution

Hi, I noticed that the FoldScaleAxis pass is not enabled for CUDA Winograd convolution with a precomputed weight transform, or for x86 NCHWc convolution. As a result, the “broadcast_mul” op from batch norm doesn’t go away after compiling. Is this intended? @merrymercy @yzhliu
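
For anyone following along, here is a minimal sketch of the identity the fold relies on: multiplying each output channel of a convolution by a constant gives the same result as pre-multiplying the corresponding filter. This is plain Python, not NNVM code, and every function name in it is made up for illustration.

```python
# Plain-Python sketch (not NNVM code): folding a per-output-channel scale
# into convolution weights. All function names here are hypothetical.

def conv2d(x, w):
    """Direct convolution, stride 1, no padding, batch dimension omitted.
    x: [C_in][H][W], w: [C_out][C_in][KH][KW]."""
    c_in, h, width = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0][0]), len(w[0][0][0])
    out = []
    for filt in w:
        plane = []
        for i in range(h - kh + 1):
            row = []
            for j in range(width - kw + 1):
                acc = 0.0
                for ic in range(c_in):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += x[ic][i + di][j + dj] * filt[ic][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def scale_channels(y, scale):
    """The leftover broadcast_mul: multiply each output channel by its scale."""
    return [[[v * scale[c] for v in row] for row in plane]
            for c, plane in enumerate(y)]

def fold_scale_into_weight(w, scale):
    """Pre-multiply each output filter by its scale, so no runtime mul is needed."""
    return [[[[v * scale[oc] for v in krow] for krow in kern] for kern in filt]
            for oc, filt in enumerate(w)]

def flatten(y):
    """Flatten [C][H][W] into a flat list for easy comparison."""
    return [v for plane in y for row in plane for v in row]

# Small deterministic inputs: 2 input channels, 3x3 image, 2 filters of 2x2.
x = [[[float(c + 3 * i + j) for j in range(3)] for i in range(3)] for c in range(2)]
w = [[[[0.5 * (oc + 1) + 0.1 * ic + 0.01 * (2 * di + dj) for dj in range(2)]
       for di in range(2)] for ic in range(2)] for oc in range(2)]
scale = [1.5, -0.25]

unfolded = flatten(scale_channels(conv2d(x, w), scale))       # conv, then mul
folded = flatten(conv2d(x, fold_scale_into_weight(w, scale)))  # mul folded into weights
max_diff = max(abs(a - b) for a, b in zip(unfolded, folded))
print(max_diff < 1e-9)
```

Since the Winograd weight transform is linear in the kernel, I would expect the same identity to carry over to pre-transformed weights, but I have not checked how the pass would need to handle that weight layout.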

To enable this, we need to add registrations for

_contrib_conv2d_NCHWc
_contrib_conv2d_winograd_without_weight_transform

following these lines here

Also, I think the AlterLayout pass needs to happen after the FoldScaleAxis pass.

I’m aware of this issue. We should add support for it.

@yzhliu OK, is there any reason the AlterLayout pass should happen before everything else, here?

It needs to happen before simplify_inference, which is able to handle the modified shape. Otherwise, once batch_norm is dissolved, it is no longer easy to target batch norm, so layout transforms get introduced for the dissolved broadcast_add, broadcast_mul, etc. It can still work, but performance is not as good.
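
To illustrate the layout point, here is a rough sketch of why a per-channel multiply needs its operand rearranged once the data is in a blocked NCHWc layout. It is plain Python rather than NNVM code, the batch dimension is left out, and the helper names are hypothetical.

```python
# Plain-Python sketch (not NNVM code), batch dimension omitted.
# Function names are hypothetical.

def pack_nchwc(x, block):
    """Repack [C][H][W] into the blocked layout [C // block][H][W][block]."""
    channels, h, w = len(x), len(x[0]), len(x[0][0])
    assert channels % block == 0
    return [[[[x[co * block + ci][i][j] for ci in range(block)]
              for j in range(w)] for i in range(h)]
            for co in range(channels // block)]

def pack_scale(scale, block):
    """Split a length-C scale vector into [C // block][block] to match the data."""
    return [scale[co * block:(co + 1) * block] for co in range(len(scale) // block)]

def mul_nchw(x, scale):
    """Per-channel multiply on plain [C][H][W] data."""
    return [[[v * scale[c] for v in row] for row in plane] for c, plane in enumerate(x)]

def mul_nchwc(xp, scale_p):
    """Per-channel multiply on blocked data, using the blocked scale."""
    return [[[[v * scale_p[co][ci] for ci, v in enumerate(vec)] for vec in row]
             for row in plane] for co, plane in enumerate(xp)]

x = [[[float(10 * c + 2 * i + j) for j in range(2)] for i in range(2)] for c in range(4)]
scale = [1.0, 2.0, 3.0, 4.0]
block = 2

reference = pack_nchwc(mul_nchw(x, scale), block)                     # multiply, then pack
blocked = mul_nchwc(pack_nchwc(x, block), pack_scale(scale, block))   # pack both, then multiply
print(reference == blocked)
```

The two paths agree only because the scale vector was repacked as well. That extra repacking of the multiply and add operands is the kind of layout transform that appears once batch norm has already been dissolved.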

I think AlterOpLayout can happen before FoldScaleAxis, but does FoldScaleAxis depend on SimplifyInference?

Yes, FoldScaleAxis looks for the broadcast_mul op, which is unpacked from batch norm. So FoldScaleAxis should happen after SimplifyInference.
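
As a sketch of that dependency: at inference time batch norm is a per-channel multiply followed by a per-channel add, and the multiply is what FoldScaleAxis can then fold. The arithmetic below is the standard inference-time batch norm formula written in plain Python. It is not the actual pass implementation, and the function names and the eps default are assumptions.

```python
import math

# Plain-Python sketch of the arithmetic only (not NNVM code).
# One value per channel keeps the example short; names are hypothetical.

def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch norm, applied per channel."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(x, gamma, beta, mean, var)]

def unpack_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Rewrite batch norm as x * scale + shift (a broadcast_mul, then a broadcast_add)."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    shift = [b - m * k for b, m, k in zip(beta, mean, scale)]
    return scale, shift

x = [0.5, -1.0, 2.0]
gamma = [1.0, 0.5, 2.0]
beta = [0.0, 0.1, -0.3]
mean = [0.2, -0.4, 1.0]
var = [1.0, 0.25, 4.0]

scale, shift = unpack_batch_norm(gamma, beta, mean, var)
mul_add = [v * k + t for v, k, t in zip(x, scale, shift)]   # mul, then add
direct = batch_norm_inference(x, gamma, beta, mean, var)
bn_max_diff = max(abs(a - b) for a, b in zip(mul_add, direct))
print(bn_max_diff < 1e-12)
```

Until the batch norm has been rewritten into this mul/add form, there is no standalone multiply in the graph for a scale-folding pass to pick up.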

This issue seems more complicated than I thought… And I don’t know how much enabling FoldScaleAxis would help performance. The difference would probably be minimal.

According to my test on an ARM CPU, the difference is very small.