[Quantization][AutoTVM] Performance degradation due to transpose ops when NCHW layout is used

Hi,

I would like to deploy a TensorFlow model on x86 and also use quantization. After some preliminary benchmarking I noticed that NHWC provides better performance than NCHW. However, quantization requires NCHW, and switching to it results in a performance degradation. I also noticed that using the NCHW layout inserts a lot of transpose ops, which might be the reason for the slowdown.

When I tune with AutoTVM, the performance of the quantized model gets closer to, but still remains worse than, the FP32 performance.

I would appreciate it if someone could clarify this situation for me.

Thanks

@vinx13 could you maybe comment on this? How can the transpose operators be avoided? The input image to the network has the NHWC data layout.

There has been discussion on frontend layout transformation (https://github.com/dmlc/tvm/issues/2519).
Transpose ops are inserted because of the difference between the original layout and the target layout.
If you want to use the NCHW layout, you can try converting your model to NCHW and then passing it to the frontend. In this way we can avoid transposes between conv layers.
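
For reference, a minimal sketch of doing this at import time, assuming a frozen graph `frozen_model.pb` with a single NHWC input named `input` (both names are placeholders): the Relay TensorFlow frontend accepts a `layout` argument that converts the graph to NCHW internally.

```python
import tensorflow as tf
from tvm import relay

# Load the frozen TensorFlow graph (path and input name are placeholders).
with tf.io.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

# Asking the frontend for NCHW converts the conv layers to the layout
# required by quantization, so transposes between conv layers are avoided.
shape_dict = {"input": (1, 224, 224, 3)}  # shape as declared in the TF graph (NHWC)
mod, params = relay.frontend.from_tensorflow(graph_def, layout="NCHW", shape=shape_dict)
```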

Hi @vinx13, I have some further questions:

Is there any tool that can change the data layout of the model, or do I have to change the TensorFlow code of the model manually?

Why is only the NCHW layout supported by TVM quantization?

Is the NCHW layout better for x86?

Thanks

Is there any tool that can change the data layout of the model, or do I have to change the TensorFlow code of the model manually?

It depends on the framework you are using.

Why is only the NCHW layout supported by TVM quantization?

It would be good to support both layouts. Contributions are welcome.

Is there any tool that can change the data layout of the model, or do I have to change the TensorFlow code of the model manually?

It depends on the framework you are using.

TensorFlow is the one I am using. Are you aware of any tool for the layout transformation?

Also, regarding my other question on x86: is one layout better suited to GPUs versus CPUs?

Thanks

Yes, layout can affect the efficiency of memory access.
On GPU it is NCHW or NCHW4c; on CPU it is NCHW[x]c (e.g. NCHW16c).
You need to benchmark to decide which layout is best.
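
A rough benchmarking sketch, using the `graph_executor` API of recent TVM releases (the target string, input name, and shape are placeholders):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def benchmark(mod, params, input_name, input_shape, target="llvm -mcpu=skylake-avx512"):
    """Compile a Relay module and print the mean inference time on CPU."""
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    dev = tvm.cpu(0)
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(input_name, np.random.uniform(size=input_shape).astype("float32"))
    timer = m.module.time_evaluator("run", dev, number=10, repeat=3)
    print("mean inference time: %.2f ms" % (timer().mean * 1000.0))

# Import the same TensorFlow graph once per layout (see the snippet earlier
# in this thread) and compare the reported times, e.g.:
# benchmark(mod, params, "input", (1, 224, 224, 3))
```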

Hi @vinx13

Could you please confirm that only the NCHW layout can be quantized, and not NHWC?

Thanks

Yes, only NCHW for now.

OK, it would then be nice to put an assertion in the quantization pass to make this point clear, just as AutoTVM does, which raises an assertion saying that NHWC is not supported.
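
Something along these lines could serve as an early check before calling the quantization pass; this is only a hypothetical helper (`assert_nchw` is not part of TVM), not the actual pass:

```python
import tvm
from tvm import relay

def assert_nchw(mod):
    """Hypothetical pre-quantization check: fail early if any conv2d
    uses a non-NCHW data layout, instead of failing later in AutoTVM."""
    def visit(expr):
        if isinstance(expr, relay.Call) and isinstance(expr.op, tvm.ir.Op) \
                and expr.op.name == "nn.conv2d":
            layout = expr.attrs.data_layout
            assert layout == "NCHW", \
                "Quantization currently supports only NCHW, found %s" % layout
    relay.analysis.post_order_visit(mod["main"], visit)

# assert_nchw(mod)  # call this right before relay.quantize.quantize(mod, params)
```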

@vinx13 I was wondering if you could tell me what the best approach is to select the global scale, and whether it is better to use a global scale or local scales? Also, could you tell me how to set local scales? I have not been able to find out how to do that.

For the global scale, since we prefer power-of-two scales, there are only a few candidates (4, 8, 16, …); you can try them on your dataset.
For local scales, you can take a look at this script https://gist.github.com/vinx13/6f1eb1f9e2c0a8786149ee881bfcd6aa , which uses KL divergence to compute the scales. (This script is outdated; I will update it soon.)
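
A sketch of both options against a recent TVM, where `mod`, `params`, `calibration_batches`, the input name, and `evaluate_accuracy` are placeholders; newer releases also expose a built-in KL-divergence calibration mode that plays a similar role to the gist above:

```python
from tvm import relay

# Global scale: sweep the power-of-two candidates and keep the one with
# the best accuracy on a validation set.
for scale in (4.0, 8.0, 16.0):
    with relay.quantize.qconfig(global_scale=scale):
        qmod = relay.quantize.quantize(mod, params=params)
    # evaluate_accuracy(qmod)  # your own validation loop goes here

# Per-layer (local) scales: feed a small calibration dataset and let the
# KL-divergence calibration pick a scale per layer.
def calibrate_dataset():
    for batch in calibration_batches:  # placeholder iterable of numpy arrays
        yield {"input": batch}

with relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
    qmod = relay.quantize.quantize(mod, params=params, dataset=calibrate_dataset())
```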

Hi @vinx13

It would be great to have this script updated, thanks for that! Is there any documentation on local scales in TVM? Also, I was wondering if you could clarify a bit the differences and the pros and cons of local vs. global scales?

Thanks!