[Quantization][AutoTVM] Performance degradation of Quantization + AutoTVM vs FP on x86

Hi,

I am trying to quantize and tune some TensorFlow models on x86. However, the performance results are extremely poor compared with the non-quantized version. The numbers are as follows:

  • First model
    TVM FP32: 35.05ms
    TVM int8 quantization: 80ms
    TVM int8 quantization + AutoTVM: 46.87ms

  • Second model
    TVM FP32: 72.85ms
    TVM int8 quantization: 159.33ms
    TVM int8 quantization + AutoTVM: 112.39ms

What is the reason for such bad performance? What can be done to improve it?

@vinx13 Any ideas?

I would suggest comparing conv2d performance layer by layer to see if we can improve the current int8 conv2d implementation. We can also check whether the fusion result (after the FuseOps pass) is optimal.
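
For the layer-by-layer comparison, the debug runtime gives a per-operator time breakdown. A minimal sketch, assuming a Relay module `mod` with weights `params` and a single input named `data` (these names, the input shape, and the `-mcpu` flag are placeholders; newer releases use `tvm.transform.PassContext` and `debug_executor` instead):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib.debugger import debug_runtime

# Build the (quantized) Relay module for x86.
target = "llvm -mcpu=core-avx2"
with relay.build_config(opt_level=3):
    graph, lib, built_params = relay.build(mod, target=target, params=params)

# The debug runtime prints a per-node time breakdown on run(),
# so the int8 and fp32 conv2d layers can be compared one by one.
ctx = tvm.cpu(0)
m = debug_runtime.create(graph, lib, ctx)
m.set_input(**built_params)
m.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
m.run()
```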

Thanks for the suggestions. I will compare the performance of every conv2d with the TVM profiler. Regarding the fusion result, how can we verify that it is optimal?

You can check whether there is anything fusible that was not fused.
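
For example, you can run the fusion pass yourself and print the module. A sketch, again assuming a Relay module named `mod`:

```python
from tvm import relay

# Run type inference plus the same fusion pass TVM applies during
# compilation, then print the module: fused groups show up as
# "Primitive" Relay functions, while anything left as a standalone
# call was not fused.
mod = relay.transform.InferType()(mod)
fused = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(fused)
```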

@vinx13

After running the TVM profiler and comparing the results with and without quantization, it is clear that the fused convolutions are slower in the quantized version. In fact, they are twice as slow as in FP32.

I also see that, because of the data layout used, the added transpose operators are not quantized, which means that before and after every convolution there is a conversion between INT and FP. This of course adds a lot of overhead.

Do you have any suggestions or thoughts about this?

Thanks

If you set all scales to be powers of two, the cast from int to fp can be avoided, since the rescaling can then be done with integer shifts.
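
To see why, here is a toy illustration in plain Python (not TVM code):

```python
# Requantizing an int32 accumulator x from scale s_in to s_out normally
# needs float math: y = round(x * (s_in / s_out)).
# When both scales are powers of two, s_in / s_out == 2**-k for an
# integer k, and the same rescaling is a pure integer shift:
x = 1000    # int32 accumulator value
k = 3       # -log2(s_in / s_out); an integer when scales are powers of 2
y = x >> k  # floor division by 2**k, no floating point involved
print(y)    # 125
```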

Regarding the performance degradation, @kevinthesun @janimesh might have ideas?

Do you mean setting the parameter like this: global_scale=8.0? Or what do you mean by scales?

Either global_scale=8.0 or your own custom scales set during calibration.
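
For the global-scale case, it goes in the quantization config. A sketch, assuming the relay.quantize API with a Relay module `mod` and weights `params`:

```python
from tvm import relay

# A single power-of-2 global scale for all layers; this keeps the
# requantization in the integer domain (see the shift example above).
with relay.quantize.qconfig(global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)
```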

OK, and how do you set the custom scales during calibration?

calibrate accepts a scales argument; you can call calibrate to set the scales.
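
In the later public API, the custom scales are derived from a calibration dataset passed to quantize. A sketch under that assumption; the input name "data" and the `calibration_batches` iterable are placeholders:

```python
from tvm import relay

def calib_dataset():
    # Placeholder: yield a few batches of real inputs keyed by input name.
    for batch in calibration_batches:
        yield {"data": batch}

# calibrate_mode="kl_divergence" derives per-layer scales from the
# dataset; weight_scale="power2" rounds weight scales to powers of two,
# so the int -> fp cast remains avoidable.
with relay.quantize.qconfig(calibrate_mode="kl_divergence",
                            weight_scale="power2"):
    qmod = relay.quantize.quantize(mod, params, dataset=calib_dataset())
```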