[Relay][AutoQ] Bad accuracy for Automatic Quantization

I am testing the robustness of automatic quantization across frameworks - TF, PyTorch, and MXNet. I ran resnet50 and validated on 1000 images from the ImageNet validation dataset. Currently, the results are inconsistent across frameworks.

The data has been preprocessed for each framework separately, which is also reflected in the original accuracy. The numbers below are (top-1, top-5) accuracy for three modes:

  • Original FP32 model
  • No data - only a global scale is set, no calibration
  • Data - calibration with kl_divergence on a calibration dataset
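
For context on the third mode: kl_divergence calibration picks a clipping threshold for each layer's activations by minimizing the KL divergence between the original activation histogram and its quantized approximation. The sketch below is a simplified, self-contained illustration of that idea (TVM's actual implementation differs in details such as per-layer statistics collection and how merged bins are expanded back); `find_threshold` and its parameters are illustrative names, not TVM APIs.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two histograms, treated as discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

def find_threshold(samples, num_bins=2048, num_quant_bins=128):
    """Pick a clipping threshold for |activations| that minimizes the KL
    divergence between the original histogram and its quantized version.
    Simplified sketch - not TVM's implementation."""
    hist, edges = np.histogram(np.abs(samples), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_quant_bins, num_bins + 1):
        # reference distribution: clip everything beyond bin i into the last bin
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()
        # simulate quantization: collapse i bins down to num_quant_bins,
        # then expand back by spreading each merged bin's mass uniformly
        idx = (np.arange(i) * num_quant_bins) // i
        merged = np.bincount(idx, weights=p, minlength=num_quant_bins)
        counts = np.bincount(idx, minlength=num_quant_bins)
        q = merged[idx] / counts[idx]
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t
```

On a heavy-tailed activation distribution, this selects a threshold well below the max value, clipping rare outliers so the int8 range is not wasted on them - which is why a bad calibration set (or a mismatch with the preprocessing used at eval time) can tank accuracy the way some of the numbers below do.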

MXNet

mxnet_resnet50_v1_fp32 (74.3, 91.4)
mxnet_resnet50_v1_no_data (74.4, 91.5)
mxnet_resnet50_v1_data (6.8, 17.2)

Tensorflow

tf_resnet_50_fp32 (71.0, 92.0)
tf_resnet_50_no_data (27.0, 56.0)
tf_resnet_50_data (60.0, 83.0)

Pytorch

pytorch_resnet50_fp32 (76.6, 91.4)
pytorch_resnet50_no_data (73.6, 90.4)
pytorch_resnet50_data (29.0, 47.9)

Are there any known issues we can work on to improve the accuracy? It might also be a good idea to work together on improving the robustness of quantization and extending it to other model types, such as object detection and BERT.

If anyone is interested, please let me know. I can provide my scripts to reproduce these results.

@vinx13 @masahi @eqy @ziheng @tico