Benchmarking Quantization on Intel CPU


#1

Background: There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also recently gained the ability to quantize weights (https://github.com/dmlc/tvm/pull/2116).

I am currently working on a systematic benchmark of existing frameworks for (post-training) quantization. A rigorous benchmark will help machine learning practitioners make informed decisions. Any suggestions are welcome.

Frameworks:

Models: for now, only Image Classification.

  • Inception V3
  • ResNet-50 V1
  • ResNet-152 V1
  • DenseNet201
  • MobileNet V2_1.0

Metrics to measure:

  • Top-1 / Top-5 Accuracy on ImageNet Validation Set
  • Inference time per image
  • Model size in memory (MB)
  • Model size as a serialized file (MB)

Benchmark environment: AWS EC2 C5d.18xlarge (72 vCPUs)

  • To account for variance due to virtualization and shared hardware, I will perform multiple trials by launching new instance(s). I will perform statistical tests to verify if an apparent difference in a metric is significant or due to chance.
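
As an illustration of such a test, here is a minimal sketch. The latency numbers and trial counts are made up, and SciPy’s Welch t-test is just one reasonable choice of significance test:

```python
import numpy as np
from scipy import stats

# Hypothetical per-trial mean latencies (ms/image) collected from
# separately launched instances, for FP32 vs. INT8 runs of one model.
fp32_latency = np.array([5.21, 5.34, 5.19, 5.40, 5.27])
int8_latency = np.array([2.96, 3.10, 2.88, 3.05, 2.99])

# Welch's t-test: does not assume equal variance between the two groups.
t_stat, p_value = stats.ttest_ind(fp32_latency, int8_latency, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
# A small p-value suggests the observed gap is unlikely to be explained
# by run-to-run (virtualization / shared-hardware) variance alone.
```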

To ensure that we are comparing apples to apples, benchmark code (with links to models used) will be published as a GitHub repository.


#2

For consistency, we will use the simplest calibration method for quantization: take the min and max values of layer outputs as the quantization thresholds.
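
A toy NumPy sketch of this min/max calibration idea (an illustration only, not TVM’s actual quantization pass):

```python
import numpy as np

def quantize_minmax(x, num_bits=8):
    """Affine-quantize a float tensor, using its min/max as the thresholds."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized values back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: quantize a fake layer output and check the round-trip error.
x = np.random.randn(1, 64, 56, 56).astype(np.float32)
q, scale, zp = quantize_minmax(x)
print("max abs error:", np.abs(dequantize(q, scale, zp) - x).max())
```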


#3

@hcho3.
Hi,
I am facing an issue with a quantized TensorFlow model:

tvm.error.OpNotImplemented: The following operators are not supported in frontend TensorFlow: ‘Dequantize’

This error occurs while running the TensorFlow model through the NNVM frontend.
Is there support for post-training quantized TensorFlow models in NNVM?


#4

@Vinayak618 No, currently TVM does not support reading from quantized TensorFlow models. Here is a proposal for supporting quantized TF-Lite models: https://github.com/dmlc/tvm/issues/2351. For now, you’ll have to load the original (non-quantized) model into TVM and use TVM’s quantization tools to perform quantization.
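
For reference, the workflow looks roughly like the sketch below. The file name, input name/shape, and qconfig settings are made up, and the exact API (relay.frontend.from_tensorflow, relay.quantize) may differ slightly between TVM versions:

```python
import tensorflow as tf
import tvm
from tvm import relay

# Load a frozen *float* (non-quantized) GraphDef; the file and input names
# here are hypothetical.
with tf.gfile.GFile("resnet50_v1_frozen.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

shape_dict = {"input": (1, 224, 224, 3)}
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)

# Apply TVM's own post-training quantization pass to the float model.
with relay.quantize.qconfig(global_scale=8.0):
    mod = relay.quantize.quantize(mod, params=params)

# Build the quantized module for an Intel CPU target.
with relay.build_config(opt_level=3):
    graph, lib, qparams = relay.build(mod, target="llvm -mcpu=skylake-avx512")
```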


#5

@hcho3,
Thank you for the response.
Can you point me to where I can find TVM’s quantization tools to apply them to a TensorFlow model?
I didn’t find that in the link above.

Also, one query not related to the above issue:
does opt_level in the TensorFlow NNVM frontend have any significance beyond opt_level 3?
I’m still getting results even at opt_level 10.

Thank you


#6

@Vinayak618 I’m trying to figure out TVM’s quantization pass myself, so I won’t be able to guide you right now. I will put up the benchmark code when it’s done, and you can look at it then. For now, you should look at the pull request https://github.com/dmlc/tvm/pull/2116.

And I think opt_level goes only up to 3.

P.S. If you have other questions, please open a new thread. Let’s keep this thread for discussing the benchmark proposal.


#7

Thank you, @hcho3.
Sure, we will keep this thread for discussing the benchmark proposal.
Once you are done with the benchmark code, please post it in this thread. In the meantime, I’ll go through the pull request.

Thank you


#8

@Vinayak618 I just found this script: https://gist.github.com/ZihengJiang/bcabe46a712a417a01a6967d4430b6b5, an example that feeds an MXNet model into TVM and runs quantization.
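
In case it helps, the rough shape of that flow is sketched below. The model choice and calibration settings are assumptions on my part; the gist is the authoritative reference:

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision
import tvm
from tvm import relay

# Grab a pretrained float model from the Gluon model zoo and import it.
block = vision.resnet50_v1(pretrained=True)
shape_dict = {"data": (1, 3, 224, 224)}
mod, params = relay.frontend.from_mxnet(block, shape_dict)

# Run TVM's quantization pass on the imported float model, as in the
# TensorFlow sketch above.
with relay.quantize.qconfig():
    mod = relay.quantize.quantize(mod, params=params)
```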


#9

Thank you so much @hcho3.
I’ll try it and share my observations.


#10

Hi @hcho3,
Thank you for referencing the code.
It works great. I’m able to quantize the model and get results.

Thank you once again.


#11

@hcho3 thanks for pushing on this

We are currently working on some enhancements to quantization on the TVM side, as some models (DenseNet, MobileNet) need per-channel quantization scale adjustment to avoid catastrophic accuracy loss.
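
To illustrate why this matters, here is a toy NumPy comparison of per-tensor vs. per-channel symmetric scales (the weight shapes and magnitudes are made up; this is not TVM’s implementation):

```python
import numpy as np

def per_tensor_scale(w, num_bits=8):
    # One symmetric scale shared by the whole weight tensor.
    return np.abs(w).max() / (2 ** (num_bits - 1) - 1)

def per_channel_scales(w, num_bits=8):
    # One symmetric scale per output channel (axis 0 of an OIHW weight).
    return np.abs(w).reshape(w.shape[0], -1).max(axis=1) / (2 ** (num_bits - 1) - 1)

def quant_error(w, scale):
    # Mean round-trip error of symmetric int8 quantization with the given scale(s).
    q = np.clip(np.round(w / scale), -127, 127)
    return np.abs(q * scale - w).mean()

# Toy OIHW weight whose output channels have very different magnitudes,
# loosely mimicking depthwise-conv weights in MobileNet.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 1, 3, 3)) * np.array([0.01, 0.1, 1.0, 10.0])[:, None, None, None]

print("per-tensor error: ", quant_error(w, per_tensor_scale(w)))
print("per-channel error:", quant_error(w, per_channel_scales(w)[:, None, None, None]))
```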

Another issue is that models that use depthwise convolution, such as MobileNet, will see limited speedup over their floating-point versions for now, because TVM currently lacks schedules for depthwise convolution with NCHWc or NHWC data layouts (which prevents vectorization).

Currently the most interesting results will be with Inception and ResNet.


#12

https://github.com/intel/optimized-models/tree/v1.0.6/mxnet/blog/medium_vnni

For MXNet on C5.18xlarge, you can try using our script to reproduce the results.

For MobileNet V2, some parts of the optimization are still work in progress and being upstreamed; we will let you know when it’s done.

The published blog post is below; you can cite it when using our data or script.