Benchmarking Quantization on Intel CPU


#1

Background: There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also recently gained the ability to quantize weights (https://github.com/dmlc/tvm/pull/2116).

I am currently working on a systematic benchmark for existing frameworks for (post-training) quantization. A rigorous benchmark will help machine learning practitioners make informed decisions. Any suggestions are welcome.

Frameworks:

Models: for now, only Image Classification.

  • Inception V3
  • ResNet-50 V1
  • ResNet-152 V1
  • DenseNet201
  • MobileNet V2_1.0

Metrics to measure:

  • Top-1 / Top-5 Accuracy on ImageNet Validation Set
  • Inference time per image
  • Model size in memory (MB)
  • Model size as a serialized file (MB)
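
As a rough illustration of the first metric, top-k accuracy can be computed from per-image class scores as sketched below. This is plain Python for illustration only; the function name `top_k_accuracy` and the toy scores are made up and are not taken from the benchmark code.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this image.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy example: 3 images, 4 classes (hypothetical scores).
scores = [
    [0.1, 0.6, 0.2, 0.1],   # highest score: class 1
    [0.5, 0.1, 0.3, 0.1],   # highest score: class 0
    [0.3, 0.1, 0.5, 0.1],   # highest score: class 2
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

For ImageNet the same computation would be applied with k=1 and k=5 over the 50,000 validation images.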

Benchmark environment: AWS EC2 C5d.18xlarge (72 vCPUs)

  • To account for variance due to virtualization and shared hardware, I will run multiple trials, each on newly launched instance(s), and use statistical tests to check whether an apparent difference in a metric is significant or due to chance.
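
One possible way to run such a significance check is a two-sided permutation test on per-trial latencies. The sketch below is only an illustration under that assumption, not the test actually chosen for this benchmark; the function name and the latency numbers are made up.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the estimated p-value: the fraction of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # Add-one correction so the estimate is never exactly zero.
    return (count + 1) / (n_resamples + 1)


# Hypothetical per-trial latencies in milliseconds (not real measurements).
fp32_ms = [10.2, 10.5, 10.1, 10.4, 10.3, 10.6]
int8_ms = [8.1, 8.4, 8.0, 8.3, 8.2, 8.5]

p_value = permutation_test(fp32_ms, int8_ms, n_resamples=2000)
```

A small p-value suggests the latency gap is unlikely to come from trial-to-trial noise alone; a permutation test is used here because it makes no assumption about the latency distribution.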

To ensure that we are comparing apples to apples, benchmark code (with links to models used) will be published as a GitHub repository.


#2

For consistency, we will use the simplest quantization method: take the min and max values of each layer’s outputs as the quantization thresholds.
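
The min/max scheme can be sketched roughly as follows. This is a plain-Python illustration of 8-bit quantization that uses the observed min and max as the range; it is not TVM’s implementation, and all names and values here are hypothetical.

```python
def minmax_quantize(values, num_bits=8):
    """Quantize floats to unsigned integers using the observed min/max as the range."""
    lo, hi = min(values), max(values)
    qmax = 2 ** num_bits - 1
    # Guard against a constant tensor, where hi == lo.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    quantized = [round((v - lo) / scale) for v in values]
    return quantized, scale, lo


def dequantize(quantized, scale, lo):
    """Map the integers back to approximate float values."""
    return [q * scale + lo for q in quantized]


# Hypothetical layer outputs.
outputs = [-1.0, -0.25, 0.0, 0.5, 3.0]
q, scale, lo = minmax_quantize(outputs)
restored = dequantize(q, scale, lo)
# The round-trip error of each value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(outputs, restored))
```

Because the range is set by the extreme values, a single outlier widens the step size for every other value, which is the main weakness of this method.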


#3

@hcho3.
Hi,
I’m facing an issue with a quantized TensorFlow model:

tvm.error.OpNotImplemented: The following operators are not supported in frontend TensorFlow: ‘Dequantize’

This happens while running the TensorFlow model through the NNVM frontend.
Is there support for post-training quantized TensorFlow models in NNVM?


#4

@Vinayak618 No, currently TVM does not support reading from quantized TensorFlow models. Here is a proposal for supporting quantized TF-Lite models: https://github.com/dmlc/tvm/issues/2351. For now, you’ll have to load the original (non-quantized) model into TVM and use TVM’s quantization tools to perform quantization.


#5

@hcho3.
Thank you for the response.
Can you point me to where I can find TVM’s quantization tools, so that I can apply them to a TensorFlow model?
I didn’t find them in the link above.

Also, one question unrelated to the issue above:
Does opt_level in the TensorFlow NNVM frontend have any effect beyond opt_level 3?
I still get results even at opt_level 10.

Thank you


#6

@Vinayak618 I’m trying to figure out TVM’s quantization pass myself, so I won’t be able to guide you right now. I will put up the benchmark code when it’s done, and you can look at it then. For now, you should look at the pull request https://github.com/dmlc/tvm/pull/2116.

And I think opt_level goes only up to 3.

P.S. If you have other questions, please open a new thread. Let’s keep this thread for discussing the benchmark proposal.


#7

Thank you @hcho3.
Sure, I’ll keep this thread for discussing the benchmark proposal.
Once you are done with the benchmark code, please post it in this thread. In the meantime, I’ll go through the pull request.

Thank you


#8

@Vinayak618 I just found this script: https://gist.github.com/ZihengJiang/bcabe46a712a417a01a6967d4430b6b5, an example that feeds an MXNet model into TVM and runs quantization.


#9

Thank you so much @hcho3.
I’ll try it and share the observations.


#10

Hi @hcho3,
Thank you for referencing the code.
It works great. I’m able to quantize the model and get the results.

Thank you once again.


#11

@hcho3 thanks for pushing on this

We are currently working on some enhancements to quantization on the TVM side, as some models (DenseNet, MobileNet) need per-channel quantization scale adjustment to avoid catastrophic accuracy loss.
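
To illustrate why a single scale can hurt such models, the sketch below compares a per-tensor scale with per-channel scales on made-up weights whose channels have very different ranges. It is a plain-Python illustration of the general idea, not TVM’s implementation; the function names and weight values are hypothetical.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the signed integer range."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return max(abs(v) for v in values) / qmax


def roundtrip_error(values, scale):
    """Largest absolute error after quantizing and dequantizing with `scale`."""
    return max(abs(v - round(v / scale) * scale) for v in values)


# Hypothetical weights: two output channels with very different ranges.
channels = [
    [0.010, -0.020, 0.015],   # small-magnitude channel
    [5.0, -12.7, 8.0],        # large-magnitude channel
]

# Per-tensor: one scale shared by every channel.
shared = symmetric_scale([v for ch in channels for v in ch])
per_tensor_err = [roundtrip_error(ch, shared) for ch in channels]

# Per-channel: each channel gets its own scale.
per_channel_err = [roundtrip_error(ch, symmetric_scale(ch)) for ch in channels]
```

With the shared scale, every weight in the small-magnitude channel rounds to zero, so that channel’s information is lost entirely; with its own scale, the same channel keeps a round-trip error far below its weight magnitudes.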

Another issue is that models that use depthwise convolution, such as MobileNet, currently see limited speedup over their floating-point versions, because TVM lacks schedules for depthwise convolution with NCHWc or NHWC data layouts, which prevents vectorization.

Currently the most interesting results will be with Inception and ResNet.


#12

https://github.com/intel/optimized-models/tree/v1.0.6/mxnet/blog/medium_vnni

For MXNet on C5.18xlarge, you can try to use our script to reproduce the result.

For MobileNet V2, some of the optimizations are still being upstreamed; I will let you know when that is done.

The published blog post is below; you can cite it when using our data or script.


#13

Unfortunately, I’ve had other priorities come up at work. I will come back to this at a later time.


#14

Hi Eqy,
Is TVM able to merge dequantization/quantization between two consecutive layers?
Thank you


#15

If two consecutive layers are quantized, there is no dequantization-requantization between them (you can check the graph after the realize pass to verify this).


#16

Hi,

Have you run into the error below?

AttributeError: module ‘tvm.relay’ has no attribute ‘optimize’.

It occurs at line 98 of this file: evaluate.py

Thanks


#17

The namespacing was tweaked in a recent patch. You can find optimize in quantize.py now: https://github.com/dmlc/tvm/blob/master/python/tvm/relay/quantize/quantize.py


#18

Thanks for your reply.

I can run my program now, but it looks like the quantized model is slower than the non-quantized one.

My platform:

  • CentOS 7.2
  • CPU: Intel Xeon E5-2650 v4
  • TVM: git hash 30f757eda1016
  • target in my program: llvm -mcpu=core-avx2
  • LLVM version: 7.0.1
  • MKL Version: 2019.0 Update 5
  • Build config: set(USE_BLAS mkl)

Another thing: should we update the TVM docs?


#19

Could you please provide the arguments that you used to run the evaluate.py script? I am also interested in testing that script.

Also, besides that “optimize” error, did you change anything else in the original script?

Thanks


#20

Yes, I just changed the target from llvm to llvm -mcpu=core-avx2, and used module.time_evaluator("run", ctx, 100) to time the latency. Below is my Python code:

[Image: WeChat screenshot 20190612221115]

The batch size is 1, and the first 5 batches are used to warm up the graph. I wonder whether I can add a synchronization call before time_evaluator(), like mx.nd.waitall() in MXNet?
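
The measurement procedure described here (a few untimed warm-up runs followed by repeated timed runs) can be sketched generically as below. This is plain Python using `time.perf_counter`, not TVM’s timing API; `benchmark` and `fake_inference` are hypothetical names, and `fake_inference` merely stands in for one synchronous model call.

```python
import time


def benchmark(fn, warmup=5, repeat=100):
    """Return the mean wall-clock time per call of `fn`, in milliseconds."""
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / repeat * 1000.0


calls = []


def fake_inference():
    """Hypothetical stand-in for one synchronous inference on a batch of size 1."""
    calls.append(sum(i * i for i in range(1000)))


mean_ms = benchmark(fake_inference, warmup=5, repeat=100)
```

If the call being timed is asynchronous, the loop has to block until the results are ready before reading the clock; otherwise only the time to enqueue the work is measured.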