Benchmarking Quantization on Intel CPU

hcho3 · April 9, 2019, 7:12pm

Background: There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also recently gained the ability to quantize weights (https://github.com/dmlc/tvm/pull/2116).

I am currently working on a systematic benchmark for existing frameworks for (post-training) quantization. A rigorous benchmark will help machine learning practitioners make informed decisions. Any suggestions are welcome.

Frameworks:

TVM
MXNet: quantization example
TensorFlow Lite: quantization tutorial

Models: for now, only Image Classification.

Inception V3
ResNet-50 V1
ResNet-152 V1
DenseNet201
MobileNet V2_1.0

Metrics to measure:

Top-1 / Top-5 Accuracy on ImageNet Validation Set
Inference time per image
Model size in memory (MB)
Model size as a serialized file (MB)
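
As a rough illustration of the first metric, here is a minimal sketch of how Top-1 / Top-5 style accuracy could be computed from per-class scores. The function name `topk_accuracy` and the toy scores are hypothetical and only for illustration; the real benchmark would feed in model outputs for the ImageNet validation set.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy data: 3 samples, 4 classes (hypothetical scores, not real model output).
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.1, 0.4],
]
labels = [1, 2, 0]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```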

Benchmark environment: AWS EC2 C5d.18xlarge (72 vCPUs)

To account for variance due to virtualization and shared hardware, I will perform multiple trials by launching new instance(s). I will perform statistical tests to verify if an apparent difference in a metric is significant or due to chance.
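
The specific statistical test is not decided yet. As one possible option, the sketch below uses a two-sided permutation test on the difference of mean latencies between two sets of trials. The function name `permutation_test` and the latency numbers are made up for illustration and are not benchmark results.

```python
import random
import statistics


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            count += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (count + 1) / (n_resamples + 1)


# Hypothetical per-trial latencies in milliseconds (illustrative only).
fp32_ms = [10.2, 10.5, 10.1, 10.4, 10.3]
int8_ms = [8.1, 8.3, 8.0, 8.4, 8.2]

p_value = permutation_test(fp32_ms, int8_ms)
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which may be convenient when only a handful of instances are launched per configuration.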

To ensure that we are comparing apples to apples, benchmark code (with links to models used) will be published as a GitHub repository.

hcho3 · April 9, 2019, 7:27am

For consistency, we will use the simplest quantization method: take the min and max values of layer outputs as the thresholds for quantization.
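
To make the idea concrete, here is a minimal sketch of min/max thresholding in plain Python. It assumes symmetric signed 8-bit quantization with the threshold set to max(|min|, |max|); that choice, and the names `minmax_quantize` and `dequantize`, are my assumptions for illustration and not necessarily what any of the frameworks do internally.

```python
def minmax_quantize(values, num_bits=8):
    """Quantize floats to signed integers using a min/max-derived threshold."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    # Assumption: symmetric range, threshold = largest absolute value seen.
    threshold = max(abs(min(values)), abs(max(values)))
    scale = threshold / qmax if threshold > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


outputs = [-1.0, -0.25, 0.0, 0.5, 2.0]  # hypothetical layer outputs
q, scale = minmax_quantize(outputs)
restored = dequantize(q, scale)
print(q, scale)
```

The largest-magnitude value maps to the edge of the integer range, and every other value is recovered to within half a quantization step.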

Vinayak618 · April 9, 2019, 9:11am

Hi @hcho3,
I’m facing an issue with a quantized TensorFlow model:

tvm.error.OpNotImplemented: The following operators are not supported in frontend TensorFlow: ‘Dequantize’

This happens while loading the TensorFlow model through the NNVM frontend.
Is there support for post-training quantized TensorFlow models in NNVM?

hcho3 · April 9, 2019, 9:25am

@Vinayak618 No, currently TVM does not support reading from quantized TensorFlow models. Here is a proposal for supporting quantized TF-Lite models: https://github.com/dmlc/tvm/issues/2351. For now, you’ll have to load the original (non-quantized) model into TVM and use TVM’s quantization tools to perform quantization.

Vinayak618 · April 9, 2019, 10:10am

@hcho3.
Thank you for the response.
Can you point me to where I can find TVM’s quantization tools, so I can apply them to a TensorFlow model?
I didn’t find that in the link above.

Also, one question unrelated to the above issue:
Does opt_level in the TensorFlow NNVM frontend have any effect beyond opt_level 3?
I still get results even at opt_level 10.

Thank you

hcho3 · April 9, 2019, 10:18am

@Vinayak618 I’m trying to figure out TVM’s quantization pass myself, so I won’t be able to guide you right now. I will put up the benchmark code when it’s done, and you can look at it then. For now, you should look at the pull request https://github.com/dmlc/tvm/pull/2116.

And I think opt_level goes only up to 3.

Ps. If you have other questions, please open a new thread. Let’s keep this thread for discussing the benchmark proposal.

Vinayak618 · April 9, 2019, 10:20am

Thank you @hcho3.
Yes, sure, I will keep this thread for discussing the benchmark proposal.
Once you are done with the benchmark code, please post it in this thread. In the meantime I’ll go through the pull request.

Thank you

hcho3 · April 9, 2019, 10:35am

@Vinayak618 I just found this script: https://gist.github.com/ZihengJiang/bcabe46a712a417a01a6967d4430b6b5, an example that feeds an MXNet model into TVM and runs quantization.

Vinayak618 · April 9, 2019, 11:41am

Thank you so much @hcho3.
I’ll try it and share the observations.

Vinayak618 · April 11, 2019, 5:56am

Hi. @hcho3.
Thank you for referencing the code.
It works great. I’m able to quantize the model and get the results.

Thank you once again.

eqy · April 11, 2019, 6:29am

@hcho3 thanks for pushing on this

We are currently working on some enhancements to quantization on the TVM side, as some models (DenseNet, MobileNet) need per-channel quantization scale adjustment to avoid catastrophic accuracy loss.
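
As a toy illustration of why a single scale can hurt (this is not TVM's implementation, and the helper names and weights below are hypothetical), consider two channels with very different value ranges. With one per-tensor scale, the small-range channel collapses to zero, while a per-channel scale preserves it:

```python
QMAX = 127  # signed 8-bit range, assumed for this sketch


def scale_for(values):
    """Scale that maps the largest absolute value to QMAX."""
    threshold = max(abs(v) for v in values)
    return threshold / QMAX if threshold > 0 else 1.0


def fake_quantize(values, scale):
    """Quantize to integers, then map straight back to floats."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def max_abs_error(original, approx):
    return max(abs(o - a) for o, a in zip(original, approx))


# Hypothetical weights: channel 0 has a tiny range, channel 1 a large one.
channels = [
    [0.002, -0.004, 0.003],
    [1.5, -2.0, 0.8],
]

# One scale shared by the whole tensor.
tensor_scale = scale_for([v for ch in channels for v in ch])
per_tensor = [fake_quantize(ch, tensor_scale) for ch in channels]

# A separate scale for each channel.
per_channel = [fake_quantize(ch, scale_for(ch)) for ch in channels]

err_tensor = max_abs_error(channels[0], per_tensor[0])
err_channel = max_abs_error(channels[0], per_channel[0])
print(f"channel 0 error: per-tensor={err_tensor:.6f}, per-channel={err_channel:.6f}")
```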

Another issue is that models that use depthwise convolution, such as MobileNet, will currently see limited speedup vs. floating-point versions because TVM lacks schedules for depthwise convolution with NCHWc or NHWC data layouts (preventing vectorization).

Currently the most interesting results will be with Inception and ResNet.

pengzhao-intel · April 17, 2019, 12:36am

https://github.com/intel/optimized-models/tree/v1.0.6/mxnet/blog/medium_vnni

For MXNet on C5.18xlarge, you can try to use our script to reproduce the result.

For MobileNet V2, some of the optimizations are still being upstreamed; I will let you know when that is done.

The published blog post is below; please cite it when using our data or script.

hcho3 · May 1, 2019, 3:57am

Unfortunately, I’ve had other priorities come up at work. I will come back to this at a later time.

Qiu1981 · May 14, 2019, 9:27am

Hi Eqy,
Is TVM able to merge dequantization/quantization for two consecutive layers?
Thank you

eqy · May 14, 2019, 8:18pm

If two consecutive layers are quantized, there is no dequantization-requantization between them (you can check the graph after the realize pass to verify this).

TriLoon · June 11, 2019, 9:04am

Hi,

Have you encountered the error below?

AttributeError: module ‘tvm.relay’ has no attribute ‘optimize’.

It occurs at line 98 of this file: evaluate.py

Thanks

eqy · June 11, 2019, 6:58pm

The name spacing has been tweaked in a recent patch. You can find optimize in quantize.py now: https://github.com/dmlc/tvm/blob/master/python/tvm/relay/quantize/quantize.py

TriLoon · June 12, 2019, 2:28am

Thanks for your reply.

I can run my program now, but it looks like the quantized model is slower than the non-quantized one.

My platform:

Centos 7.2
CPU: Intel Xeon E5-2650 v4
TVM: git hash 30f757eda1016
target in my program: llvm -mcpu=core-avx2
LLVM version: 7.0.1
MKL Version: 2019.0 Update 5
Build config: set(USE_BLAS mkl)

Another thing: should we update the TVM docs?

tico · June 12, 2019, 12:47pm

Could you please provide the arguments that you used to run the evaluate.py script? I am also interested in testing that script.

Also besides that “optimize” error, did you change something else in the original script?

Thanks

TriLoon · June 12, 2019, 2:13pm

Yes, I just changed the target from llvm to llvm -mcpu=core-avx2, and used module.time_evaluator("run", ctx, 100) to measure the latency. Below is my Python code:

[Image: WeChat screenshot 20190612221115; the code in the screenshot was not preserved]

The batch size is 1, and the first 5 batches are used to warm up the graph. I wonder whether I can add a synchronization call before time_evaluator(), like mx.nd.waitall() in MXNet?
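
For reference, the warm-up-then-measure pattern described above can be sketched in a framework-independent way. This is only a generic illustration: `time_inference` and `dummy_run` are hypothetical names, and `run` is assumed to be a callable that blocks until the result is ready (with an asynchronous backend, the synchronization call would have to go inside `run`).

```python
import statistics
import time


def time_inference(run, warmup=5, repeats=100):
    """Call run() a few times untimed, then time each of `repeats` calls."""
    for _ in range(warmup):
        run()  # warm-up iterations are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def dummy_run():
    # Stand-in workload; replace with the real (blocking) inference call.
    sum(i * i for i in range(1000))


mean_s, std_s = time_inference(dummy_run, warmup=5, repeats=20)
print(f"mean: {mean_s * 1e3:.4f} ms, stdev: {std_s * 1e3:.4f} ms")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between the quantized and non-quantized model is larger than the run-to-run noise.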