[RFC] Search-based Automated Quantization

Two things I really like here are inserting SimQ on every edge and adding the Hardware abstraction. We could potentially reuse the Hardware abstraction for easy-to-define custom fusion passes too.

How does the quantization search strategy compare to the current quantization with the same config for resnet18_v1?

I would also be interested to see how an ML learner like XGBoost would compare vs greedy for training time & accuracy.

Kudos @ziheng for working through all the details and writing a very nice RFC.

I like many aspects that you discuss - Hardware abstraction that lets you choose dtypes for the hardware, the threshold estimation, and converting it into a learning problem. And thanks for raising the topic of debugging. It is extremely hard to debug a quantized network.

I have a few concerns/suggestions that might be worth discussing:

  • Currently, you only have scales and no zero points, so I assume you are still considering symmetric quantization. Given that you are going to do a big design change, it might be worth looking into asymmetric quantization as well.
  • I did not fully understand the threshold_rectify requirement. I understand that it is there because you have to get the same scales for the input tensors. But is it only for simulation? Or are you going to bring this requantize into the realize pass as well?

A major reason I have these questions is because we have a QNN dialect in TVM now, that supports both TFLite and MxNet quantized networks. It has both asymmetric and channel-wise quantization support. So, I believe that it is worth thinking if this is the right time to use QNN ops for realization. We can make threshold estimation to try all these different quantization techniques, and can rely on QNN for lowering. This will unify the efforts, create a single easy to understand design and avoid duplication. @vinx13 and I also had a quick chat about this integration.


  • Minor implementation suggestion - if we are not using KL-Divergence, can we modify the graph to add Relay min/max operators at each edge and then just collect all these outputs, instead of storing the whole tensors?
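To make the suggestion concrete, here is a plain NumPy sketch (the function names are illustrative, not from the RFC) of what per-edge statistics collection buys: each edge keeps only two running scalars across calibration batches instead of the full intermediate tensors.

```python
import numpy as np

def record_edge_stats(tensor, stats, edge_id):
    """Store only the per-edge running (min, max) scalars, not the tensor."""
    lo, hi = float(tensor.min()), float(tensor.max())
    if edge_id in stats:
        old_lo, old_hi = stats[edge_id]
        lo, hi = min(lo, old_lo), max(hi, old_hi)
    stats[edge_id] = (lo, hi)

# Simulated calibration run over two batches for a single edge.
stats = {}
for batch in [np.array([-1.5, 0.2, 3.0]), np.array([-0.1, 4.5, 2.0])]:
    record_edge_stats(batch, stats, edge_id=0)

print(stats[0])  # running (min, max) merged across batches: (-1.5, 4.5)
```

The same idea extends to any cheap per-edge statistic (mean, variance) as long as it can be merged across batches.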

As a side note.

One interesting technical discussion to have here is whether asymmetry is a technical debt introduced by the constraint of pre-quantized models. While it is important to support it in QNN to be compatible with existing pre-quantized models, perhaps we can get away with symmetric quantization (perhaps with channel-wise support) if we quantize from fp32 models.

The main reason for this belief is that asymmetry only saves 1 bit: if the min is smaller than 0 and the max is bigger, we can always represent an asymmetric scheme with a symmetric scheme by adding 1 bit. On the other hand, asymmetry brings a reasonable amount of overhead.
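The 1-bit argument can be checked numerically. A NumPy sketch (the bit widths and rounding choices are illustrative, not TVM code): an asymmetric uint8 scheme over [min, max] with a zero point is matched, in reconstruction error, by a symmetric signed 9-bit scheme with no zero point.

```python
import numpy as np

x = np.linspace(-0.5, 4.0, 101).astype('float32')  # asymmetric range

# Asymmetric uint8: scale + zero point cover [min, max] exactly.
lo, hi = float(x.min()), float(x.max())
a_scale = (hi - lo) / 255.0
zp = int(round(-lo / a_scale))
xq = np.clip(np.round(x / a_scale) + zp, 0, 255)
x_asym = (xq - zp) * a_scale

# Symmetric with one extra bit (signed 9-bit), no zero point: [-255, 255].
s_scale = max(abs(lo), abs(hi)) / 255.0
xs = np.clip(np.round(x / s_scale), -255, 255)
x_sym = xs * s_scale

# Both reconstruction errors are bounded by half a quantization step.
err_asym = float(np.abs(x - x_asym).max())
err_sym = float(np.abs(x - x_sym).max())
assert err_asym <= a_scale / 2 + 1e-6
assert err_sym <= s_scale / 2 + 1e-6
```

The symmetric scheme avoids the zero-point cross terms at inference time, which is where the overhead mentioned above comes from.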

@janimesh I’ve spent some time with our auto quantization and now I’m working on translating quantized pytorch models to relay via QNN. I really like your proposal for unifying some of the components in our quantization infra :slight_smile:


Yes, I think we need to discuss whether asymmetry has a large technical debt compared to the benefits we can get from it.

However, my overarching point is more about integrating QNN and Automatic Quantization as far as the realization part goes. We can unify and avoid duplication of effort. I mentioned it here because it seems like there is considerable implementation work involved, and we might want to take this point into consideration as well.

That sounds good.

It would be great to provide a description of the alternative solutions: what the current QNN realization strategy is, what AutoQ's strategy is, and a proposed unified one.

If there are indeed a lot of reusable components, then it makes sense to bring things together. It might also provide some insights into what new designs we want to add to enhance the dialect to support AutoQ.

Hey @adb, yes, reusing the Hardware abstraction is definitely one direction we would like to go! It seems that we don’t have many meaningful features to feed into XGBoost right now, but using it to build a latency predictor should be feasible, and we can try that in the future. I will put up more benchmarks when I finish most of the framework work!


@ziheng thanks for bringing this RFC on such an important topic for TVM.

I fully agree that some improvements were required in the current quantization approach in TVM. For example, one of my major problems was that I was never able to find the proper way to set local scales instead of a global scale, which of course is not the best option.

In general, the idea looks quite promising. At this point I have two comments:

  1. Hardware description: I was wondering what the concrete list of attributes of this description would be? One important attribute is the actual data types supported by the target hardware. For example, in some cases only INT8 is supported, so of course the quantization flow should be aware of this. Also, is this description going to be an external file or an internal TVM database, or should the user provide the HW attributes every time? I am asking because it would be practical to have this as an external file or in some sort of database in TVM for common platforms.

  2. Dataset: What would the format of the calibration dataset be in the API? This could be challenging, since the format of the dataset could vary depending on the model and framework, for example, with complex data generators in Keras.
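Purely as a strawman for point 1: a hardware description could be as simple as a per-op dtype constraint table. Everything below is hypothetical (plain Python dataclasses, made-up names), not the RFC's actual API, but it shows the kind of query the quantization flow would need to make.

```python
from dataclasses import dataclass, field

@dataclass
class OpConstraint:
    in_dtypes: list   # dtypes the op accepts on this hardware
    out_dtypes: list  # dtypes the op can produce

@dataclass
class HardwareDesc:
    name: str
    ops: dict = field(default_factory=dict)  # op name -> OpConstraint

    def supports(self, op, in_dtype, out_dtype):
        c = self.ops.get(op)
        return (c is not None
                and in_dtype in c.in_dtypes
                and out_dtype in c.out_dtypes)

# A hypothetical accelerator: int8 conv2d with int32 accumulation only.
accel = HardwareDesc('my_accel', ops={
    'nn.conv2d': OpConstraint(['int8'], ['int32']),
})
print(accel.supports('nn.conv2d', 'int8', 'int32'))   # True
print(accel.supports('nn.conv2d', 'int16', 'int32'))  # False
```

Such a table could live in an external file or a small in-tree database for common platforms, which is exactly the practicality question raised above.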

BTW, the debug support is a great idea!

Hey @janimesh, thanks for those suggestions!

  • The threshold_rectify pass will adjust the thresholds during simulation, and it will also have an effect on the real quantized graph, since we use the adjusted thresholds.
  • I think we will use KL-Distance in the end for accuracy. But yes, we can record some statistics like min/max/mean/var as a sketch of the intermediate results instead of the whole tensor.
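For reference, a simplified NumPy sketch of KL-based threshold search, in the spirit of the usual entropy-calibration recipe (not AutoQ's actual implementation; bin counts and the candidate grid are illustrative): pick the clipping threshold whose simulated int8 quantization best preserves the fp32 distribution.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def search_threshold(x, num_bins=128, num_candidates=32):
    """Return the clipping threshold minimizing KL distance between the
    fp32 histogram and the quantize-dequantize histogram."""
    absmax = float(np.abs(x).max())
    edges = np.linspace(-absmax, absmax, num_bins + 1)
    ref_hist, _ = np.histogram(x, bins=edges)
    best_t, best_kl = absmax, float('inf')
    for t in np.linspace(0.3 * absmax, absmax, num_candidates):
        scale = t / 127.0
        xq = np.clip(np.round(x / scale), -127, 127) * scale  # simulate int8
        q_hist, _ = np.histogram(xq, bins=edges)
        kl = kl_divergence(ref_hist, q_hist)
        if kl < best_kl:
            best_kl, best_t = kl, float(t)
    return best_t

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10000).astype('float32')
t = search_threshold(x)
```

Note that this only needs per-edge histograms, which fits the point above about recording statistics rather than whole tensors.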

Also, unifying the AutoQ and QNN is a good point! I will think about it! @masahi @janimesh


@ziheng Sounds good. I will be happy to help in the redesign. I plan to go through AutoQ next week to get a deeper understanding of current implementation, and will think about the integration as well.


BTW, here is some extra quantization info my team has needed at the topi/codegen stage. We’ve needed to hack this in so far, as per [Quantization] How to expose 'ndom_scale' 'nclip_min' & 'nclip_max' to TOPI or CodeGen.

  • Output tensor absolute value ranges (we see output values as accumulators)
  • Input tensor absolute value ranges
  • Weight tensor absolute value ranges
  • Weight scale factor

We haven’t worked this one out yet, since fusion is involved:

  • Bias tensor absolute value range, if bias is present for nn.dense or nn.conv2d
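To illustrate, the quantities listed above can all be derived from the fp32 tensors during calibration. This toy NumPy example (the dict keys are made up, not an existing TVM interface) computes them for a small dense layer, treating the output as the accumulator as noted above; the weight scale assumes symmetric int8, mapping max |w| to 127.

```python
import numpy as np

def abs_range(t):
    """Absolute value range of a tensor (the info codegen would need)."""
    return float(np.abs(t).max())

# Toy fp32 dense layer: y = x @ w.T + b
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 16)).astype('float32')   # input activations
w = rng.normal(size=(8, 16)).astype('float32')   # weights
b = rng.normal(size=(8,)).astype('float32')      # bias
y = x @ w.T + b                                  # output (accumulator view)

info = {
    'input_abs_range':  abs_range(x),
    'weight_abs_range': abs_range(w),
    'bias_abs_range':   abs_range(b),
    'output_abs_range': abs_range(y),
    # symmetric int8 weight scale: map max |w| onto 127
    'weight_scale': abs_range(w) / 127.0,
}
```

The open question in the thread is exactly how to thread a record like this through to TOPI/codegen once fusion has merged the ops.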

Hey @tico, thanks for your suggestion!

  1. Yes, currently the hardware description can specify the data types that it supports, and users can use the Python API to declare it. We can definitely also support an external file in the future!

  2. Currently I just use a list of dicts for the calibration dataset: [{'data': data_arr, 'label': label_arr}, ...]. Yes, we should make a decision about this. Since the calibration dataset should be small, we do not need an overly complicated design.
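For reference, building the list-of-dicts format described above looks like this in NumPy (the shapes and dataset size are just an example, not a requirement):

```python
import numpy as np

# A small calibration set in the list-of-dicts format described above.
calib_dataset = [
    {'data': np.random.randn(1, 3, 224, 224).astype('float32'),
     'label': np.array([i % 10])}
    for i in range(8)
]
```

Since each entry is just named NumPy arrays, adapting a framework-specific data generator (e.g. from Keras) reduces to draining a few batches into this list.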


Hi @ziheng,

How would quantization proceed in this framework when fusion differs between the CPU and some hypothetical accelerator? Right now, I believe the default fusion rules are used for both the CPU and the device. Normally, I think we would want the CPU to follow the same fusion rules as the accelerator before minimizing KLD?

Hey @adb,

I am considering extending the hardware description and applying it to the fusion pass too, so that we can have different fusion strategies according to the hardware. Currently, since I have marked the SimQ operator as Opaque, fusion will not happen during simulation.

Should the quantization process happen before or after the high-level graph optimizations that are not related to the backend device (e.g., conv+bn+relu fusion)?

Before them, so that the quantize operators can also be fused.

Thanks! :coffee: :coffee:

I want to try to reproduce the quantization process, but I don’t see the relevant implementation in the latest version of the TVM source code, and I can’t find the hago library. Is there any relevant code reference? Thanks :grinning:


Hi @Peng, you can try it here: https://github.com/ZihengJiang/tvm-hago


Hi @ziheng

Any plans to integrate this into TVM?