[RFC] Search-based Automated Quantization

Two things I really like here are inserting SimQ on every edge and adding the Hardware abstraction. We could potentially reuse the Hardware abstraction for easy-to-define custom fusion passes too.

How does the quantization search strategy compare to the current quantization with the same config for resnet18_v1?

I would also be interested to see how an ML learner like XGBoost would compare vs greedy for training time & accuracy.

Kudos @ziheng for working through all the details and writing a very nice RFC.

I like many aspects that you discuss - Hardware abstraction that lets you choose dtypes for the hardware, the threshold estimation, and converting it into a learning problem. And thanks for raising the topic of debugging. It is extremely hard to debug a quantized network.

I have a few concerns/suggestions that might be worth discussing:

  • Currently, you only have scales and no zero points, so I assume you are still considering symmetric quantization. Given that you are going to do a big design change, it might be worth looking into asymmetric quantization as well.
  • I did not fully understand the threshold_rectify requirement. I understand that it is there because you have to get the same scales for the input tensors. But is it only for simulation? Or are you going to bring this requantize into the realize pass as well?

A major reason I have these questions is because we have a QNN dialect in TVM now, that supports both TFLite and MxNet quantized networks. It has both asymmetric and channel-wise quantization support. So, I believe that it is worth thinking if this is the right time to use QNN ops for realization. We can make threshold estimation to try all these different quantization techniques, and can rely on QNN for lowering. This will unify the efforts, create a single easy to understand design and avoid duplication. @vinx13 and I also had a quick chat about this integration.


  • Minor implementation suggestion - if we are not using KL-Divergence, can we modify the graph to add Relay min/max operators at each edge and then just collect all these outputs, instead of storing the whole tensors?
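To make the suggestion concrete, here is a plain NumPy sketch (the function names are illustrative, not from the RFC) of what per-edge statistics collection buys: each edge keeps only two running scalars across calibration batches instead of the full intermediate tensors.

```python
import numpy as np

def record_edge_stats(tensor, stats, edge_id):
    """Store only the per-edge running (min, max) scalars, not the tensor."""
    lo, hi = float(tensor.min()), float(tensor.max())
    if edge_id in stats:
        old_lo, old_hi = stats[edge_id]
        lo, hi = min(lo, old_lo), max(hi, old_hi)
    stats[edge_id] = (lo, hi)

# Simulated calibration run over two batches for a single edge.
stats = {}
for batch in [np.array([-1.5, 0.2, 3.0]), np.array([-0.1, 4.5, 2.0])]:
    record_edge_stats(batch, stats, edge_id=0)

print(stats[0])  # running (min, max) merged across batches: (-1.5, 4.5)
```

The same idea extends to any cheap per-edge statistic (mean, variance) as long as it can be merged across batches.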

As a side note.

One interesting technical discussion to have here is whether asymmetry is a technical debt introduced by the constraint of pre-quantized models. While it is important to support it in QNN to be compatible with existing pre-quantized models, perhaps we can get away with symmetric quantization (perhaps with channel-wise support) if we quantize from fp32 models.

The main reason for this belief is that asymmetry only saves 1 bit: if the min is smaller than 0 and the max is bigger, we can always represent an asymmetric scheme with a symmetric scheme by adding 1 bit. On the other hand, asymmetry brings a reasonable amount of overhead.
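The 1-bit argument can be checked numerically. A NumPy sketch (the bit widths and rounding choices are illustrative, not TVM code): an asymmetric uint8 scheme over [min, max] with a zero point is matched, in reconstruction error, by a symmetric signed 9-bit scheme with no zero point.

```python
import numpy as np

x = np.linspace(-0.5, 4.0, 101).astype('float32')  # asymmetric range

# Asymmetric uint8: scale + zero point cover [min, max] exactly.
lo, hi = float(x.min()), float(x.max())
a_scale = (hi - lo) / 255.0
zp = int(round(-lo / a_scale))
xq = np.clip(np.round(x / a_scale) + zp, 0, 255)
x_asym = (xq - zp) * a_scale

# Symmetric with one extra bit (signed 9-bit), no zero point: [-255, 255].
s_scale = max(abs(lo), abs(hi)) / 255.0
xs = np.clip(np.round(x / s_scale), -255, 255)
x_sym = xs * s_scale

# Both reconstruction errors are bounded by half a quantization step.
err_asym = float(np.abs(x - x_asym).max())
err_sym = float(np.abs(x - x_sym).max())
assert err_asym <= a_scale / 2 + 1e-6
assert err_sym <= s_scale / 2 + 1e-6
```

The symmetric scheme avoids the zero-point cross terms at inference time, which is where the overhead mentioned above comes from.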

@janimesh I’ve spent some time with our auto quantization and now I’m working on translating quantized pytorch models to relay via QNN. I really like your proposal for unifying some of the components in our quantization infra :slight_smile:


Yes, I think we need to discuss whether asymmetry has a large technical debt compared to the benefits we can get from it.

However, my overarching point is more about integrating QNN and Automatic Quantization as far as the realization part goes. We can unify and avoid duplication of effort. I mentioned it here because it seems like there is considerable implementation work involved, and we might want to take this point into consideration as well.

That sounds good.

It would be great to provide a description of the alternative solutions: what the current QNN realization strategy is, what AutoQ's strategy is, and a proposed unified one.

If there are indeed a lot of reusable components, then it makes sense to bring things together. It might also provide some insights into what new designs we want to add to enhance the dialect to support AutoQ.

Hey @adb, yes, reusing the Hardware abstraction is definitely one direction we would like to go! It seems that we don’t have many meaningful features to feed into XGBoost right now, but using it to build a latency predictor should be feasible, and we can try that in the future. I will put up more benchmarks when I finish most of the framework work!


@ziheng thanks for bringing this RFC on such an important topic for TVM.

I fully agree that some improvements were required in the current quantization approach in TVM. For example, one of my major problems was that I was never able to find the proper way to set local scales instead of a global scale, which of course is not the best option.

In general, the idea looks quite promising. At this point I have two comments:

  1. Hardware description: I was wondering what the concrete list of attributes of this description would be? One important attribute is the actual data types supported by the target hardware. For example, in some cases only INT8 is supported, so of course the quantization flow should be aware of this. Also, is this description going to be an external file or an internal TVM database, or should the user provide the HW attributes every time? I am asking because it would be practical to have this as an external file or in some sort of database in TVM for common platforms.

  2. Dataset: What would the format of the calibration dataset be in the API? This could be challenging, since the format of the dataset could vary depending on the model and framework, for example, with complex data generators in Keras.
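Purely as a strawman for point 1: a hardware description could be as simple as a per-op dtype constraint table. Everything below is hypothetical (plain Python dataclasses, made-up names), not the RFC's actual API, but it shows the kind of query the quantization flow would need to make.

```python
from dataclasses import dataclass, field

@dataclass
class OpConstraint:
    in_dtypes: list   # dtypes the op accepts on this hardware
    out_dtypes: list  # dtypes the op can produce

@dataclass
class HardwareDesc:
    name: str
    ops: dict = field(default_factory=dict)  # op name -> OpConstraint

    def supports(self, op, in_dtype, out_dtype):
        c = self.ops.get(op)
        return (c is not None
                and in_dtype in c.in_dtypes
                and out_dtype in c.out_dtypes)

# A hypothetical accelerator: int8 conv2d with int32 accumulation only.
accel = HardwareDesc('my_accel', ops={
    'nn.conv2d': OpConstraint(['int8'], ['int32']),
})
print(accel.supports('nn.conv2d', 'int8', 'int32'))   # True
print(accel.supports('nn.conv2d', 'int16', 'int32'))  # False
```

Such a table could live in an external file or a small in-tree database for common platforms, which is exactly the practicality question raised above.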

BTW, the debug support is a great idea!

Hey @janimesh, thanks for those suggestions!

  • The threshold_rectify pass will adjust the thresholds during simulation, and it will also have an effect on the real quantized graph, since we use the adjusted thresholds.
  • I think we will use KL-Distance in the end for accuracy. But yes, we can record some statistics like min/max/mean/var as a sketch of the intermediate results instead of the whole tensor.
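For reference, a simplified NumPy sketch of KL-based threshold search, in the spirit of the usual entropy-calibration recipe (not AutoQ's actual implementation; bin counts and the candidate grid are illustrative): pick the clipping threshold whose simulated int8 quantization best preserves the fp32 distribution.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def search_threshold(x, num_bins=128, num_candidates=32):
    """Return the clipping threshold minimizing KL distance between the
    fp32 histogram and the quantize-dequantize histogram."""
    absmax = float(np.abs(x).max())
    edges = np.linspace(-absmax, absmax, num_bins + 1)
    ref_hist, _ = np.histogram(x, bins=edges)
    best_t, best_kl = absmax, float('inf')
    for t in np.linspace(0.3 * absmax, absmax, num_candidates):
        scale = t / 127.0
        xq = np.clip(np.round(x / scale), -127, 127) * scale  # simulate int8
        q_hist, _ = np.histogram(xq, bins=edges)
        kl = kl_divergence(ref_hist, q_hist)
        if kl < best_kl:
            best_kl, best_t = kl, float(t)
    return best_t

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10000).astype('float32')
t = search_threshold(x)
```

Note that this only needs per-edge histograms, which fits the point above about recording statistics rather than whole tensors.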

Also, unifying the AutoQ and QNN is a good point! I will think about it! @masahi @janimesh


@ziheng Sounds good. I will be happy to help in the redesign. I plan to go through AutoQ next week to get a deeper understanding of current implementation, and will think about the integration as well.


BTW, here is some extra quantization info my team has needed at the topi/codegen stage. We’ve needed to hack this in so far, as per [Quantization] How to expose 'ndom_scale' 'nclip_min' & 'nclip_max' to TOPI or CodeGen.

  • Output tensor absolute value ranges (we see output values as accumulators)
  • Input tensor absolute value ranges
  • Weight tensor absolute value ranges
  • Weight scale factor

We haven’t worked this one out yet, since fusion is involved:

  • Bias tensor absolute value range, if bias is present for nn.dense or nn.conv2d
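To illustrate, the quantities listed above can all be derived from the fp32 tensors during calibration. This toy NumPy example (the dict keys are made up, not an existing TVM interface) computes them for a small dense layer, treating the output as the accumulator as noted above; the weight scale assumes symmetric int8, mapping max |w| to 127.

```python
import numpy as np

def abs_range(t):
    """Absolute value range of a tensor (the info codegen would need)."""
    return float(np.abs(t).max())

# Toy fp32 dense layer: y = x @ w.T + b
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 16)).astype('float32')   # input activations
w = rng.normal(size=(8, 16)).astype('float32')   # weights
b = rng.normal(size=(8,)).astype('float32')      # bias
y = x @ w.T + b                                  # output (accumulator view)

info = {
    'input_abs_range':  abs_range(x),
    'weight_abs_range': abs_range(w),
    'bias_abs_range':   abs_range(b),
    'output_abs_range': abs_range(y),
    # symmetric int8 weight scale: map max |w| onto 127
    'weight_scale': abs_range(w) / 127.0,
}
```

The open question in the thread is exactly how to thread a record like this through to TOPI/codegen once fusion has merged the ops.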

Hey @tico, thanks for your suggestion!

  1. Yes, currently the hardware description can specify the data types that it supports, and users can use the Python API to declare it. We can definitely also support an external file in the future!

  2. Currently I just use a list of dicts for the calibration dataset: [{'data': data_arr, 'label': label_arr}, ...]. Yes, we should make a decision about this. Since the calibration dataset should be small, we do not need an overly complicated design.
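For reference, building the list-of-dicts format described above looks like this in NumPy (the shapes and dataset size are just an example, not a requirement):

```python
import numpy as np

# A small calibration set in the list-of-dicts format described above.
calib_dataset = [
    {'data': np.random.randn(1, 3, 224, 224).astype('float32'),
     'label': np.array([i % 10])}
    for i in range(8)
]
```

Since each entry is just named NumPy arrays, adapting a framework-specific data generator (e.g. from Keras) reduces to draining a few batches into this list.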


Hi @ziheng,

How would quantization proceed in this framework when fusion differs between the CPU and some hypothetical accelerator? Right now, I believe the default fusion rules are used for both the CPU and the device. Normally, I think we would want the CPU to follow the same fusion rules as the accelerator before minimizing KLD?

Hey @adb,

I am considering extending the hardware description and applying it to the fusion pass too, so that we can have different fusion strategies according to the hardware. Currently, since I have marked the SimQ operator as Opaque, fusion will not happen during simulation.

Should the quantization process happen before or after the high-level graph optimizations that are not related to the backend device (e.g., conv+bn+relu fusion)?

Before them, so that the quantize operators can also be fused.

Thanks! :coffee: :coffee:

I want to try to reproduce the quantization process, but I don’t see the relevant implementation in the latest version of the TVM source code, and I can’t find the hago library. Is there any relevant code reference? Thanks :grinning:


Hi @Peng, you can try it here: https://github.com/ZihengJiang/tvm-hago


Hi @ziheng

Any plans to integrate this into TVM?