Current state of quantization effort

I am still trying to grasp the structure of the TVM code base, so let me summarize the current state of quantization in TVM to check whether I have it right:

  1. There is an existing way of quantizing an FP32 graph, say from TensorFlow, through an existing API (relay.quantize.qconfig and relay.quantize.quantize); a minimal sketch of this API follows the list. I tried this path for MobileNet with several data types (int8, int16, int32), and the inference results are not even close to the ones I get with the non-quantized model. Performance is also worse, but I assume that could be addressed with AutoTVM graph optimizations.

  2. There is also an effort underway to support importing a quantized TFLite model into TVM. It seems that MXNet will also be supported, but I am not sure whether that effort is underway. In both cases the QNN dialect is used to transform the input graph into a suitable Relay input.
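
For concreteness, this is roughly how I invoked the first path. This is a minimal sketch only, assuming the model was imported through relay.frontend.from_tensorflow; the qconfig option values shown here are just examples and the available options vary between TVM versions:

```python
import tvm
from tvm import relay

# Assumed frontend import; the input name and shape below are placeholders.
# mod, params = relay.frontend.from_tensorflow(graph_def,
#                                              shape={"input": (1, 224, 224, 3)})

def quantize_fp32_module(mod, params):
    # qconfig controls bit widths, target dtypes, calibration, etc.
    # The option names/values here are examples, not a recommendation.
    with relay.quantize.qconfig(nbit_input=8,
                                nbit_weight=8,
                                dtype_input="int8",
                                dtype_weight="int8",
                                dtype_activation="int32",
                                global_scale=8.0):
        return relay.quantize.quantize(mod, params)
```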

In both cases the input graph is either quantized (case 1) or transformed (case 2) to generate a Relay int8 graph. My question is about the format of this Relay int8 graph. Assuming I have a quantized TensorFlow model, I was thinking of converting the graph into the format that Relay needs, but it seems that the QNN dialect was built for that purpose (see PR #3900). From that discussion, though, it seems the operations were built to support TFLite. Is this the case? Is someone else working on importing quantized TensorFlow models? If so, let me know so that I can join that effort.
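
To make the question about the graph format concrete: my understanding is that in the QNN dialect a quantized tensor is carried as an integer tensor plus scale/zero-point parameters, roughly as sketched below. The values are placeholders and the exact argument conventions have changed between TVM versions:

```python
from tvm import relay

# Placeholder quantization parameters, for illustration only.
data = relay.var("data", shape=(1, 224, 224, 3), dtype="uint8")
scale = relay.const(0.0078125, "float32")
zero_point = relay.const(128, "int32")

# qnn.dequantize recovers an approximate float32 tensor:
#   float_value ~= scale * (int_value - zero_point)
deq = relay.qnn.op.dequantize(data,
                              input_scale=scale,
                              input_zero_point=zero_point)

# qnn.quantize maps a float32 tensor back into the integer domain.
q = relay.qnn.op.quantize(deq,
                          output_scale=scale,
                          output_zero_point=zero_point,
                          out_dtype="uint8")
```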

Thanks!

A quantized TF model has complex logic that needs to be handled, including special ops like FakeQuant. I think we could support it in the future, but currently TFLite handles this for us and we only need to parse the quantized TFLite model. TF, TOCO, TFLite is one complete path for supporting TF quantization-aware training.

I see your point; it makes sense to have a complete path first. Thanks for the quick response!

Maybe this could help from a high-level understanding viewpoint: Quantization Story

Thanks! I had seen that discussion but had not realized that the effort was to focus on TFLite first.

Basically, there exist two ways to do quantization in TVM. One way is to transform a graph that has already been quantized by another framework into Relay format, and the other is to explore our own quantization approach using TVM's capabilities. I have done some work on the second way, see: https://github.com/dmlc/tvm/pull/3828 . But I don't have enough time recently to continue working on it. It would be great if anyone could pick it up.

Sorry for the late reply; I have been busy with work. Ideally I would follow the first path you mentioned, because we already have a quantization framework in place, but let me start looking in detail at how quantization works in TVM to see how I can contribute.

Hi @FrozenGene,

I was wondering whether all the PRs related to supporting pre-quantized TFLite models have been merged, so that this flow is now fully functional?

I saw that recently the following PR was merged:

Yeah, you can run it now. But be careful about one thing: although we have the same accuracy as TFLite, we cannot compare the results with TFLite element-wise, as discussed in this PR's comments (https://github.com/dmlc/tvm/pull/3900). Personally I think we should handle it as my comment there says. I wish I could help finish it in the near future.
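
Roughly, the flow for running a pre-quantized TFLite model looks like the sketch below. The file name, input tensor name, shape, and dtype are placeholders, and the build API differs slightly across TVM versions (older releases used relay.build_config instead of PassContext):

```python
import tvm
from tvm import relay

# Load the flatbuffer of a quantized .tflite file (file name is a placeholder).
with open("mobilenet_quant.tflite", "rb") as f:
    tflite_buf = f.read()

try:
    import tflite
    tflite_model = tflite.Model.GetRootAsModel(tflite_buf, 0)
except AttributeError:
    import tflite.Model
    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_buf, 0)

# Input name, shape, and dtype are placeholders; check them in your model.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```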

Thanks for the clarification. Although model accuracy should be the target of an evaluation, I agree that we need to be careful when evaluating the accuracy of a quantized TFLite model, since users might expect to get the same accuracy as the TFLite implementation. For that reason it would be a good idea to have this "TFLite rounding" in place so that it can be used by the TFLite frontend and we can fully compare accuracy with the TFLite implementation.
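
As a toy illustration (plain Python, not TVM's or TFLite's actual code) of why element-wise outputs can differ by one while overall accuracy still matches: two common rounding conventions disagree exactly on the half-way cases that show up during requantization.

```python
import math

def round_half_away_from_zero(x):
    # 2.5 -> 3, -2.5 -> -3
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

def round_half_to_even(x):
    # Python's built-in round(): 2.5 -> 2, 3.5 -> 4 (banker's rounding)
    return round(x)

for v in (2.5, 3.5, -2.5):
    print(v, round_half_away_from_zero(v), round_half_to_even(v))
# 2.5 -> 3 vs 2;  3.5 -> 4 vs 4;  -2.5 -> -3 vs -2
```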
