Running pre-quantized TFLite models is an ongoing effort. The latest PR, https://github.com/dmlc/tvm/pull/3900, tackles quantized InceptionV1. You can try different models from the TFLite hosted-models repo; I think everything works except InceptionV3.
TF(Lite) mobilenet_ssd (even FP32) is still some way from being supported. TF has special control-flow operators for SSD that require a lot of complicated work in Relay/TVM (the project is called Relay VM). The TVM community is working on that continuously, but expect at least a couple of months before it is realizable. After FP32 is supported, I will start looking into int8.
This one seems to be where the TFLite test graphs and TFLite operators are executed with Relay.
done in Relay/TVM (like you mentioned, this is called Relay VM)
But it falls short when it comes to handling quantized operators, as I cannot find any function explicitly handling the int8 or fp32 tensor type in that code.
But this second piece of code seems to have a function handling the various data types and tensors, such as uint32, fp32, and uint8, with a function explicitly deciding between the various tensor types:
get_tensor_type_str(self, tensor_type): """Get tensor type string representation when given TFLite tensor type"""
Can the first piece of code be treated as the one handling quantized TFLite graphs, or can the second?
get_tensor_type_str is just a utility function. Its presence is not enough to run a quantized model.
For quantized models, most operators require special handling because of zero points and scales. This is exactly the idea behind QNN. We are working on some final details and should have QNN support soon.
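To make "special handling because of zero point and scales" concrete, here is a minimal sketch (illustrative Python, not TVM/QNN source) of the affine quantization scheme TFLite uses, where a real value r is represented by an integer q via r ≈ scale * (q - zero_point). The helper names `quantize`/`dequantize` are hypothetical:

```python
def quantize(r, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to a uint8 quantized value (hypothetical helper)."""
    q = round(r / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the uint8 range

def dequantize(q, scale, zero_point):
    """Recover the approximate real value from the quantized one."""
    return scale * (q - zero_point)

# Example: scale=0.1, zero_point=128 covers reals in roughly [-12.8, 12.7].
q = quantize(3.14, 0.1, 128)   # an integer near 159
r = dequantize(q, 0.1, 128)    # approximately 3.1
```

Because every tensor carries its own (scale, zero_point) pair, an operator such as add or conv2d cannot simply operate on the raw integers; it has to account for both tensors' parameters, which is why each operator needs dedicated handling.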
My doubt is also about the flow of code in a QNN-based TVM handling quantization. I tried to follow the discussion "TF Lite quantized conv2d operator conversion" and arrived at these scenarios; are they close to the truth?
Is it that, if we give a quantized (int8) graph as input, QNN-TVM would first convert the graph tensors to QNN tensors by scaling and zero-point shifting, and then compile it on a TVM runtime capable of handling int8 operators as well as fp32 graphs?
Or that, given an int8 graph as input, it would be scaled and shifted to FP32, then converted to QNN tensors and fed to the same TVM runtime that was previously used for unquantized graphs as well?
Or, given an fp32 graph as input, it is scaled and zero-point shifted to int8 with QNN operators and then fed to a TVM runtime used only for int8 graph handling?
Could you please elaborate on a typical flow of how the code consumes and runs an int8/fp32 graph in the case of QNN?
Quantized (INT8) computation is different from FP32: an INT8 tensor has a scale and a zero_point. When we feed a TFLite quantized model to TVM, we parse it, extract the scale / zero_point information, and pass it on to the next computation. So we do not scale it to FP32 and so on. For normal users, you don't need to know any QNN concepts; just know that when you feed TVM a TFLite quantized model, TVM will handle it automatically and perform the correct INT8 arithmetic.
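A small sketch of why the computation can stay in integers (illustrative Python, not TVM source; the function name and values are made up). For a dot product between a quantized input a (scale Sa, zero point Za) and quantized weights w (Sw, Zw), the real result is Sa*Sw * sum((a_i - Za) * (w_i - Zw)), and it is requantized straight to the output's (So, Zo) without ever materializing FP32 tensors:

```python
def quantized_dot(a, w, Za, Zw, Sa, Sw, So, Zo, qmin=0, qmax=255):
    """Integer dot product with requantization (hypothetical sketch)."""
    # Accumulate in a wide integer (int32 in real kernels).
    acc = sum((ai - Za) * (wi - Zw) for ai, wi in zip(a, w))
    # Requantize with the combined scale M = Sa*Sw/So. Real kernels use a
    # fixed-point multiplier for M; a Python float keeps the sketch short.
    M = Sa * Sw / So
    q = round(acc * M) + Zo
    return max(qmin, min(qmax, q))

out = quantized_dot([130, 141], [120, 125],
                    Za=128, Zw=128, Sa=0.5, Sw=0.25, So=1.0, Zo=128)
```

Only the per-tensor scales and zero points cross operator boundaries; the tensor data itself stays in integer form end to end.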
@FrozenGene summarized it correctly. As a TVM user, you should not need to worry about whether you are passing an FP32 model or a quantized model.
QNN is a Relay dialect, i.e., it creates wrapper ops for handling zero points and scales. We are discussing including the QNN passes in relay.build; it will take a few days to resolve this.
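To illustrate the "dialect of wrapper ops" idea, here is a toy sketch (plain Python, not Relay code): a QNN-style op is a thin wrapper that a lowering pass can rewrite into existing base ops, much as a qnn dequantize can be expressed with ordinary subtract/cast/multiply. The base-op names are stand-ins, not real Relay APIs:

```python
# Hypothetical "base ops" standing in for existing Relay primitives.
def cast_f32(x):
    return float(x)

def subtract(x, y):
    return x - y

def multiply(x, y):
    return x * y

def qnn_dequantize(q, scale, zero_point):
    """Wrapper op: lowers to subtract -> cast -> multiply on base ops."""
    shifted = subtract(q, zero_point)
    return multiply(cast_f32(shifted), scale)
```

The wrapper carries the quantization parameters explicitly, so a lowering pass can eliminate the dialect op entirely and hand the runtime only ops it already knows how to compile.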
Of the three options you listed, it is the first that we are doing. Again, you should be able to feed an int8 graph directly, and TVM should keep everything in int8, respecting the framework graph. Your pipeline should not change between FP32 and int8.
If you are interested, these are the two PRs. Once merged, we will have a complete flow for TFLite quantized models.