Running TFLite models on TVM

TVM can run various TFLite models. I tried this example, and it worked:

This is the TFLite model zoo:

Some TFLite models use asymmetric quantization, so I'm not sure whether those models can be easily imported into Relay or TVM.
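For readers unfamiliar with the term, asymmetric (affine) quantization maps a real value to an integer through a scale and a non-zero zero point, with real = scale * (q - zero_point). The sketch below illustrates that mapping in plain Python. The helper names and the parameter values are hypothetical and chosen only for illustration; they are not taken from any particular model or from TVM.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to uint8 using an affine (asymmetric) scheme."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the representable range


def dequantize(q, scale, zero_point):
    """Recover the approximate real value: real = scale * (q - zero_point)."""
    return scale * (q - zero_point)


# Hypothetical parameters covering roughly [-1.0, 1.0] with uint8.
scale, zero_point = 2.0 / 255, 128

for x in (0.0, 0.37, 1.0):
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", round(dequantize(q, scale, zero_point), 4))
```

Because the zero point is not 0, the integer 0 does not represent the real value 0.0, which is why an importer has to carry the zero point along with the scale.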

So I doubt whether I can repeat the tutorial example with every TFLite model in the model zoo. If anyone has any ideas, that would be a great help.

Running pre-quantized TFLite models is an ongoing effort. The latest PR tackles quantized InceptionV1. You can try different models from the TFLite hosted repo; I think everything works except InceptionV3.

Oh okay…
thanks for your reply.

Also, I tried a classification model, mobilenet_v1, and it compiled on TVM successfully.

But with mobilenet_ssd for object detection, it seems I cannot set the data type (dtype) of the input tensor correctly in the from_tflite function.

mobilenet_ssd does work when imported from the GluonCV library through the from_mxnet function (as in this example), but I wanted to run the graph for a quantized SSD model using the TFLite importer.

Any idea whether the mobilenet_ssd model from TFLite is supported by TVM yet?
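For reference, the input dtype mentioned above is passed to from_tflite through a dtype dictionary keyed by the input tensor name. The following is a minimal sketch that follows the pattern used in the TVM TFLite tutorial. The helper names build_input_dicts and import_tflite are hypothetical, and so are the tensor name "input" and the shape, which must be read from the actual model. It shows only how the dtype is supplied and does not by itself make mobilenet_ssd importable.

```python
def build_input_dicts(input_name, input_shape, input_dtype):
    """Build the shape and dtype dictionaries that from_tflite accepts."""
    return {input_name: tuple(input_shape)}, {input_name: input_dtype}


def import_tflite(model_path, input_name, input_shape, input_dtype):
    """Load a .tflite file and convert it to Relay (needs the tvm and tflite packages)."""
    # Imported lazily so the helper above stays usable without these packages.
    from tvm import relay

    with open(model_path, "rb") as f:
        buf = f.read()
    try:
        import tflite
        model = tflite.Model.GetRootAsModel(buf, 0)        # newer tflite packages
    except AttributeError:
        import tflite.Model
        model = tflite.Model.Model.GetRootAsModel(buf, 0)  # older tflite packages

    shape_dict, dtype_dict = build_input_dicts(input_name, input_shape, input_dtype)
    # Returns a Relay module and its parameters.
    return relay.frontend.from_tflite(model, shape_dict=shape_dict, dtype_dict=dtype_dict)


# Hypothetical tensor name and shape; a quantized model typically takes uint8 input.
shape_dict, dtype_dict = build_input_dicts("input", (1, 224, 224, 3), "uint8")
print(shape_dict, dtype_dict)
```

For a float model the same call would use "float32" as the dtype; the rest of the call stays the same.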

TF(Lite) mobilenet_ssd (even FP32) is still some way from being supported. TF has special control flow operators for SSD that require a lot of complicated work in Relay/TVM (the project is called Relay VM). The TVM community is working on that continuously, but expect at least a couple of months before it is realizable. After FP32 is supported, I will start looking into int8.

Okayy … thanks for the info!

Also, I have tried running the various TFLite examples supported on TVM (without QNN) that can handle quantized TFLite graphs.

The two Python files I was trying to run:


This one seems to be where the TFLite test graphs and TFLite operators are executed with Relay (as you mentioned, the Relay VM work is done in Relay/TVM). But it falls short when it comes to handling quantization of operators, as I cannot find any function that explicitly handles the int8 or fp32 tensor type in the code.

But this code seems to have a function to handle the various data types and tensors, such as uint32, fp32 and uint8, explicitly deciding between the tensor types:

get_tensor_type_str(self, tensor_type)
"""Get tensor type string representation when given TFLite tensor type"""

Which of the two can be treated as handling quantized TFLite graphs, the first or the second?

get_tensor_type_str is just a utility function. Its presence is not enough to run a quantized model.
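To make the point concrete, here is a rough standalone sketch of what such a utility amounts to: a lookup from a tensor type enum to a dtype string. The TensorType class below is a stand-in with placeholder values, not the real TFLite schema enum, and the function is written without self, so it is an illustration rather than the parser's actual code.

```python
from enum import Enum


class TensorType(Enum):
    """Stand-in for the TFLite schema's tensor type enum (values are placeholders)."""
    FLOAT32 = "FLOAT32"
    INT32 = "INT32"
    UINT8 = "UINT8"
    STRING = "STRING"


def get_tensor_type_str(tensor_type):
    """Return the dtype string for a tensor type, or raise if it is not handled."""
    mapping = {
        TensorType.FLOAT32: "float32",
        TensorType.INT32: "int32",
        TensorType.UINT8: "uint8",
    }
    if tensor_type not in mapping:
        raise NotImplementedError(f"Tensor type {tensor_type} is not supported")
    return mapping[tensor_type]


print(get_tensor_type_str(TensorType.UINT8))  # uint8
```

The lookup only names the storage type. It carries no scale or zero point, so knowing that a tensor is uint8 says nothing about how to compute with it.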

For quantized models, most of the operators require special handling because of zero point and scales. This is exactly the idea behind QNN. We are working on some final details and we should have QNN support soon.
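As an illustration of why zero points and scales need special handling, the sketch below computes a dot product on quantized integers. It is plain Python with hypothetical values and helper names, not QNN code. The zero points must be subtracted before multiplying, and the integer accumulator represents a real value at scale a_scale * b_scale.

```python
def quantize(x, scale, zero_point):
    """Affine-quantize one real value to the uint8 range."""
    return max(0, min(255, round(x / scale) + zero_point))


def quantized_dot(qa, qb, a_zp, b_zp):
    """Integer-only dot product; zero points are subtracted before multiplying."""
    return sum((x - a_zp) * (y - b_zp) for x, y in zip(qa, qb))


# Hypothetical tensors and quantization parameters.
a, a_scale, a_zp = [0.5, -0.25, 1.0], 0.01, 100
b, b_scale, b_zp = [0.2, 0.4, -0.6], 0.005, 128

qa = [quantize(v, a_scale, a_zp) for v in a]
qb = [quantize(v, b_scale, b_zp) for v in b]

acc = quantized_dot(qa, qb, a_zp, b_zp)    # int32-style accumulator
approx = acc * a_scale * b_scale           # real value the accumulator represents
exact = sum(x * y for x, y in zip(a, b))   # float reference
# Multiplying the raw integers without subtracting zero points gives a wrong result.
naive = sum(x * y for x, y in zip(qa, qb)) * a_scale * b_scale

print(approx, exact, naive)
```

Here the zero-point-aware result matches the float reference closely, while the naive product of raw integers is far off.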

Okayy … thanks for the inputs!

So are QNN ops the way TVM handles quantized TFLite graphs, as in this code (I found it in what is perhaps your TVM repo, in the tflite_parser branch):

My doubt is also about the flow of code in a QNN-based TVM for handling quantization. I tried to follow the discussion "TF Lite quantized conv2d operator conversion" and arrived at these possibilities. Are any of them close to the truth?
1. Given a quantized (int8) graph as input, QNN-TVM first converts the graph tensors to QNN tensors by scaling and zero-point shifting, then compiles it for a TVM runtime capable of handling int8 operators as well as fp32 graphs.
2. Given an int8 graph as input, it is scaled and shifted to FP32, then converted to QNN tensors and fed to the same TVM runtime that was previously used for unquantized graphs.
3. Given an fp32 graph as input, it is scaled and zero-point shifted to int8 with QNN operators and then fed to a TVM runtime used only for int8 graphs.

Could you please elaborate on the typical flow of how the code consumes and runs an int8/fp32 graph in the QNN case?


Quantized (INT8) computation is different from FP32. An INT8 tensor has a scale and a zero_point. When we feed a TFLite quantized model to TVM, we parse it, get the scale / zero_point information, and pass it on to the next computation. So we do not scale it to FP32 and so on. As a normal user, I think you don't need to know any QNN concepts; just know that when you feed in a TFLite quantized model, TVM handles it automatically and performs only the correct INT8 arithmetic.
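One way to picture "pass it on to the next computation" is a requantize step: an integer accumulator at one scale is re-expressed at the output tensor's scale and zero point without producing an FP32 tensor. The sketch below is a simplified illustration with hypothetical values. It uses a float ratio for brevity, whereas integer kernels typically approximate that ratio with a fixed-point multiplier; it is not TVM's implementation.

```python
def requantize(acc, in_scale, out_scale, out_zero_point, qmin=0, qmax=255):
    """Rescale an integer accumulator to the output tensor's scale and zero point."""
    # Real value represented: acc * in_scale. Re-express it on the output grid.
    q = round(acc * in_scale / out_scale) + out_zero_point
    return max(qmin, min(qmax, q))  # saturate to the uint8 range


# Hypothetical values: the accumulator scale is input_scale * weight_scale.
in_scale = 0.02 * 0.005
out_scale, out_zero_point = 0.05, 10

for acc in (0, 4000, 123456, -50000):
    print(acc, "->", requantize(acc, in_scale, out_scale, out_zero_point))
```

The result is again a uint8 value with its own scale and zero point, ready to be consumed by the next quantized operator.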

@FrozenGene summarized it correctly. As a TVM user, you should not need to worry about whether you are passing an FP32 model or a quantized model.

QNN is a Relay dialect, i.e., it creates wrapper ops for handling zero points and scales. We are currently discussing where the QNN passes should be included. It will take a few days to resolve this.

Of the three options you listed, the first is what we are doing. Again, you should be able to feed an int8 graph directly, and TVM should keep everything in int8, respecting the framework graph. Your pipeline should not change for FP32 vs int8.

If you are interested, these are the 2 PRs. Once they are merged, we will have a complete flow for TFLite quantized models.