[Quantization] Format of the quantized tensors

After quantizing a model with relay.quantize.quantize and passing it to relay.build, I wanted to look at the quantized tensors in the params object returned by relay.build. The params object is a dictionary of {tensor_name: tensor_values}. Its size is roughly 1/4 of the non-quantized one, which I took to mean that only INT8 values are written into it.

Looking at the params object, I noticed that some entries are listed as dtype=float32, which makes no sense to me since I assumed that only INT8 operations are allowed in the final graph. Is there any documentation on what is stored, such as min, max, zero point (zp), and scale? Or are those computed during the quantization process and never stored? I explored the params object directly, without serializing its contents.
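For anyone wanting to repeat this inspection, here is a minimal sketch of grouping the entries of a params-like dictionary by dtype. It only assumes that each value exposes a `dtype` attribute. `FakeTensor`, `group_params_by_dtype`, and the example entries are hypothetical stand-ins, not part of TVM and not taken from a real build.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a parameter tensor; only dtype and shape are used."""
    dtype: str
    shape: tuple


def group_params_by_dtype(params):
    """Return {dtype: sorted list of parameter names} for a name -> tensor mapping."""
    groups = defaultdict(list)
    for name, tensor in params.items():
        # str() lets this work whether dtype is a plain string or a dtype object.
        groups[str(tensor.dtype)].append(name)
    return {dtype: sorted(names) for dtype, names in groups.items()}


# Made-up example data (assumption), standing in for the dict returned by the build.
example_params = {
    "p0": FakeTensor("int8", (32, 3, 3, 3)),
    "p1": FakeTensor("float32", (32,)),
    "p2": FakeTensor("int8", (32, 1, 3, 3)),
}

dtype_groups = group_params_by_dtype(example_params)
for dtype, names in sorted(dtype_groups.items()):
    print(dtype, names)
```

Passing the real params dictionary instead of `example_params` should work the same way, as long as its values carry a `dtype` attribute.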

Also, I noticed that the relay.save_param_dict(params) and relay.load_param_dict(param_bytes) functions do not appear to check types, so now I wonder whether type information gets lost when the parameters are serialized to a file and read back. When I write the params file and then read it back using these functions, I see more FP32 tensors than in the case described above.

Any pointers as to where I should start looking in the code base?

Thanks!

@vinx13 You might have some idea about this.

min, max, zp, and scale are computed during quantization, but they are not stored as such. They are used to generate operations such as mul / add for quantize/requantize, and only some of the quantization params (min, max, zp, scale) are kept as inputs to these operations. The param names in the param dict can give you a hint about what each fp32 param is.
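To illustrate how a scale can end up as an operand of a multiply rather than as stored metadata, here is a generic sketch of symmetric INT8 quantization. It assumes a single per-tensor scale and no zero point, and it is not TVM's actual implementation. The function names and the values are made up for the example.

```python
def quantize(values, scale, qmin=-128, qmax=127):
    """Multiply by 1/scale, round to the nearest integer, then clip to the int8 range."""
    inv_scale = 1.0 / scale  # the scale survives only as this multiplier
    return [max(qmin, min(qmax, round(v * inv_scale))) for v in values]


def dequantize(qvalues, scale):
    """Map integers back to approximate float values with a single multiply."""
    return [q * scale for q in qvalues]


scale = 0.5                    # assumed scale, chosen so the arithmetic is exact
weights = [0.9, -1.6, 100.0]   # made-up float values

q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)

print(q_weights)  # [2, -3, 127]
print(restored)   # [1.0, -1.5, 63.5]
```

The last value shows the clipping step: 100.0 maps to 200 before clipping, so it saturates at 127 and comes back as 63.5.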

The types are serialized in relay.save_param_dict.
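One way to check this on your own parameters is to compare dtypes before and after a save/load round trip. The helper below is a sketch that assumes both sides behave like name -> tensor mappings whose values expose `dtype`. `dtype_changes` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without TVM.

```python
from types import SimpleNamespace


def dtype_changes(before, after):
    """Return {name: (dtype_before, dtype_after)} for entries whose dtype differs.

    A name missing on one side is reported with None for that side.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = str(before[name].dtype) if name in before else None
        new = str(after[name].dtype) if name in after else None
        if old != new:
            changes[name] = (old, new)
    return changes


# Stand-in data (assumption): in practice `reloaded` would come from a real round trip.
saved = {
    "p0": SimpleNamespace(dtype="int8"),
    "p1": SimpleNamespace(dtype="float32"),
}
reloaded = {
    "p0": SimpleNamespace(dtype="int8"),
    "p1": SimpleNamespace(dtype="float32"),
}

round_trip_diff = dtype_changes(saved, reloaded)
print(round_trip_diff)  # an empty dict means every dtype survived
```

With TVM available, `reloaded` could be obtained as `relay.load_param_dict(relay.save_param_dict(params))`, using the two functions mentioned earlier in the thread; an empty result would then indicate that the dtypes were preserved.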

Thanks! I’ll look into that. The thing is that the original tensor names are lost after the build process; instead, I get sequentially numbered names (e.g. p1, p2, …). I guess I can map those back to the original names, since they are generated top-down as you traverse the graph.

I think I understand what is going on. I generated the Relay IR for my test case (mobilenet_v2) and found that some of the operands are FP tensors. I used the "Deploy a Quantized Model on Cuda" tutorial as the starting point for my quantization, so perhaps I am missing some other parameters that tell TVM to fully quantize all of the tensors, so that all operations are done using integer arithmetic.

In any event, that does explain why some of the parameters are in FP format.

Not all operators have a quantize implementation, for example nn.bias_add. If your graph contains such ops, they will be left as float32.

Thanks a lot, that does explain what I see in the Relay IR. I’ll start looking more in depth at the quantization code to fully understand where things stand.

Keep in mind that this isn’t a limitation of TVM; quantization is still a work in progress. These ops will likely have quantize implementations in the future.

Yes, I know TVM is actively working on quantization. I actually want to contribute to that effort but have some time constraints, which is why I was looking at what is done and what is missing. If I can align my current work with the needs of TVM, I’ll start sending some PRs :)
