After quantizing a model with relay.quantize.quantize and passing it to relay.build, I wanted to look at the quantized tensors in the params object returned by relay.build. The params object is a dictionary of {tensor_name: tensor_values}. The size of params is ~1/4 of the non-quantized one, which I took to mean that only INT8 values are written into it, or that is what I thought.
Looking at the params object, I noticed that some entries are listed as dtype=float32, which makes no sense to me, since I assumed that only INT8 operations are allowed in the final graph. Is there any documentation on what is stored, such as min, max, zero point (zp), and scale? Or are those computed during the quantization process and never stored? Note that I explored the params object directly in memory, without serializing its contents.
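For reference, here is a minimal sketch of how the entries can be grouped by dtype. It only assumes that each value in params exposes .dtype and .shape (as NumPy arrays do); the helper name summarize_params and the demo tensors are made up for illustration and are not real output from relay.build.

```python
from collections import defaultdict
from math import prod
from types import SimpleNamespace


def summarize_params(params):
    """Group parameter names by dtype and count the elements per dtype.

    Works on any mapping of name -> object exposing .dtype and .shape.
    """
    summary = defaultdict(lambda: {"names": [], "elements": 0})
    for name, tensor in params.items():
        dtype = str(tensor.dtype)
        summary[dtype]["names"].append(name)
        # prod(()) == 1, so scalar tensors count as one element.
        summary[dtype]["elements"] += prod(tuple(tensor.shape))
    return dict(summary)


# Hypothetical stand-ins for the values in params (names and shapes invented).
demo_params = {
    "p0": SimpleNamespace(dtype="int8", shape=(64, 3, 7, 7)),
    "p1": SimpleNamespace(dtype="int32", shape=(64,)),
    "p2": SimpleNamespace(dtype="float32", shape=(1000,)),
}

for dtype, info in sorted(summarize_params(demo_params).items()):
    print(dtype, len(info["names"]), info["elements"], info["names"])
```

With the real object, the call would simply be summarize_params(params) on the dictionary returned by relay.build.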
Also, I noticed that the code in the relay.save_param_dict(params) and relay.load_param_dict(param_bytes) functions does not check for types, and now I wonder whether type information may get lost when the parameters are serialized to a file and read back. When I write the params file and then read it back using these functions, I see more FP32 tensors than in the in-memory case described above.
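A minimal sketch of the round-trip comparison I have in mind follows. The helper diff_param_dtypes is hypothetical, and the two demo dictionaries are invented purely to exercise it; they are not measurements. In the real case, before would be the params from relay.build and after would be the result of relay.load_param_dict(relay.save_param_dict(params)).

```python
from types import SimpleNamespace


def diff_param_dtypes(before, after):
    """Return {name: (dtype_before, dtype_after)} for every mismatch.

    None marks a name that is missing on one side.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = str(before[name].dtype) if name in before else None
        new = str(after[name].dtype) if name in after else None
        if old != new:
            changes[name] = (old, new)
    return changes


# Invented stand-in data, only to show the shape of the result.
before = {
    "p0": SimpleNamespace(dtype="int8"),
    "p1": SimpleNamespace(dtype="int32"),
}
after = {
    "p0": SimpleNamespace(dtype="float32"),  # pretend this dtype changed
    "p1": SimpleNamespace(dtype="int32"),
    "p2": SimpleNamespace(dtype="float32"),  # pretend this entry appeared
}

for name, (old, new) in diff_param_dtypes(before, after).items():
    print(f"{name}: {old} -> {new}")
```

An empty result would mean the names and dtypes survive serialization unchanged, which would point the problem somewhere else.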
Any pointers as to where I should start looking in the code base?
Thanks!