Kudos @ziheng for working through all the details and writing a very nice RFC.
I like many aspects that you discuss - Hardware abstraction that lets you choose dtypes for the hardware, the threshold estimation, and converting it into a learning problem. And thanks for raising the topic of debugging. It is extremely hard to debug a quantized network.
I have a few concerns/suggestions that might be worth discussing:
- Currently, you only have scales and no zero points, so I think you are still considering symmetric quantization. Since you are going to do a big design change anyway, it might be worth looking into asymmetric quantization as well.
- I did not fully understand the `threshold_rectify` requirement. I understand it is there because you have to get the same scales for the input tensors. But is it only for simulation? Or are you going to bring this `requantize` into the realize pass as well?
  A major reason I have these questions is that we now have a QNN dialect in TVM that supports both TFLite and MXNet quantized networks. It has both asymmetric and channel-wise quantization support. So I believe it is worth thinking about whether this is the right time to use QNN ops for realization. We can make threshold estimation try all these different quantization techniques and rely on QNN for lowering. This would unify the efforts, create a single, easy-to-understand design, and avoid duplication. @vinx13 and I also had a quick chat about this integration.
- Minor implementation suggestion - if we are not using KL-divergence, can we modify the graph to add Relay min/max operators at each edge and then just collect those outputs, instead of storing the whole tensors?
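To make the symmetric-vs-asymmetric point above concrete, here is a minimal plain-Python sketch (the helper names are mine, not from the RFC or QNN) of what a zero point buys you:

```python
def quantize_asymmetric(x, scale, zero_point, qmin=0, qmax=255):
    """Affine (asymmetric) quantization: q = round(x / scale) + zero_point."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the integer range

def dequantize_asymmetric(q, scale, zero_point):
    """Recover an approximation of the original real value."""
    return (q - zero_point) * scale

# For a skewed range like [-1.0, 5.0], the zero point shifts the grid so the
# full uint8 range is used. Symmetric quantization (zero point fixed at 0)
# would have to cover [-5.0, 5.0] and waste almost half the codes.
scale = 6.0 / 255.0
zero_point = 42  # ~ round(-(-1.0) / scale)

q = quantize_asymmetric(0.0, scale, zero_point)   # real zero maps exactly
x_hat = dequantize_asymmetric(q, scale, zero_point)
```

Note that the real value 0.0 round-trips exactly, which matters for things like zero padding in quantized convolutions.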
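On the last point, a rough plain-Python sketch of the idea (hypothetical class name; in Relay this would amount to appending min/max reduce ops to each edge and only fetching their scalar outputs during calibration):

```python
class MinMaxObserver:
    """Tracks a running min/max over calibration batches in O(1) memory,
    instead of storing every intermediate tensor."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def observe(self, values):
        # In the real pipeline these scalars would come from the graph's
        # min/max ops; here we just scan a flat list of values.
        for v in values:
            if v < self.minimum:
                self.minimum = v
            if v > self.maximum:
                self.maximum = v

    def threshold(self):
        # For symmetric quantization, use the larger magnitude as the threshold.
        return max(abs(self.minimum), abs(self.maximum))

obs = MinMaxObserver()
for batch in ([0.1, -2.5, 3.0], [1.5, -0.5]):
    obs.observe(batch)
# obs.minimum is -2.5, obs.maximum is 3.0, obs.threshold() is 3.0
```

This only works for min/max-style calibration; KL-divergence needs the full histogram of activations, which is why the suggestion is conditional on not using it.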