As of now, QNN supports only per-tensor quantization, i.e., there is a single scale and zero point for the whole tensor. Prior research has shown that per-channel (a.k.a. per-axis) quantization can lead to better accuracy while having minimal impact on performance. In per-channel quantization, there is a separate scale and zero point for each channel along a chosen axis of the tensor.
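The accuracy argument can be seen with a small NumPy sketch (hypothetical data, not from any framework): when one channel has a much smaller value range than another, a single per-tensor scale is dominated by the large channel and the small channel loses precision, while per-channel scales recover it.

```python
import numpy as np

# Toy weight tensor with 3 output channels whose value ranges differ widely
# (illustrative data only).
weights = np.array([
    [0.05, -0.02, 0.08, -0.07],   # small-range channel
    [0.90, -0.40, 0.70, -1.00],
    [7.50, -3.20, 6.00, -8.00],   # large-range channel
])

# Per-tensor: a single scale chosen from the global max |w|.
per_tensor_scale = np.abs(weights).max() / 127.0

# Per-channel: one scale per output channel (axis 0).
per_channel_scale = np.abs(weights).max(axis=1) / 127.0

q_pt = np.round(weights / per_tensor_scale).astype(np.int8)
q_pc = np.round(weights / per_channel_scale[:, None]).astype(np.int8)

# Reconstruction error on the small-range channel: far lower per-channel,
# because its scale is no longer dominated by the large-range channel.
err_pt = np.abs(q_pt[0] * per_tensor_scale - weights[0]).max()
err_pc = np.abs(q_pc[0] * per_channel_scale[0] - weights[0]).max()
```

Here `err_pc` is orders of magnitude smaller than `err_pt` for the small-range channel, which is exactly the effect per-channel quantization exploits.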
Channel quantization in frameworks
Both MXNet-MKLDNN and TensorFlow support channel-wise quantization. However, the support is limited to only a few operators, and it comes with some restrictions that keep the performance degradation from being severe. For example, only weights are considered for channel quantization; activations/intermediate feature maps are still per-tensor quantized. Additionally, whenever a weight tensor is channel-quantized, the zero point is 0 for the whole tensor (only the scales are per-axis). These restrictions hold for both frameworks. More details can be found at
- Tensorflow quantization - https://www.tensorflow.org/lite/performance/quantization_spec
- MKL-DNN quantization - https://intel.github.io/mkl-dnn/ex_int8_simplenet.html
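The restriction shared by both frameworks (symmetric per-channel weights, zero point fixed at 0) can be sketched as a small helper. The function name and signature below are hypothetical, just to make the convention concrete:

```python
import numpy as np

def quantize_weights_per_channel(weights, axis=0):
    # Hypothetical helper mirroring the restriction above: symmetric
    # per-channel quantization of a weight tensor, with the zero point
    # fixed at 0 and one scale per slice along `axis`.
    w = np.moveaxis(weights, axis, 0)
    max_abs = np.abs(w).reshape(w.shape[0], -1).max(axis=1)
    # Guard against all-zero slices to avoid division by zero.
    scales = np.where(max_abs > 0.0, max_abs / 127.0, 1.0)
    bshape = (-1,) + (1,) * (w.ndim - 1)
    q = np.clip(np.round(w / scales.reshape(bshape)), -127, 127)
    return np.moveaxis(q, 0, axis).astype(np.int8), scales
```

For an OIHW conv2d weight, `axis=0` gives one scale per output channel; because every zero point is 0, dequantization is simply `q * scale` broadcast along that axis, which is what keeps the fast integer kernels usable.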
The above restrictions nicely translate to changes in only 2 QNN operators - quantize and requantize. Both operators take the scale as an input expr. The lowering can check whether the input_scale is a vector and, if so, perform the channel-wise lowering; otherwise, we fall back to the per-tensor lowering. We also need an axis argument to tell along which axis the tensor is quantized.
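The dispatch described above can be sketched in NumPy. The function and parameter names are illustrative, not TVM's actual lowering API, and the arithmetic is simulated in floating point (a real lowering would use fixed-point multipliers):

```python
import numpy as np

def requantize_lower(data_q, input_scale, output_scale, axis=0):
    # Sketch of the proposed dispatch: a scalar input_scale takes the
    # existing per-tensor path; a vector triggers the channel-wise
    # lowering, with `axis` saying which dimension the scales run along.
    input_scale = np.asarray(input_scale, dtype=np.float64)
    if input_scale.ndim == 0:
        # Per-tensor fallback: one multiplier for the whole tensor.
        multiplier = float(input_scale) / output_scale
    else:
        # Channel-wise: reshape so the multipliers broadcast along `axis`.
        shape = [1] * data_q.ndim
        shape[axis] = -1
        multiplier = (input_scale / output_scale).reshape(shape)
    return np.round(data_q * multiplier).astype(np.int8)
```

A scalar scale and a length-`C` vector of scales thus flow through the same operator, which is why only the lowering and the new axis argument need to change.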
Overall, not many changes are needed to support channel-wise quantization. This RFC is to gather feedback from the community.