Do the scale and zero point for the softmax operator in CMSIS-NN have to be 1/256 and -128?

Zheng-Bicheng · March 27, 2024, 1:13am

Hello @leandron and @lhutton1. I am currently adding tests for the Softmax operator of the Paddle model targeting CMSIS-NN. During several days of testing, I found that not all softmax operators can be offloaded to CMSIS-NN. The specific reasons are as follows:

cmsisnn.py

    def check_qnn_softmax(pattern):
        """Check if softmax is supported by CMSIS-NN."""
        dequantize_call = pattern.args[0].args[0]
        scale = pattern.args[1].data.numpy().item(0)
        zero_point = pattern.args[2].data.numpy().item(0)

        # Check the quantization parameters and the dtypes of quantize and dequantize.
        if (
            (scale == 1.0 / 256 and zero_point == -128)
            and pattern.attrs.out_dtype == "int8"
            and dequantize_call.args[0].checked_type.dtype == "int8"
        ):
            return True

        if (
            (scale == 1.0 / 32768 and zero_point == 0)
            and pattern.attrs.out_dtype == "int16"
            and dequantize_call.args[0].checked_type.dtype == "int16"
        ):
            return True
        return False

I found that for int8 quantization, the output scale and zero point of the softmax operator must be exactly 1/256 and -128, respectively. As far as I know, in both Torch and Paddle models the scale and zero point are user-defined. I don’t understand why these particular values are mandatory.

leandron · March 27, 2024, 9:34am

Hi @Zheng-Bicheng,

The SOFTMAX operator in CMSIS-NN is designed to be bit-accurate with TFLite.

These are some interesting documentation links:

Softmax operator in CMSIS-NN: Softmax Functions
Quantization specification in TFLite: TensorFlow Lite 8-bit quantization specification

Note the following restriction in the TFLite softmax:

SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

In summary, the restriction is expected: it comes from CMSIS-NN being bit-accurate with TFLite, which fixes the int8 softmax output parameters to these values.
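
To illustrate what these fixed values mean numerically, here is a minimal pure-Python sketch of affine quantization applied to a float softmax output. It is not the CMSIS-NN or TFLite kernel (those use fixed-point arithmetic); the helper names `softmax`, `quantize`, and `dequantize` are hypothetical and only show how the range [0, 1) maps onto int8 with scale 1/256 and zero point -128.

```python
import math

# int8 softmax output parameters taken from the TFLite spec quoted above.
SCALE = 1.0 / 256.0
ZERO_POINT = -128


def softmax(xs):
    """Numerically stable float softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(value, scale=SCALE, zero_point=ZERO_POINT):
    """Affine quantization: q = round(value / scale) + zero_point, clamped to int8."""
    q = int(round(value / scale)) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale=SCALE, zero_point=ZERO_POINT):
    """Inverse mapping: real = scale * (q - zero_point)."""
    return scale * (q - zero_point)


probs = softmax([1.0, 2.0, 3.0])
quantized = [quantize(p) for p in probs]
restored = [dequantize(q) for q in quantized]
```

With these parameters, a probability of 0.0 maps to -128 and the largest representable value is 255/256 at 127, so the full int8 range is spent on the interval that softmax can actually produce.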

leandron · March 27, 2024, 9:44am

Also replied on GitHub.

Zheng-Bicheng · March 27, 2024, 10:47am

Thank you for your response. I will attempt to fix the scale and zero point of softmax to 1/256 and -128 respectively while quantizing the Paddle model.
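
For anyone checking their own quantization parameters before offloading, the conditions in `check_qnn_softmax` above can be mirrored in a small standalone predicate. This is only a sketch under that assumption: the table `SUPPORTED_SOFTMAX_QPARAMS` and the helper `is_cmsisnn_softmax_compatible` are hypothetical names, not part of TVM, CMSIS-NN, or Paddle.

```python
# Supported output (scale, zero_point) per dtype, copied from the conditions
# in check_qnn_softmax. The table and helper name are hypothetical.
SUPPORTED_SOFTMAX_QPARAMS = {
    "int8": (1.0 / 256, -128),
    "int16": (1.0 / 32768, 0),
}


def is_cmsisnn_softmax_compatible(in_dtype, out_dtype, scale, zero_point):
    """Return True if the softmax quantization matches a supported combination."""
    # Input (dequantize) and output (quantize) dtypes must agree.
    if in_dtype != out_dtype:
        return False
    expected = SUPPORTED_SOFTMAX_QPARAMS.get(out_dtype)
    # Both values are exactly representable in floating point, so an exact
    # comparison is used here, as in the original check.
    return expected is not None and (scale, zero_point) == expected
```

For example, `is_cmsisnn_softmax_compatible("int8", "int8", 1.0 / 128, -128)` returns `False`, which corresponds to a softmax that would not be offloaded to CMSIS-NN.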