INT8 quantization vectorize sum problem


#1

Hi, I want to generate C code for an int8 sum, but there is a problem:
My hardware has no instruction for int8 + int8 = int16 (or int32), so I have to convert the int8 vector to int32 first. However, the vector bit-width is fixed, so I have to convert one int8 vector into four int32 vectors and perform four adds. Is there any solution?


#2

Hi,

Can you elaborate on what your hardware is? For example, on ARM NEON you’ll typically do something like the following for 4xint8 + 4xint8 -> 4xint16: either VADDL.S8 (widen both operands to 4xint16, then add them) or, if that didn’t exist for some reason, VMOVL.S8 + VMOVL.S8 + VADD.S16 (splitting that sequence up into multiple instructions).
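To make the widen-then-add sequence concrete, here is a minimal scalar sketch in portable C. It only models the semantics lane by lane; it does not use NEON intrinsics, and the function name `widening_add_s8` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a widening add (hypothetical helper, not a NEON API).
 * Each int8 lane is sign-extended to int16 before the add, so the
 * result cannot wrap: the extremes are -128 + -128 = -256 and
 * 127 + 127 = 254, both of which fit in int16. */
void widening_add_s8(const int8_t *a, const int8_t *b,
                     int16_t *out, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i) {
        int16_t wa = (int16_t)a[i];   /* widen first operand  */
        int16_t wb = (int16_t)b[i];   /* widen second operand */
        out[i] = (int16_t)(wa + wb);  /* add at 16-bit width  */
    }
}
```

A fused widening instruction does all three steps at once; without it, the two widening steps and the add are issued separately, as in the loop body.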


#3

Hi,
Thanks for the reply. My hardware is a DSP that can perform 512-bit vector operations. It has no 4xint8 + 4xint8 -> 4xint16, only something like 4xint32 + 4xint32 = 4xint32. We can manually expand an int8 vector to int32, but because the vector bit-width is fixed, the result is four int32 vectors:
64 x int8
to
16xint32 16xint32 16xint32 16xint32
Because we can load 64 elements at a time, this is the best way to utilize the bandwidth.
The problem is that I need to do four adds for every load, while TVM generates only one add per load.
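For reference, the four-adds-per-load pattern can be sketched in portable C. This is only a scalar model under assumptions: it treats the "sum" as a reduction over an int8 buffer, it uses plain arrays in place of the DSP's 512-bit registers, and the name `sum_i8_via_i32` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES_I8  64                       /* int8 lanes in one 512-bit vector  */
#define LANES_I32 16                       /* int32 lanes in one 512-bit vector */
#define PARTS     (LANES_I8 / LANES_I32)   /* 4 int32 vectors per int8 vector   */

/* Hypothetical helper: sums n int8 values using only int32 adds. */
int32_t sum_i8_via_i32(const int8_t *data, size_t n)
{
    /* Four int32 accumulator "vectors", one per quarter of the int8 vector. */
    int32_t acc[PARTS][LANES_I32] = {{0}};
    size_t full = n - (n % LANES_I8);      /* elements covered by whole loads */

    for (size_t base = 0; base < full; base += LANES_I8) {  /* one 64-lane load */
        for (size_t p = 0; p < PARTS; ++p) {                /* four widen + add steps */
            for (size_t l = 0; l < LANES_I32; ++l) {
                acc[p][l] += (int32_t)data[base + p * LANES_I32 + l];
            }
        }
    }

    /* Horizontal reduction of the four accumulators. */
    int32_t total = 0;
    for (size_t p = 0; p < PARTS; ++p) {
        for (size_t l = 0; l < LANES_I32; ++l) {
            total += acc[p][l];
        }
    }

    /* Scalar tail for any elements that do not fill a whole load. */
    for (size_t i = full; i < n; ++i) {
        total += (int32_t)data[i];
    }
    return total;
}
```

Each int32 add covers 16 of the 64 loaded elements, so the four adds per load follow directly from the 4x widening.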