How to speed up int8/int16 computing on ARM CPUs?

In TVM, it seems the input and output data must have the same data type; see the following code in arm_cpu/conv2d:


ARM NEON, however, has instructions such as int8 x int8 = int16 and int16 x int16 = int32, which perform more multiplies per instruction and can speed up the computation (8 vs. 4 vs. 2 multiplies per instruction for int8, int16, and int32). The question is: are there any methods that use these instructions to speed up int8/int16 quantized models on an ARM CPU?
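For illustration only (plain Python, not TVM code), the widening behavior of an instruction like NEON's vmlal.s8 can be sketched as follows. Each int8 x int8 product is computed at full precision and accumulated into an int16 lane, so no precision is lost; the helper names here are made up for the sketch:

```python
def wrap(v, bits):
    """Wrap a Python int to a signed integer of the given width,
    mimicking a fixed-width hardware register."""
    mask = (1 << bits) - 1
    v &= mask
    return v - (1 << bits) if v >= (1 << (bits - 1)) else v

def vmlal_s8(acc16, a8, b8):
    """Sketch of NEON vmlal.s8 semantics: widening multiply-accumulate.
    int8 lanes are multiplied and accumulated into int16 lanes."""
    return [wrap(acc + wrap(a, 8) * wrap(b, 8), 16)
            for acc, a, b in zip(acc16, a8, b8)]

# Even the extreme int8 products fit in int16 without overflow.
acc = vmlal_s8([0] * 4, [127, -128, 10, -1], [127, -128, -10, 1])
print(acc)  # [16129, 16384, -100, -1]
```

Because each int8 x int8 product fits in int16, the hardware can do twice as many of these per vector register as it could with int16 inputs.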


I have the same question as mentioned above.

Support for pre-quantized networks is still a work in progress in TVM; there are a number of pull requests for this. Furthermore, for the cases you mention, there are already instructions being generated for vmlal, IIRC, as well as newer support for using the dot product instructions from Armv8.2.


There have been some efforts overall. ARM has many variants, and different variants have different levels of int8 support.

For the very recent Dot Product instructions, please take a look at this PR -

For older generations like ARMv7 and early ARMv8, I don't think we can get a significant performance improvement by using int8, because the theoretical throughput per the ISA is not high. Please refer to the talk here -

I have updated to the latest version, and there are conv2d_int8-related files in /topi/arm_cpu. Does it work correctly now? And how do I set the target to use it?

There will be variations depending on the micro-architecture across the implementation space. While the benefits might not be visible in some of the examples on the slides, the answer may be different on a bigger implementation. There is usually no one-size-fits-all answer in the architecture.


I want to give it a try, but I don't know how to set the target.

One more question: tvm.sum's result dtype is inferred from its two operands.

For example, in your code, if data_vec and kernel_vec are uint8/int8, the conv dtype will be uint8/int8, and the actual compute dtype can't be changed to int16/int32. So will the result be wrong in this case?
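The concern can be illustrated in plain Python with hypothetical values (not TVM code): if the accumulator dtype stays uint8, a 3x3 convolution sum wraps modulo 256 and the result is wrong.

```python
# Hypothetical 3x3 window and kernel values, all within uint8 range.
window = [200] * 9
kernel = [3] * 9

# Accumulating in uint8: every step wraps modulo 256.
acc_u8 = 0
for a, k in zip(window, kernel):
    acc_u8 = (acc_u8 + a * k) & 0xFF

# Accumulating in a wide type (like astype('uint32')): exact.
acc_u32 = sum(a * k for a, k in zip(window, kernel))

print(acc_u8, acc_u32)  # 24 5400 -- the uint8 accumulation is wrong
```

This is why the astype('uint32') cast shown later in this thread is needed for correctness when the inputs are 8-bit.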

@yzhliu Do you have an example that can be shared?

I see the code in arm_cpu/: the function dot_int8_int8_int32 uses tensorize to do int8 x int8 = int32. So the input and output dtypes can differ?
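For reference, the semantics of one step of such a dot product (as in the ARMv8.2 sdot instruction, which takes int8 inputs and an int32 accumulator) can be sketched in plain Python. This is an illustration of the instruction's semantics under my reading of it, not the actual TVM code:

```python
def to_s8(v):
    """Interpret an integer's low 8 bits as a signed int8 value."""
    v &= 0xFF
    return v - 256 if v >= 128 else v

def sdot_lane(acc32, a_bytes, b_bytes):
    """Sketch of one lane of ARMv8.2 sdot: four int8 * int8 products
    summed into a single int32 accumulator, so input and output
    dtypes differ by design."""
    assert len(a_bytes) == len(b_bytes) == 4
    return acc32 + sum(to_s8(a) * to_s8(b)
                       for a, b in zip(a_bytes, b_bytes))

print(sdot_lane(0, [1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

Tensorize replaces a matching int8-reduction pattern in the TVM IR with this intrinsic, which is how the output dtype can legitimately be wider than the inputs.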

For example, as below, the result C's dtype is uint8:

import tvm

A = tvm.placeholder((1080, 1920), name='A', dtype='uint8')
K = tvm.placeholder((3, 3), name='K', dtype='uint8')

ry = tvm.reduce_axis((0, 3), name='ry')
rx = tvm.reduce_axis((0, 3), name='rx')

C = tvm.compute((1080-2, 1920-2),
                lambda i, j: tvm.sum(A[i + ry, j + rx] * K[ry, rx],
                                     axis=[ry, rx]),
                name='C')
print("C dtype:", C.dtype)

If astype is used to cast the compute dtype, the result C's dtype is uint32:

C = tvm.compute((1080-2, 1920-2),
                lambda i, j: tvm.sum(A[i + ry, j + rx].astype('uint32') * K[ry, rx].astype('uint32'),
                                     axis=[ry, rx]),
                name='C')

But the lowered code indicates that the actual multiply-add operation is uint32 by uint32. I suspect this will stop LLVM from doing any int8 vectorization, even if the hardware supports it.

produce C {
  for (i, 0, 1080) {
    for (j, 0, 1920) {
      C[(((i*1918) + j) + -1919)] = (uint32)0
      for (ry, 0, 3) {
        for (rx, 0, 3) {
          if (likely((1 <= i))) {
            if (likely((i < 1079))) {
              if (likely((1 <= j))) {
                if (likely((j < 1919))) {
                  C[(((i*1918) + j) + -1919)] = (C[(((i*1918) + j) + -1919)] + (uint32(A[(((((i + ry)*1920) + j) + rx) + -1921)])*uint32(K[((ry*3) + rx)])))
                }
              }
            }
          }
        }
      }
    }
  }
}


In this case of sum, it looks like it will upcast to int32 to retain precision. If there is a special HW instruction that internally upcasts from int8 and performs the addition, completely hidden from the software, then yes, this explicit upcasting to int32 loses the opportunity to use that special instruction.

To use the dot product, we use a TVM feature called "tensorize" to map a segment of TVM IR to the ARM dot product semantics. This is not automatic; it is done manually. The idea is that LLVM might not be able to figure all this out by itself, so we detect the pattern in TVM IR and directly embed the intrinsic in the TVM IR.

This is not the best thing to do, as it does not scale if we add more of these special HW instructions. However, as of now, the limited number of special instructions allows us to use tensorize in only a couple of places.


There are potentially many instructions in the instruction set which a vectorizer may not be able to detect. For example, urhadd / uhadd might be other interesting instructions to look at, as they compute the average of two 8-bit vectors (rounding upwards, (a[i] + b[i] + 1) / 2, vs. truncating, (a[i] + b[i]) / 2).
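The per-lane difference between the two averaging flavors can be shown in plain Python (a sketch of single-lane semantics only; the real instructions operate on whole 8-bit vectors, and the 9-bit intermediate sum is kept internally so nothing overflows):

```python
def uhadd(a, b):
    """Sketch of uhadd per-lane semantics: truncating halving add."""
    return (a + b) >> 1

def urhadd(a, b):
    """Sketch of urhadd per-lane semantics: rounding halving add."""
    return (a + b + 1) >> 1

va = [0, 1, 250, 255]
vb = [1, 2, 251, 255]
print([uhadd(a, b) for a, b in zip(va, vb)])   # [0, 1, 250, 255]
print([urhadd(a, b) for a, b in zip(va, vb)])  # [1, 2, 251, 255]
```

The two differ exactly when a[i] + b[i] is odd, which is why an autovectorizer matching only one idiom would miss the other instruction.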