QNN - Conv2D/Dense Legalize for platforms with no fast Int8 units

janimesh · November 11, 2019, 7:40am

QNN ops are converted to a sequence of Relay operators before they go through traditional Relay optimization passes. Currently, most of this conversion from QNN ops to Relay ops is suitable for devices that have fast Int8 support in HW. For example, a qnn conv2d is broken down using the following equation
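Written out, the expansion looks roughly as follows. This is a sketch reconstructed from the term numbering used in this post: the symbols $Q_a$, $Q_w$ (quantized data and weight) and $z_a$, $z_w$ (their zero points) are assumed names, and padding, strides and layout are ignored.

```latex
\begin{aligned}
\sum_{c,r,s} \bigl(Q_a(n,c,h+r,w+s) - z_a\bigr)\bigl(Q_w(k,c,r,s) - z_w\bigr)
 &= \underbrace{\sum_{c,r,s} Q_a(n,c,h+r,w+s)\, Q_w(k,c,r,s)}_{\text{Term1}} \\
 &\quad - \underbrace{z_a \sum_{c,r,s} Q_w(k,c,r,s)}_{\text{Term2}} \\
 &\quad - \underbrace{z_w \sum_{c,r,s} Q_a(n,c,h+r,w+s)}_{\text{Term3}} \\
 &\quad + \underbrace{\sum_{c,r,s} z_a\, z_w}_{\text{Term4}}
\end{aligned}
```

Term2 and Term4 depend only on the weights and zero points, so they can be computed ahead of time; Term3 depends on the input data and has to be computed at runtime.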

More details at - TF Lite quantized conv2d operator conversion

In the above case, Term1 is a convolution with just int8 tensors. This is beneficial for HW platforms that have fast Int8 arithmetic support, such as Intel VNNI, ARM dot product, or Nvidia DP4A. Here, the overhead of Term3 (Term2 and Term4 are constants) is potentially much smaller than the benefit of fast Int8 arithmetic.

However, the question is: is this lowering good for HW platforms that do not have fast Int8 arithmetic support, like older Intel machines or Raspberry Pi 3? We have observed that a separate lowering sequence with fewer Relay operations gives better performance, with the catch that the conv2d inputs now become int16.

The new lowering follows the first equation in the picture above: we first subtract the zero points from data and weight (which raises the precision to int16) and then perform the convolution. The specific observation is on Raspberry Pi 3, where int16 conv performs much better than int8 conv because LLVM packs the instructions better (https://github.com/apache/incubator-tvm/pull/4277). This also aligns well with what @FrozenGene and @jackwish observed while working on the integer conv2d schedule for ARM devices.
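To make the two lowerings concrete, here is a minimal NumPy sketch (not TVM code) that checks they give the same result. The naive `conv2d_nchw` helper, the tensor shapes, the uint8 inputs and the zero-point values are all assumptions chosen for illustration; stride is 1 and there is no padding.

```python
import numpy as np

def conv2d_nchw(data, weight):
    """Naive NCHW/OIHW convolution, stride 1, no padding, int32 accumulation."""
    n, c, h, w = data.shape
    k, _, r, s = weight.shape
    out = np.zeros((n, k, h - r + 1, w - s + 1), dtype=np.int32)
    w32 = weight.astype(np.int32)
    for i in range(h - r + 1):
        for j in range(w - s + 1):
            patch = data[:, :, i:i + r, j:j + s].astype(np.int32)  # (n, c, r, s)
            # Contract over (c, r, s); result has shape (n, k).
            out[:, :, i, j] = np.tensordot(patch, w32, axes=([1, 2, 3], [1, 2, 3]))
    return out

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(1, 3, 5, 5), dtype=np.uint8)    # illustrative shape
weight = rng.integers(0, 256, size=(4, 3, 3, 3), dtype=np.uint8)  # illustrative shape
zp_a, zp_w = 128, 120                                             # illustrative zero points

# Simpler lowering: subtract zero points first (tensors become int16), then one conv.
data_i16 = data.astype(np.int16) - np.int16(zp_a)
weight_i16 = weight.astype(np.int16) - np.int16(zp_w)
simple = conv2d_nchw(data_i16, weight_i16)

# Four-term lowering: Term1 is the only conv on the original 8-bit tensors.
term1 = conv2d_nchw(data, weight)
term2 = zp_a * weight.astype(np.int32).sum(axis=(1, 2, 3)).reshape(1, -1, 1, 1)
ones = np.ones((1, 3, 3, 3), dtype=np.uint8)
term3 = zp_w * conv2d_nchw(data, ones)   # sliding-window sum of the data
term4 = zp_a * zp_w * 3 * 3 * 3          # c * r * s elements per window
four_term = term1 - term2 - term3 + term4

print(np.array_equal(simple, four_term))
```

The simpler lowering needs two subtractions and one convolution, while the four-term form needs extra reductions and elementwise ops around Term1, which is where the smaller operator count comes from.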

From a code standpoint, specializing this legalization is easy, as we already have a QNN Legalization infrastructure. We can look at target flags to figure out whether we have fast HW support - cascadelake for Intel, dotprod for ARM, etc. If we do not, we can legalize with the simpler lowering, and LLVM should be able to give us the benefits automatically.
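As a rough illustration of the target-flag check, here is a plain-Python sketch. `has_fast_int8` and `pick_lowering` are hypothetical helpers, not TVM APIs, and the string matching is an assumption; the actual check in the QNN Legalize code may look quite different.

```python
def has_fast_int8(target: str) -> bool:
    """Hypothetical helper: guess fast Int8 support from an LLVM-style target string."""
    for opt in target.split():
        # Intel: cascadelake is the example CPU named in this post.
        if opt.startswith("-mcpu=") and opt[len("-mcpu="):] == "cascadelake":
            return True
        # ARM: look for the dotprod feature in the attribute list.
        if opt.startswith("-mattr="):
            attrs = opt[len("-mattr="):].split(",")
            if "+dotprod" in attrs:
                return True
    return False

def pick_lowering(target: str) -> str:
    """Hypothetical dispatch between the two lowering sequences."""
    if has_fast_int8(target):
        return "four_term_int8"
    return "subtract_zero_points_int16"

print(pick_lowering("llvm -mcpu=cascadelake"))
print(pick_lowering("llvm -device=arm_cpu -mattr=+neon"))
```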

What is the downside? The memory footprint takes a hit because the weights are now upcast to int16.
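For a sense of scale, a quick NumPy check of the weight size before and after the upcast. The 64x64x3x3 weight shape is just an assumed example.

```python
import numpy as np

# Assumed example: a 3x3 conv with 64 input and 64 output channels.
weight_i8 = np.zeros((64, 64, 3, 3), dtype=np.int8)
weight_i16 = weight_i8.astype(np.int16)

print(weight_i8.nbytes)   # 36864 bytes
print(weight_i16.nbytes)  # 73728 bytes, i.e. twice the int8 footprint
```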

If we later decide that we can write a much better schedule with int8 conv, we can easily disable this legalization. So, this is not a one-way door.

Summary

This RFC discusses the need for different legalizations of QNN ops depending on whether the HW supports fast Int8 arithmetic operations. Code-wise, we can use the existing QNN Legalize infrastructure.

@jackwish @FrozenGene @ajtulloch @yzhliu @tqchen @vinx13

FrozenGene · November 11, 2019, 8:08am

Thanks @janimesh for bringing up this RFC.

My preferred way is to do this in the legalize pass (checking HW support). If we leave it to the schedule, we need to do similar checks there anyway: if we have the dot product instruction on ARM, we use dot product and not the SMLAL instruction (INT16 * INT16 + INT16 -> INT32); if not, we use the SMLAL instruction. This could be cleaner in the legalize pass.

One thing we should consider is tensorize. When we tensorize, we should be careful that the dtype is now INT16. However, I think casting to INT16 makes sense for tensorize too, because even if we accept UINT8, we will still cast to INT16 and do the subtraction.
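A small NumPy sketch of why the cast is needed either way. It assumes uint8 data and a zero point anywhere in the uint8 range, and enumerates every (value, zero point) pair to find the extremes.

```python
import numpy as np

q = np.arange(256, dtype=np.uint8)            # every possible uint8 value
zero_points = np.arange(256, dtype=np.int16)  # every possible uint8 zero point

# Difference of every (value, zero point) pair, computed in int16.
diff = q.astype(np.int16)[:, None] - zero_points[None, :]
lo, hi = int(diff.min()), int(diff.max())     # -255 and 255

fits_int8 = np.iinfo(np.int8).min <= lo and hi <= np.iinfo(np.int8).max
fits_int16 = np.iinfo(np.int16).min <= lo and hi <= np.iinfo(np.int16).max

# Worst-case product of two shifted values: too large for int16, fine for int32.
max_product = max(lo * lo, hi * hi)           # 65025

print(lo, hi, fits_int8, fits_int16, max_product)
```

The shifted values need 9 bits, so int16 is the narrowest standard type that holds them, and their products need a wider accumulator, which matches the int16-multiply, int32-accumulate pattern discussed here.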

When we want to write a better INT8 conv schedule, this will still be useful and could be the base. In our experience on ARM CPUs, the SMLAL instruction (INT16 * INT16 + INT16 -> INT32) is useful and we shouldn't change it; we should instead improve other parts of the schedule, like compute_at and so on.

janimesh · November 11, 2019, 7:53pm

PR - https://github.com/apache/incubator-tvm/pull/4307

janimesh · November 11, 2019, 10:07pm

@FrozenGene You have raised good points. However, this RFC is specifically for QNN. Allow me to differentiate:

QNN lowering

The RFC is for QNN lowering. There are 2 sequences of Relay operators to choose from. Currently, the Relay codebase uses only 1 of them. This RFC discusses the need for the other lowering on platforms that do not have fast Int8 support. With this lowering, the TOPI schedule never sees an Int8 conv.

Non-QNN lowering

This can happen in Automatic Quantization, where the schedule does see an int8 conv. Your comments are completely valid there. The question you are raising is whether we should handle this in Legalize or in the schedule, and the decisions involved are similar to the ones we have to make for QNN lowering. For now, though, @jackwish and I think that since we have not explored everything comprehensively, let's keep the changes contained in the schedule. Once we have explored all the options, we will have a better understanding of the division of labor between Relay and the schedule.

jackwish · November 12, 2019, 3:58am

Thanks for sharing, @janimesh. Looks like we are making the legalization more modular. Generally a good design to me - as long as it isn't meant to be a long-lived solution.