TF Lite quantized conv2d operator conversion

Please refer to: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491f/BABDEAGJ.html

int32x4_t  vmlal_s16(int32x4_t a, int16x4_t b, int16x4_t c);    // VMLAL.S16 q0,d0,d0

We don’t have

int32x4_t  vmlal_sxx(int32x4_t a, int8x4_t b, int8x4_t c);    // hypothetical: no such INT8-to-INT32 widening multiply-accumulate exists

Inserting a pad is one option; however, as described in this RFC: https://github.com/dmlc/tvm/issues/2682, the pad cannot be fused, which leads to worse performance. In option 2, we could make q_conv2d accept input_zero_point and use input_zero_point as the pad value. So, for option 2, if we have a q_conv2d API, the pad issue is not a problem.
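
To make option 2 concrete, here is a rough NumPy sketch; the q_conv2d name and every parameter are illustrative only, not a proposed final API:

```python
import numpy as np

# Illustrative sketch only: if q_conv2d carries input_zero_point, the padding
# can happen inside the operator with that value, so no separate (unfusable)
# pad op is needed. The convolution core itself is elided.
def q_conv2d(q_data, q_kernel, input_zero_point, kernel_zero_point,
             strides=(1, 1), padding=(0, 0), out_dtype="int32"):
    ph, pw = padding
    # Pad with the input zero point rather than 0: in the dequantized view
    # these padded pixels are exactly 0.0, which is what conv2d padding means.
    q_padded = np.pad(q_data, ((0, 0), (0, 0), (ph, ph), (pw, pw)),
                      mode="constant", constant_values=input_zero_point)
    # ... the INT8 x INT8 -> INT32 convolution core plus requantization
    # would follow here ...
    return q_padded
```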

Loved this detailed explanation. Thanks. Let me think about this in a little more detail. I need to understand the intermediate INT32 register argument.

One other axis that complicates this whole design space I guess is HW support. Intel VNNI performs 4 INT8 reductions into an INT32 register. Given this makes compute very fast, it is unclear, without any experiments, whether Intel machines have performance bottlenecks similar to ARM's. I think it would require efforts like yours on the Intel side as well to find the tradeoffs.

I see. That makes sense.

One other axis that complicates this whole design space I guess is HW support

Yes. It’s very important to clarify what is most important to take into consideration. On one hand, TVM can hardly generate instructions such as VNNI or VMLAL automatically for logic like sum(a[m][k].astype("int32") * b[n][k].astype("int32")) where a and b are INT8 data, according to our experiments. (Actually, we are very curious how to get this done.) Therefore tensorize seems to be a must for performance; otherwise the performance advantage of quantization cannot outperform FP32 by much. On the other hand, tensorize needs to be rewritten for targets with different ISAs, including x86 with different SSE/AVX extensions, ARMv7-A, ARMv8-A, etc.
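
For context, a minimal sketch of the reduction in question, written with TVM's te API (module paths vary across TVM versions; shapes and names are illustrative). Every schedule has to map this pattern onto VNNI/VMLAL-style instructions, which is where tensorize comes in:

```python
import tvm
from tvm import te

M, N, K = 64, 64, 64
A = te.placeholder((M, K), dtype="int8", name="A")
B = te.placeholder((N, K), dtype="int8", name="B")
k = te.reduce_axis((0, K), name="k")
# INT8 operands widened to INT32 and accumulated in INT32.
C = te.compute(
    (M, N),
    lambda m, n: te.sum(A[m, k].astype("int32") * B[n, k].astype("int32"),
                        axis=k),
    name="C")
```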

Our internal work has different schedules for different purposes, for example a spatial pack as the fallback schedule, and some tensorize-based schedules for performance in different scenarios (single core, multi-core). We’d like to contribute back to the community, though it may not happen soon as far as I can see. But maybe we can share some design insights.

consider handling symmetric int8 kernel (fixed 0 offset) and uint8 with 0 offset as specializations.

This requires that the TFLite model provided by the user uses symmetric quantization, which is expected to be generated by TF/TFLite tools. AFAIK, the official pre-trained quantized MobileNetV1 is not. Assuming a 0 offset seems a bit aggressive?

Also, I believe TFLite implementations perform a quantized downscale to uint8 feature data between layers. In TFLite these are implemented by integer multiply and downshift operations. Some target devices don’t have floating point, and especially not double precision, so maybe consider supporting the integer implementation as an option.

Yes, enabling integer-only devices is one of the targets of gemmlowp. You are an expert on this :slight_smile: If we take TFLite’s approach, the computation inside q_conv2d is purely integer operations, including the requantization that converts the accumulated INT32 result back to INT8. So, no worries about the integer implementation :slight_smile:

I wouldn’t say expert, but I ported the TFLite quantized Inception models to a RISC-V architecture in my last job.

Note that the tflite implementation uses int64 operations in the integer downscale operations, while some targets do not support that. So your integer implementation may need to allow for that.

The downscale integer operations will also need to do rounding and saturation.

Note that the tflite implementation uses int64 operations in the integer downscale operations, while some targets do not support that.

That is interesting. So were you handling it at the ISA level or in C (with the compiler’s help to emulate int64)?

The downscale only stores 8-bit data from the upper word of the int64 result. The multiply by the downscale constant moves the data into the upper word, then a rounding right shift, then saturate and store the uint8 feature value.

The RISC-V had a form of multiply instruction that keeps just the upper word of what would have been a 64-bit result… so that provided the needed part for the downscale.
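
For anyone following along, a small NumPy sketch of that integer-only downscale, in the spirit of the TFLite/gemmlowp reference but simplified and not bit-exact (function and variable names are mine):

```python
import numpy as np

def requantize(acc_int32, multiplier_int32, right_shift, output_zero_point):
    """Downscale an INT32 accumulator to uint8 using only integer ops:
    fixed-point multiply keeping the upper word, rounding right shift,
    add the output zero point, then saturate. Simplified sketch."""
    acc = np.asarray(acc_int32, dtype=np.int64)
    # Fixed-point multiply: the useful bits land in the upper word of the
    # 64-bit product; keep the high 32 bits with rounding.
    prod = 2 * acc * np.int64(multiplier_int32)
    nudge = np.where(prod >= 0, np.int64(1 << 30), np.int64(1 - (1 << 30)))
    high = (prod + nudge) >> 31
    # Rounding right shift by the per-layer shift amount.
    if right_shift > 0:
        high = (high + (np.int64(1) << (right_shift - 1))) >> right_shift
    # Add the output zero point, then saturate to the uint8 range.
    return np.clip(high + output_zero_point, 0, 255).astype(np.uint8)
```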

The TFLite implementations include an option for selecting reference code. I found it very useful for debugging to modify their reference code to do printf output before and after activation, rounding, and saturation, and I prefixed the output with the c,h,w index values. The c,h,w prefix can be used to sort the lines so that you can match them up with your implementation’s output order (assuming it also tags and outputs its values). I converted six of the TFLite quantized models to RISC-V using this.

So, this is just a suggestion from my own experience: provide a similar way to dump the data while trying to match the TFLite reference per-layer data. Then you can do whatever other optimizations and parallel processing you like and tune for performance.

I modified TFLite operations to support dumping for debugging six models … the four Inceptions and two MobileNets. This was back in March, but it might be of some help. This version was used on Ubuntu 18.04.
tflite c++ model printf debug


@jackwish @FrozenGene Thanks for explaining the tradeoffs. I got some time to think more deeply about this.

Σc,r,s QW(k, c, r, s) × QA(n, c, h + r, w + s)    // Term 1
  - Σc,r,s zp_a × QW(k, c, r, s)                  // Term 2
  - Σc,r,s zp_w × QA(n, c, h + r, w + s)          // Term 3
  + Σc,r,s zp_a × zp_w                            // Term 4
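
A tiny NumPy check of this expansion, with the reduction axes c, r, s collapsed into one dimension (shapes and names are purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
QW = rng.randint(0, 256, size=(8, 36)).astype(np.int32)  # weights, (k, c*r*s)
QA = rng.randint(0, 256, size=(36,)).astype(np.int32)    # input patch at one (n, h, w)
zp_w, zp_a = 120, 128

direct = ((QW - zp_w) * (QA - zp_a)).sum(axis=1)

term1 = (QW * QA).sum(axis=1)
term2 = zp_a * QW.sum(axis=1)
term3 = zp_w * QA.sum()            # depends only on the input patch
term4 = QW.shape[1] * zp_a * zp_w  # compile-time constant
assert np.array_equal(direct, term1 - term2 - term3 + term4)
```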

I came up with the following conclusions (after reading FBGEMM and QNNPACK):

  • Partial output reuse - Intermediate INT32 value reuse - We should strive to perform all the computation for the intermediate INT32 value before going to memory. @jackwish and @FrozenGene achieve this by fusing the core computation (Term 1 to Term 4 in the above equation) along with requantize, so that the INT32 value always stays in the register file and never has to be spilled to memory and brought back.
  • Input reuse - Quantized input matrix reuse - This is also very necessary. If we look at the equation, Term 3 is basically a big offset matrix that has to be applied to Term 1. Both Term 1 and Term 3 share the quantized input matrix. If we perform the calculations in different operators with no fusion, we will not be able to reuse the quantized matrix A.

So, in its current form, Option 1 does not satisfy both of the above requirements, potentially leaving significant performance opportunity on the table. As performance is the primary goal (at least in my case), I am also leaning towards option 2 (with a dread of writing a tensorized schedule). However, I do have a couple of unresolved thoughts that are bothering me.

  1. For option 2, we are performing the computations for Term 2 and Term 4 at runtime. I don’t know how much of that penalty can be avoided by pre-computing them at compile time. This is also a limitation of the current TVM infrastructure, where pre-compute/fold-constant is limited to Relay.
  2. Making a final “weak” case for option 1 here. It might be possible to keep the 4 terms as separate Relay ops, precompute Term 2 and Term 4 using Relay passes (solving the problem in point 1), and then somehow perform both horizontal and vertical fusion to fuse everything into one giant op. Then we could use compute_inline() and compute_at() to get both input and output reuse. Again, this is a “weak” case; it does not sound easy to do. (This is somewhat in line with FBGEMM, which has more granular APIs, some of which are executed at compile time.)

If we merge all 4 terms back into their original form, which is something like sum((A - a_zp) * (W - w_zp)), the subtraction happens only when we load the input or weights from memory, which may not be the bottleneck in practice given a good schedule.

Precomputing is good, and QNNPACK’s approach is to precompute Terms 2 & 4 and merge them into the bias before the operator runs (QNNPACK doesn’t have a compile time, but it does have a prepare stage). However, it won’t be easy to do similar work in TVM…
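
A sketch of that prepare-stage folding, assuming an (out_channels, c*r*s) weight layout and an int32 bias (names are mine, not QNNPACK's):

```python
import numpy as np

def fold_offsets_into_bias(bias_int32, QW, zp_a, zp_w):
    # Term 2 and Term 4 depend only on the weights and the two zero points,
    # so they can be folded into the bias once, before the operator runs.
    reduce_size = QW.shape[1]
    term2 = zp_a * QW.astype(np.int64).sum(axis=1)
    term4 = reduce_size * zp_a * zp_w
    # The runtime then only needs Term 1 - Term 3 + folded_bias.
    return (bias_int32 - term2 + term4).astype(np.int32)
```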

One thing I should mention (hoping my comments won’t mislead in any direction) is that, though I have kept saying that reducing INT32 intermediate memory access is important for performance, the memory access pattern (schedule) is very significant too.

Yes, I understand that portion. Loop optimizations and memory access patterns are extremely important for performance.

Thanks everybody for the great discussion. I will put up an RFC for reading unquantized models from MXNet and TFLite into Relay in the next few days.

Thanks, everyone, for the insightful discussion. I also want to point everyone back to the original quantization workflow RFC: https://github.com/dmlc/tvm/issues/2259

There are a few facts that I want to highlight.

There is more than one quantization scheme, and different schemes give different speed-accuracy tradeoffs

“Quantization” is a generic term that has been used for many methods; specifically, there are choices of:

  • Different bitwidths and signed/unsigned types in different layers
  • Symmetric vs asymmetric
  • Using floating-point multiplication vs forcing the use of integer shifts only (see the sketch below)
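
On that last bullet, the usual trick is to turn the floating-point requantization scale into an int32 multiplier plus a right shift at compile time, so the runtime stays integer-only. A sketch of that conversion, pairing with the downscale sketch earlier in the thread (edge cases omitted):

```python
import math

def quantize_multiplier(real_multiplier):
    # Split the scale into significand * 2^exponent, then express the
    # significand as a Q31 fixed-point integer.
    assert 0.0 < real_multiplier < 1.0
    significand, exponent = math.frexp(real_multiplier)  # significand in [0.5, 1)
    quantized = int(round(significand * (1 << 31)))
    if quantized == (1 << 31):   # rounding pushed it up to 2^31
        quantized //= 2
        exponent += 1
    return quantized, -exponent  # int32 multiplier, right shift amount
```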

Most frameworks, like TFLite, try to have one opinionated view about the quantized model. There is a good need to support importing these models, but we also need to keep in mind that directly adopting the implied scheme may not give the best performance.

Sometimes we might even want to use a mixed scheme across a neural network, just like many of the discussions mentioned here. Most troubles are due to the asymmetric scheme :slight_smile: We should always keep this in mind. There is no one perfect solution, and we want to build APIs to make sure we can cover more.

How many quantized ops we want to introduce

As a general rule of thumb, I would recommend that we minimize the number of new operators as well as their names. For example, both quantize and dequantize can likely be converted to subtraction, multiplication, and cast. This means that we do not want to introduce these operators, at least not at the core level. The fewer new operators we introduce, the easier it is to reuse the existing infrastructure. We can likely do a similar thing for relu (which becomes clip) and maxpool (which stays the same, assuming a single zero point). As a matter of fact, giving up asymmetry would let most of the quantization pipeline fall onto the current core operators.
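
A rough Relay sketch of that point, using only existing ops (the uint8 range and the names are illustrative assumptions):

```python
from tvm import relay

def dequantize(qdata, scale, zero_point):
    # real = scale * (q - zero_point): cast, subtract, multiply only.
    shifted = relay.subtract(relay.cast(qdata, "int32"),
                             relay.const(zero_point, "int32"))
    return relay.multiply(relay.cast(shifted, "float32"),
                          relay.const(scale, "float32"))

def quantize(data, scale, zero_point):
    # q = clip(round(real / scale) + zero_point, 0, 255): still no new op.
    scaled = relay.round(relay.divide(data, relay.const(scale, "float32")))
    shifted = relay.add(relay.cast(scaled, "int32"),
                        relay.const(zero_point, "int32"))
    return relay.cast(relay.clip(shifted, 0, 255), "uint8")

def quantized_relu(qdata, zero_point):
    # relu in the quantized domain is just a clip at the zero point.
    return relay.clip(qdata, a_min=zero_point, a_max=255)
```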

We can, however, introduce dialects that facilitate the conversion; in that case relay.contrib.q_conv2d is an OK name, as long as we try to lower as many of them as possible to the low-level operators.


It’s a good idea to make it possible in TVM to utilize existing ops in a symmetric scheme. I’d like to share some knowledge regarding symmetric/asymmetric; hopefully it helps the design decision.

  • TensorFlow/TFLite use an asymmetric scheme by default, e.g. for the pre-trained quantized MobileNetV1 (which is built with quantization-aware training), though symmetric is also supported.
  • PyTorch/Caffe2/QNNPACK seem to follow the asymmetric approach. (By seem, I mean the zero point is essential in the code, but there is no detailed document stating that.)
  • TensorRT adopts a symmetric design.

I think introducing a namespace like relay.contrib.quantize is a good solution. We could introduce q_conv2d / q_fully_connected and so on in this namespace.

I completely agree with this. However, supporting asymmetric quantization might be necessary to keep the accuracy in check. So, I think we need to support asymmetric quantization.

I also like this idea.


I think there are three things we are trying to balance: performance, accuracy, and reusing existing compiler infrastructure (or keeping the code clean).

For example, an asymmetric quantized_convolution can be a dialect, which we can rewrite using low-level ops (Option 1). As discussed above, this might lead to bad performance but good accuracy and good reuse of existing compiler infrastructure. On the other hand, Option 2 leads to nearly the best performance and accuracy, but at the expense of a new operator.


As @tqchen aptly said - “There is no one perfect solution and we want to build API to make sure we can cover more”

With that in mind, I think going with contrib ops and a separate namespace might be a good middle-ground solution. To be more precise, we won’t have a TVM compute for this contrib op. Instead, we will rewrite these ops into a sequence of low-level ops. We will add new low-level ops if something is not directly implementable with what we have today. This might take a performance hit but should keep the design modular and simpler.

  • If I have not forgotten how to do maths, this should satisfy asymmetric quantization requirements :slight_smile:
  • If one does want the best performance, I can envision a Relay pass that fuses the concerned ops into one op, with a schedule written for that fused op. This does not sound easy at all, but something has to take a hit.

@tqchen, please let me know if I understood your comment correctly.
@jackwish and @FrozenGene, given that this somewhat mismatches what you had in mind, please comment on whether it makes sense.

I want to say that it is not necessarily true that asymmetric quantization is needed to keep accuracy in check. As a matter of fact, if we think about it, asymmetric quantization will give you at most 1 bit of additional accuracy gain, and usually that is not significant. Symmetric will give you accuracy that is just as good. Given the implementation efficiency difference, we could even implement a 16-bit integer input version instead (which gives much better accuracy) that could be as good as the 8-bit input version.

There are also other sources of accuracy differences, such as per-channel scaling vs global scaling, that affect accuracy more than asymmetry.

So the main reason to support asymmetry is not necessarily accuracy, but mainly compatibility concerns.

Note: asymmetric quantization can be represented directly by low-level operators like subtract, multiply, and cast.

I understand your point. At the same time, it would be bad if we just stick to symmetric quantization and turn away models that have been quantized using the asymmetric method.

Let’s try to ensure that we can support both if need be. We can certainly start with supporting symmetric quantization to set up and clean up the whole flow, while ensuring that we don’t close the path for asymmetric.
