[RFC] Using arm intrinsics to implement fixed point multiplication in TVM

The fmpq intrinsic sounds good; it would be good to also look up precedent for the intrinsic name and discuss a few naming alternatives.

Unfortunately I wasn’t able to find many quantized intrinsics around (most of them are for floating point).

Anyway, for the intrinsic I was thinking of something like:

  • mulq
  • mul_q
  • qmul
  • q_mul

I would suggest giving the TOPI operator and the Relay operator the same name, which could be picked from:

  • fixed_point_multiply (as before)
  • fixed_point_multiply_and_scale
  • fixed_point_rescale

I am not extremely good with names, so if you come up with better ideas, I am all ears :slight_smile:

cc: @tqchen, @kparzysz, @anijain2305

Hi all,

I had a go at it, but there is a small issue with accuracy.

The way it was before, everything was done in int64, with the shift and the cast happening at the very end.

Now the qmul intrinsic introduces a shift (losing some precision), a cast to int32, and then I further shift in order to scale (+ round up).

In floating point, this is the equivalent of doing round(round(x*y)/scale) instead of round(x*y/scale).

For some edge cases, e.g., 23/16, this produces the wrong value:

  • round(round(23*0.5)/8) = 2
  • round(23*0.5/8) = 1
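To see the discrepancy concretely, here is a minimal plain-Python sketch (the round_half_up helper is mine, added because Python’s built-in round uses banker’s rounding):

    import math

    def round_half_up(v):
        # Round to nearest, ties rounded up (the behavior of the +nudge trick used later).
        return math.floor(v + 0.5)

    x, y, scale = 23, 0.5, 8

    single = round_half_up(x * y / scale)                 # round(1.4375) = 1
    double = round_half_up(round_half_up(x * y) / scale)  # round(round(11.5)/8) = round(1.5) = 2

    print(single, double)  # 1 2 -> off by one on this edge case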

This is exactly the rounding error the ARM intrinsics would introduce, and it should match TFLite. So I think it should be fine, although this is something we have to decide (because I would need to add a ±1 tolerance to the tests).

The alternatives to this are:

  • going back to the previous design (where everything was done within the intrinsic)
  • moving everything into TOPI (without adding the intrinsic).

What do you think? @tqchen @anijain2305 @kparzysz

Hi @tqchen, @kparzysz, @anijain2305, any more thoughts on this?

Thanks

I think it is fine.


As long as the behavior is well documented, I agree that it is fine.


Sorry for the late reply. I think it is fine too, with some minor suggestions.

I would encourage ensuring that things work well on non-ARM platforms as well. We want a Relay operator to produce the same output whether it is compiled/executed for an ARM or a non-ARM machine. Secondly, we want to ensure that this does not severely degrade performance on a non-ARM machine.

The reason I ask is that we are doing two roundings here, and we are working backwards from the ARM instructions’ standpoint when writing the TVM compute (while the traditional way is to first write the TVM compute and then find suitable instructions). A non-ARM machine might not have those instructions, and two roundings can degrade performance (hopefully minimally).

Therefore, I ask that we ensure good performance on non-ARM machines while maintaining accuracy. Similar to TFLite, there has been some x86 work to match MKLDNN performance, and we should ensure that the rough numbers remain the same.

So, I did a bit of investigation, and @anijain2305, I think you are right.

In my (general) TOPI operator, I do the following:

        val = x(*indices)
        # Apply the positive part of the shift before the multiplication
        val = tvm.tir.if_then_else(left_shift > 0, val << left_shift, val)
        # Q31 fixed point multiplication (first rounding, inside the intrinsic)
        mulq = tvm.tir.fixed_point_multiply(val, multiplier, 31)
        # Round-to-nearest right shift to apply the scale (second rounding)
        return (mulq + nudge) >> right_shift

My hope (as @kparzysz was suggesting) was that the last shift, being by a constant, would be folded with the shift inside fixed_point_multiply. However, this doesn’t seem to be the case.

Indeed, when I look at the generated assembly (without using the version overloaded with the ARM intrinsic), I see an additional sshr vX.4s, vX.4s, #const instruction, which is exactly the last rounding shift I am doing.

If I remove the last shift and only return mulq, the sshr instruction disappears. I think LLVM can fold two consecutive shifts, but not a shift-cast-add-shift sequence. Even if I remove the cast, there is always an add in the middle (which I don’t think I can remove) that makes the shift fusion impossible.

At this point, I have some doubts that this is the right approach. Overall, we would end up with something slightly less accurate and slightly slower (in the general case).

While I understand the point of making an intrinsic as general as possible, in this case I think specialization should be favored for the sake of performance and accuracy.

What I mean is that instead of introducing the general q_mul intrinsic, we could stick with a more specific intrinsic that does the multiplication and the shift together.

This would be similar to the initial proposal, but I can introduce the q-ness as a further input to the intrinsic, in order to generalize it to all Q-numbers.

The other alternative is still a full implementation in TOPI. The reasons I would prefer the intrinsic are:

  1. We can invoke the intrinsic within a compute function (as @kparzysz was mentioning; see the sketch after this list)
  2. It is way easier to overload an intrinsic. To overload in TOPI I would need to add a compute strategy specialization for the TOPI operator, which seems a bit too much
  3. I think multiplying an integer x by m*2^-n (where m is a Q-number) is still general enough to deserve its own TIR intrinsic.
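For reference, this is roughly what point 1 looks like in practice, reusing the call signature from my snippet above (just a sketch, with a made-up Q31 multiplier value):

    import tvm
    from tvm import te

    x = te.placeholder((64,), dtype="int32", name="x")

    # The intrinsic can be called directly inside the compute lambda
    out = te.compute(
        x.shape,
        lambda i: tvm.tir.fixed_point_multiply(x[i], 1395864320, 31),
        name="out",
    )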

@tqchen, @anijain2305, @kparzysz what do you think?

Could you show where nudge is coming from?

Hi @kparzysz,

nudge is defined as a constant:

nudge = (1 << (right_shift-1))

I use it to round to nearest (right_shift is -s in fpm(x,m,s)). Am I missing something? I wasn’t able to make the tests pass without it.
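As a tiny plain-Python illustration (my own numbers), adding nudge before the arithmetic right shift turns the truncating division by 2^right_shift into a round-to-nearest one:

    right_shift = 3
    nudge = 1 << (right_shift - 1)       # 4, i.e. half of 2**right_shift

    print(13 >> right_shift)             # 1 (13/8 = 1.625, truncated)
    print((13 + nudge) >> right_shift)   # 2 (round(1.625) = 2)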

A possible solution would be to define an intrinsic like (naming TBD):

fpmq(x,y,n,s)

In this way we can obtain q_mul by setting s to zero, but we wouldn’t incur any of the previous pitfalls.
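To make the semantics explicit, here is a plain-Python reference model of what I have in mind (my own sketch, not the final TIR lowering): the multiplication, the Q(n) normalization, and the scaling shift s all happen on the wide product, with a single rounding at the end.

    def fpmq_ref(x, y, n, s):
        # round(x * y * 2**(-n) * 2**s) with one rounding only.
        # x: integer input, y: Q(n) multiplier, s: signed scale shift
        # (s = 0 reduces this to plain q_mul).
        total_shift = n - s              # net right shift
        prod = x * y                     # exact wide product (int64 in the lowering)
        if total_shift <= 0:
            return prod << -total_shift  # pure left shift, no rounding needed
        nudge = 1 << (total_shift - 1)   # round to nearest, ties up
        return (prod + nudge) >> total_shift

    # The earlier edge case: x = 23, y = 0.5 as a Q31 number, scale = 8 (s = -3)
    print(fpmq_ref(23, 1 << 30, 31, -3))  # 1, the single-rounding answer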

I think that would make sense.

OK, I amended the RFC. Please let me know what you guys think.

Few things to mention:

  • fixed_point_multiply is still the name of the Relay (and TOPI) op (which uses the intrinsic under the hood)
  • qmuls is the name of the new intrinsic

Please also let me know if the naming is fine or if you would prefer different names.