[RFC] Using arm intrinsics to implement fixed point multiplication in TVM

The fmpq intrinsic sounds good; it would be good to also look up precedent for the intrinsic name and discuss a few naming alternatives.

Unfortunately I wasn’t able to find many quantized intrinsics around (most of them are for floating point).

Anyway, for the intrinsic I was thinking of something like:

  • mulq
  • mul_q
  • qmul
  • q_mul

I would suggest giving the TOPI operator and the Relay operator the same name, which could be picked from:

  • fixed_point_multiply (as before)
  • fixed_point_multiply_and_scale
  • fixed_point_rescale

I am not extremely good with names, so if you come up with better ideas, I am all ears :slight_smile:

cc: @tqchen, @kparzysz, @anijain2305

Hi all,

I had a go at it, but there is a small issue with accuracy.

The way it was before, everything was done in int64, with the shift and the cast happening at the very end.

Now the qmul intrinsic introduces a shift (losing some precision), a cast to int32, and then I further shift in order to scale (+ round up).

In floating point, this is the equivalent of doing round(round(x*y)/scale) instead of round(x*y/scale).

For some edge cases, e.g., 23/16, this produces the wrong value:

  • round(round(23*0.5)/8) = 2
  • round(23*0.5/8) = 1
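To see the discrepancy concretely, here is a minimal plain-Python sketch (the round_half_up helper is mine, added because Python’s built-in round uses banker’s rounding):

    import math

    def round_half_up(v):
        # Round to nearest, ties rounded up (the behavior of the +nudge trick used later).
        return math.floor(v + 0.5)

    x, y, scale = 23, 0.5, 8

    single = round_half_up(x * y / scale)                 # round(1.4375) = 1
    double = round_half_up(round_half_up(x * y) / scale)  # round(round(11.5)/8) = round(1.5) = 2

    print(single, double)  # 1 2 -> off by one on this edge case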

This is exactly the rounding error the ARM intrinsics would introduce, and it should match TFLite. So I think it should be fine, although this is something we have to decide (because I would need to add a ±1 tolerance to the tests).

The alternatives to this are:

  • going back to the previous design (where everything was done within the intrinsic)
  • moving everything into TOPI (without adding the intrinsic).

What do you think? @tqchen @anijain2305 @kparzysz

Hi @tqchen, @kparzysz, @anijain2305, any more thoughts on this?

Thanks

I think it is fine.


As long as the behavior is well documented, I agree that it is fine.


Sorry for the late reply. I think it is fine too, with some minor suggestions.

I would encourage ensuring that things work well on non-ARM platforms as well. We want a Relay operator to produce the same output whether it is compiled/executed for an ARM or a non-ARM machine. Secondly, we want to ensure that this does not severely degrade performance on a non-ARM machine.

The reason I ask is that we are doing two roundings here, and we are working backwards from the ARM instructions’ standpoint when writing the TVM compute (while the traditional way is to first write the TVM compute and then find suitable instructions). A non-ARM machine might not have those instructions, and two roundings can degrade performance (hopefully minimally).

Therefore, I ask that we ensure good performance on non-ARM machines while maintaining accuracy. Similar to TFLite, there has been some x86 work to match MKLDNN performance, and we should ensure that the rough numbers remain the same.

So, I did a bit of investigation, and @anijain2305, I think you are right.

In my (general) TOPI operator, I do the following:

        val = x(*indices)
        # Apply the positive part of the shift before the multiplication
        val = tvm.tir.if_then_else(left_shift > 0, val << left_shift, val)
        # Q31 fixed point multiplication (first rounding, inside the intrinsic)
        mulq = tvm.tir.fixed_point_multiply(val, multiplier, 31)
        # Round-to-nearest right shift to apply the scale (second rounding)
        return (mulq + nudge) >> right_shift

My hope (as @kparzysz was suggesting) was that the last shift, being by a constant, would be folded with the shift inside fixed_point_multiply. However, this doesn’t seem to be the case.

Indeed, when I look at the generated assembly (without using the version overloaded with the ARM intrinsic), I see an additional sshr vX.4s, vX.4s, #const instruction, which is exactly the last rounding shift I am doing.

If I remove the last shift and only return mulq, the sshr instruction disappears. I think LLVM can fold two consecutive shifts, but not a shift-cast-add-shift sequence. Even if I remove the cast, there is always an add in the middle (which I don’t think I can remove) that makes the shift fusion impossible.

At this point, I have some doubts that this is the right approach. Overall, we would end up with something slightly less accurate and slightly slower (in the general case).

While I understand the point of making an intrinsic as general as possible, in this case I think specialization should be favored for the sake of performance and accuracy.

What I mean is that instead of introducing the general q_mul intrinsic, we could stick with a more specific intrinsic that does the multiplication and the shift together.

This would be similar to the initial proposal, but I can introduce the q-ness as a further input to the intrinsic, in order to generalize it to all Q-numbers.

The other alternative is still a full implementation in TOPI. The reasons I would prefer the intrinsic are:

  1. We can invoke the intrinsic within a compute function (as @kparzysz was mentioning; see the sketch after this list)
  2. It is way easier to overload an intrinsic. To overload in TOPI I would need to add a compute strategy specialization for the TOPI operator, which seems a bit too much
  3. I think multiplying an integer x by m*2^-n (where m is a Q-number) is still general enough to deserve its own TIR intrinsic.
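For reference, this is roughly what point 1 looks like in practice, reusing the call signature from my snippet above (just a sketch, with a made-up Q31 multiplier value):

    import tvm
    from tvm import te

    x = te.placeholder((64,), dtype="int32", name="x")

    # The intrinsic can be called directly inside the compute lambda
    out = te.compute(
        x.shape,
        lambda i: tvm.tir.fixed_point_multiply(x[i], 1395864320, 31),
        name="out",
    )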

@tqchen, @anijain2305, @kparzysz what do you think?

Could you show where nudge is coming from?

Hi @kparzysz,

nudge is defined as a constant:

nudge = (1 << (right_shift-1))

I use it to round to nearest (right_shift is -s in fpm(x,m,s)). Am I missing something? I wasn’t able to make the tests pass without it.
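As a tiny plain-Python illustration (my own numbers), adding nudge before the arithmetic right shift turns the truncating division by 2^right_shift into a round-to-nearest one:

    right_shift = 3
    nudge = 1 << (right_shift - 1)       # 4, i.e. half of 2**right_shift

    print(13 >> right_shift)             # 1 (13/8 = 1.625, truncated)
    print((13 + nudge) >> right_shift)   # 2 (round(1.625) = 2)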

A possible solution would be to define an intrinsic like (naming TBD):

fpmq(x,y,n,s)

In this way we can obtain q_mul by setting s to zero, but we wouldn’t incur any of the previous pitfalls.
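To make the semantics explicit, here is a plain-Python reference model of what I have in mind (my own sketch, not the final TIR lowering): the multiplication, the Q(n) normalization, and the scaling shift s all happen on the wide product, with a single rounding at the end.

    def fpmq_ref(x, y, n, s):
        # round(x * y * 2**(-n) * 2**s) with one rounding only.
        # x: integer input, y: Q(n) multiplier, s: signed scale shift
        # (s = 0 reduces this to plain q_mul).
        total_shift = n - s              # net right shift
        prod = x * y                     # exact wide product (int64 in the lowering)
        if total_shift <= 0:
            return prod << -total_shift  # pure left shift, no rounding needed
        nudge = 1 << (total_shift - 1)   # round to nearest, ties up
        return (prod + nudge) >> total_shift

    # The earlier edge case: x = 23, y = 0.5 as a Q31 number, scale = 8 (s = -3)
    print(fpmq_ref(23, 1 << 30, 31, -3))  # 1, the single-rounding answer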

I think that would make sense.

OK, I amended the RFC. Please let me know what you guys think.

Few things to mention:

  • fixed_point_multiply is still the name of the Relay (and TOPI) op (which uses the intrinsic under the hood)
  • qmuls is the name of the new intrinsic

Please also let me know if the naming is fine or if you would prefer different names.