Why can't vectorize codegen vfma.f32 instructions on armv7?

I ran opt_gemm.py, building for a Cortex-A7 platform (armv7a), using:
target='llvm -device=arm_cpu -model=xxx -target=armv7a-linux-eabi -mattr=+neon,+thumb2 -mfloat-abi=soft'

I also tried setting the target triple to arm-linux-gnueabi, the same as its gcc -v reports. The LLVM version is 8.0.

The performance is poor, so I exported the shared library and ran objdump on it. There are no vfma.f32 instructions at all, only less efficient sequences of vmul and vadd.

In contrast, I wrote a matmul in C and compiled it with the cross gcc at -O3; in that objdump output I can see vfma.f32.
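For reference, the C baseline is just a naive kernel like the sketch below (not my exact test code; the cross-compile command in the comment is an assumption, my real toolchain and flags may differ):

// Naive single-precision matmul. With -O3, gcc may contract
// a*b + acc into a fused multiply-add, which is where vfma.f32 shows up.
// Assumed build command, e.g.:
//   arm-linux-gnueabihf-gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -c matmul.c
void matmul(int n, const float *a, const float *b, float *c) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < n; ++k)
        acc += a[i * n + k] * b[k * n + j];
      c[i * n + j] = acc;
    }
}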

So what's wrong?

@ramana-arm
@FrozenGene
@janimesh
@apivovarov

You should add -mcpu=cortex-a53 (set this to your ARM CPU architecture) to the target; then you will get vfma.f32.
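For example, taking the target string from your first post and just appending -mcpu (the other options are kept as you had them; replace cortex-a53 with your own core):

target='llvm -device=arm_cpu -model=xxx -target=armv7a-linux-eabi -mattr=+neon,+thumb2 -mfloat-abi=soft -mcpu=cortex-a53'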

But actually it's a Cortex-A7 :sweat:
Anyway, I'll try it.

What happens if you add -mattr=+neon,vfpv4,thumb2?

Ramana

Please don’t do that - you’ll probably end up getting instructions that aren’t supported on Cortex-A7.

I know. I added the comment "(set to your ARM CPU architecture)"; A53 is just an example. :grin:

I have already tried the vfp* features; the TVM error message implied they are not supported, but I can see vfp* in cpuinfo.

Update:
I used -mcpu=cortex-a7 combined with +vfp4; it builds OK now, but it still doesn't generate vfma.f32 instructions.

Yes, it generates vfma.f32 with -mcpu=cortex-a53, but it doesn't with -mcpu=cortex-a7.
@ramana-arm

TVM doesn't use the fma intrinsic; it uses fmuladd:

%165 = tail call <32 x float> @llvm.fmuladd.v32f32(<32 x float> %161, <32 x float> %152, <32 x float> %164)

LLVM will then check:

/// Return true if an FMA operation is faster than a pair of fmul and fadd
/// instructions. fmuladd intrinsics will be expanded to FMAs when this method
/// returns true, otherwise fmuladd is expanded to fmul + fadd.
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override;

It seems LLVM decides that FMA on the A7 is not faster than fmul + fadd.

Ok, that’s completely reasonable with -mcpu=cortex-a7 and LLVM choosing that.

BTW, when you say the performance is poor, what are you comparing against?

On cortex-a73 and cortex-a7 we compared TVM GEMM performance with a hand-optimized assembly GEMM.

On cortex-a73 TVM always gets close to the hand-optimized GEMM, while on cortex-a7 it is far slower.

Of course, you might suspect that the cortex-a7 hand-optimized version is simply better, but I can also see that TVM's speedup over a non-optimized (gcc -O3 only) C matmul is smaller on cortex-a7 than on other ARM architectures.

@snowolfhawk If you really think FMA is the reason, I think you could replace the LLVM intrinsic with fma.
change

TVM_REGISTER_GLOBAL("tvm.intrin.rule.llvm.fma")
.set_body(DispatchLLVMPureIntrin<::llvm::Intrinsic::fmuladd, 1>); 

to
::llvm::Intrinsic::fma
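The edited registration would then read roughly like this (a sketch: same dispatch helper, only the intrinsic id changed):

TVM_REGISTER_GLOBAL("tvm.intrin.rule.llvm.fma")
.set_body(DispatchLLVMPureIntrin<::llvm::Intrinsic::fma, 1>);

Unlike fmuladd, the llvm.fma intrinsic is always lowered to a fused operation, so it bypasses the isFMAFasterThanFMulAndFAdd check.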

I have tested this on cortex-a7; it generates vfma.f32 now.

Thank you! I will try it.
Maybe it will get better performance on cortex-a7; both the cross gcc -O3 build and the hand-optimized GEMM use vfma and get good performance.

Referring to llvm-8.0.0.src/lib/Target/ARM/ARMISelLowering.h as in your post: why does LLVM set this to false for all 32-bit ARM platforms?

I remember someone once told me that on some low-end ARM platforms the vfma instruction isn't a truly fused instruction; the CPU actually splits it into vmul and vadd when executing it.

@FrozenGene
@ramana-arm

/// isFMAFasterThanFMulAndFAdd - Return true if an FMA operation is faster
/// than a pair of fmul and fadd instructions. fmuladd intrinsics will be
/// expanded to FMAs when this method returns true, otherwise fmuladd is
/// expanded to fmul + fadd.
/// 
/// ARM supports both fused and unfused multiply-add operations; we already
/// lower a pair of fmul and fadd to the latter so it's not clear that there
/// would be a gain or that the gain would be worthwhile enough to risk
/// correctness bugs.
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override { return false; }

Let me quote the LLVM commit message:

Explicitly define ARMISelLowering::isFMAFasterThanFMulAndFAdd. No functionality change.

Currently ARM is the only backend that supports FMA instructions (for at least some subtargets) but does not implement this virtual, so FMAs are never generated except from explicit fma intrinsic calls. Apparently this is due to the fact that it supports both fused (one rounding step) and unfused (two rounding step) multiply + add instructions. This patch clarifies that this is the case without changing behavior by implementing the virtual function to simply return false, as the default TargetLoweringBase version does.

It is possible that some cpus perform the fused version faster than the unfused version and vice-versa, so the function implementation should be revisited if hard data is found.

I got about a 25% performance boost on cortex-a7 by enabling FMA in the GEMM case.

Did you change the intrinsic from fmuladd to fma as I said before?

No, I just modified LLVM and recompiled it; the intrinsic approach can't cover all cases.
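For anyone else who wants to do the same, the change amounts to giving the function quoted above a real body instead of return false. A minimal sketch of the idea, assuming you only let it fire when the subtarget actually has VFPv4 (illustrative only, not the exact patch I applied):

// ARMISelLowering.h (sketch): let fmuladd lower to VFMA when the subtarget
// has VFPv4, i.e. when a fused multiply-add instruction really exists.
bool isFMAFasterThanFMulAndFAdd(EVT VT) const override {
  if (!Subtarget->hasVFP4())
    return false;
  switch (VT.getSimpleVT().SimpleTy) {
  case MVT::f32:
  case MVT::f64:
  case MVT::v2f32:
  case MVT::v4f32:
    return true;
  default:
    return false;
  }
}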

In my view that's an LLVM bug and needs to be fixed there. It would be good if you could submit that upstream, especially with the reference to the LLVM commit message about revisiting the default on a per-CPU basis.

Ramana
