FP16 is slower than FP32 in TVM?

I have trained ResNet models in both FP16 and FP32 and tested them in TVM with the same test code. The result is that FP32 is faster than FP16. May I ask why?

@tqchen Do you know who can help me? Thank you.

If you are using a CPU, FP16 will be slower than FP32, because many CPUs don't have a native FP16 type and have to call a function to emulate each FP16 calculation.
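
To see this concretely, here is a minimal sketch (mine, not from this thread) that times the same vector add in FP32 and FP16 on a plain `llvm` CPU target. It uses the TE `create_schedule` API of that era, so names may differ in newer TVM releases; on CPUs without native FP16 instructions, each half operation is legalized through float conversions, which is why FP16 usually comes out slower.

```python
import numpy as np
import tvm
from tvm import te

def time_add(dtype, n=1 << 20):
    # Build an elementwise add kernel in the given dtype for a generic CPU target.
    A = te.placeholder((n,), dtype=dtype, name="A")
    B = te.placeholder((n,), dtype=dtype, name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    f = tvm.build(te.create_schedule(C.op), [A, B, C], target="llvm")

    dev = tvm.cpu(0)
    a = tvm.nd.array(np.random.rand(n).astype(dtype), dev)
    b = tvm.nd.array(np.random.rand(n).astype(dtype), dev)
    c = tvm.nd.array(np.zeros(n, dtype=dtype), dev)
    # Average runtime over 100 calls.
    return f.time_evaluator(f.entry_name, dev, number=100)(a, b, c).mean

for dt in ["float32", "float16"]:
    print(dt, time_add(dt))
```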

Thanks for your reply! I tested it on a TX2 GPU. I wonder whether TVM simply has no optimization for FP16, which is why FP16 is slower than FP32.

I think so, yes; we lack FP16 support in CUDA. This PR seems to be working on it: https://github.com/dmlc/tvm/pull/3811/

I haven't heard of any official plan to support it either. I think we should have a plan to flesh out our FP16 support (codegen, correctness, optimization) on the GPU backends (CUDA, OpenCL, Vulkan, OpenGL, Metal).
cc: @tqchen

In fact, I've added an add operator for FP16 in the same place as the PR you mentioned, to make the build work (it couldn't before). There is another question: I am not sure whether TVM has optimized FP32 computation. If so, do you know where the specific implementation is?
Code added:

```cpp
decl_stream << "__device__ half operator+(const volatile __half &a, const volatile __half &b)\n"
               "{\n  return __hadd(a, b);\n}\n";
```
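
As a side note, one way to confirm the overload actually lands in the generated kernel is to dump the CUDA source of the compiled module and look for it. This assumes `lib` is the module returned by `tvm.build(..., target="cuda")`:

```python
# Dump the CUDA C source of the device module imported into the host module.
cuda_src = lib.imported_modules[0].get_source()
print("operator+" in cuda_src)
```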

The current schedules on CUDA only optimize for FP32 and INT8. FP16 was not considered when they were written, so it makes sense that it is slow. You could try benchmarking a single convolution layer and modifying the FP32 schedule to get better FP16 performance.
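
For reference, a minimal sketch of what such a single-layer benchmark could look like via Relay. The layer shape here is made up for illustration, and the API names follow newer TVM releases, so adjust as needed:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def bench_conv(dtype):
    # A single 3x3 conv2d layer in the given dtype, built for CUDA.
    data = relay.var("data", shape=(1, 64, 56, 56), dtype=dtype)
    weight = relay.var("weight", shape=(64, 64, 3, 3), dtype=dtype)
    out = relay.nn.conv2d(data, weight, padding=(1, 1), kernel_size=(3, 3))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda")

    dev = tvm.cuda(0)
    rt = graph_executor.GraphModule(lib["default"](dev))
    rt.set_input("data", np.random.rand(1, 64, 56, 56).astype(dtype))
    rt.set_input("weight", np.random.rand(64, 64, 3, 3).astype(dtype))
    # Mean runtime over 50 runs.
    print(dtype, rt.benchmark(dev, number=50).mean)

bench_conv("float32")
bench_conv("float16")
```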

Thanks a lot, I will try!

I have made some modifications to the code, and now the FP16 result is slightly better than FP32, but it is still short of what I expected. May I ask which FP16 optimizations in TVM are still incomplete? Hopefully that will give me some direction for improvement. Thank you @tqchen

@FrozenGene

Do you have any guidelines for writing an FP16 schedule for CUDA?

I think @FrozenGene is misleading here. TVM doesn't accelerate FP16 on any platform. I have tested an exported .so on an ARMv8.5 device, which obviously has FP16 support, but it is much slower than float32. The generated FP16 .so is also twice as big as the float32 one.

I have given up on TVM at this point. Useless.

See ARM FP16 instrin support in M1 chip - #4 by jwfromm

Try adding -mattr=+fullfp16 to your target string.
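
For example (the triple and `-mcpu` here are just placeholders for an ARMv8.2+ core; adjust them to your device):

```python
import tvm

# Example CPU target with the fullfp16 attribute enabled.
target = tvm.target.Target(
    "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon,+fullfp16"
)
```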

Well, this thread dates back to 2019, so I'm sure some of the information here is out of date.

@denise, @zxybazh, and I are planning to release some benchmark scripts for TVM, in the hope that they will help avoid potential confusion and frustration around using TVM correctly, so stay tuned!
