FP16 is slower than FP32 in TVM?


#1

I trained FP16 and FP32 versions of a ResNet model and ran both in TVM using the same test code. Inference with FP32 is faster than with FP16. Why is that?


#2

@tqchen Do you know who can help me? Thank you.


#3

If you are running on a CPU, FP16 will be slower than FP32, because many CPUs have no native FP16 type and have to call a software routine for each FP16 operation.
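To illustrate the software path described above, here is a minimal sketch in Python using only the standard library. The helper names (to_half, half_add) are hypothetical and are not part of TVM. Python widens to a 64-bit float here, whereas a C runtime would typically widen to FP32, but the shape of the work is the same: unpack, compute in a wider type, round back to 16 bits.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    # struct's "e" format is IEEE 754 half precision (binary16).
    return struct.unpack("<e", struct.pack("<e", x))[0]


def half_add(a_bits: bytes, b_bits: bytes) -> bytes:
    """Add two binary16 values given as 2-byte little-endian strings.

    Hypothetical emulation of what a CPU without native FP16 has to do:
    widen both operands, add in the wider type, then narrow the result.
    """
    a = struct.unpack("<e", a_bits)[0]   # widen operand a
    b = struct.unpack("<e", b_bits)[0]   # widen operand b
    return struct.pack("<e", a + b)      # add, then round back to binary16


if __name__ == "__main__":
    one = struct.pack("<e", 1.0)
    two = struct.pack("<e", 2.0)
    print(struct.unpack("<e", half_add(one, two))[0])
    # 0.1 is not exactly representable in binary16, so it gets rounded.
    print(to_half(0.1))
```

Every addition pays for two widening conversions and one narrowing conversion on top of the add itself, which is why emulated FP16 tends to lose to native FP32 on such CPUs.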


#4

Thanks for your reply! I tested on the GPU of a TX2. I wonder whether TVM simply has no optimization for FP16, which would explain why FP16 is slower than FP32.


#5

I think so, yes: we lack FP16 support in CUDA. This PR seems to be working on it: https://github.com/dmlc/tvm/pull/3811/

I also haven't heard of any official plan to support it. I think we should have a plan to work through our FP16 support (codegen, correctness, optimization) on GPU (cuda, opencl, vulkan, opengl, metal).
cc: @tqchen


#6

In fact, I added an addition operator for FP16 in the same place as the PR you mentioned, so that the build works (it failed before). I have another question: I am not sure whether TVM has optimized the FP32 computation. If it has, do you know where the implementation is?
Code added:
decl_stream << "__device__ half operator+(const volatile __half &a, const volatile __half &b)\n"
               "{\n  return __hadd(a, b);\n}\n";


#7

The current CUDA schedules are only optimized for FP32 and INT8. They were not designed with FP16 in mind, so it makes sense that FP16 is slow. You could benchmark a single convolution layer and try modifying the FP32 schedule to get better FP16 performance.
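A single-layer comparison like the one suggested above could be timed with a small harness along these lines. This is only a sketch using the Python standard library: median_runtime, speedup, run_conv_fp32 and run_conv_fp16 are hypothetical names, and the two run_conv_* functions are CPU placeholders standing in for the compiled FP32 and FP16 convolution kernels. On a GPU, each callable must block until the kernel has finished, otherwise only the launch overhead is measured.

```python
import statistics
import time


def median_runtime(fn, repeat=5, number=10):
    """Return the median seconds per call over `repeat` batches of `number` calls."""
    fn()  # warm-up call, so one-time setup cost is not measured
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            fn()
        samples.append((time.perf_counter() - start) / number)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Hypothetical placeholders: replace with calls into the compiled
# FP32 and FP16 versions of the same convolution layer.
def run_conv_fp32():
    sum(i * i for i in range(2000))


def run_conv_fp16():
    sum(i * i for i in range(2000))


if __name__ == "__main__":
    t32 = median_runtime(run_conv_fp32)
    t16 = median_runtime(run_conv_fp16)
    print(f"fp32: {t32:.6e} s  fp16: {t16:.6e} s  speedup: {speedup(t32, t16):.2f}x")
```

Using the median over several batches keeps a single slow run from skewing the comparison, which matters when the two timings are close.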


#8

Thanks a lot, I will try!


#9

I have made some modifications to the code, and FP16 is now slightly faster than FP32, but still short of the speedup I expected. Which FP16 optimizations in TVM have not been completed yet? Knowing that would give me some direction for improvement. Thank you @tqchen